LKML Archive on lore.kernel.org
* [PATCH v4 0/5] x86: Rework IST interrupts
@ 2014-11-21 21:26 Andy Lutomirski
  2014-11-21 21:26 ` [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME Andy Lutomirski
                   ` (4 more replies)
  0 siblings, 5 replies; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:26 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

Following Linus' feedback, this series has grown.  When I say "IST
interrupt" below, I'm excluding NMIs.  They are their own thing.  This
series does *not* change NMI handling.

This adds new functions ist_enter and ist_exit to handle entry to and
exit from ist handlers.  This is, I think, a bugfix for RCU handling,
and I've cc'd some RCU people on that patch.  It will also catch bugs
involving scheduling on the IST stack.

This series then changes the entry code to switch to the normal stack if
an IST interrupt came from userspace and wasn't a double-fault.  This
simplifies the exit logic.

Building on that, I add new functions ist_begin_non_atomic and
ist_end_non_atomic.  They nest inside ist_enter and ist_exit, and they
allow scheduling if the right conditions are met and they BUG if not.
They're meant to be used in do_machine_check to handle userspace memory
failures.

NB: Tony has seen odd behavior when stress-testing injected machine
checks with earlier versions of this series applied.  I suspect that
it's a bug in something else, possibly his BIOS.  Bugs in this series
shouldn't be ruled out, though.

Changes from v3:
 - Reworked everything except the asm.

Changes from v2:
 - double_fault is no longer affected.
 - Other changes, but those are obsolete now.

Changes from RFC/v1: I forgot.  Sorry.

Andy Lutomirski (5):
  uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  x86, traps: Track entry into and exit from IST context
  x86, entry: Switch stacks on a paranoid entry from userspace
  x86: Clean up current_stack_pointer
  x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic

 Documentation/x86/entry_64.txt         |  18 ++++--
 Documentation/x86/x86_64/kernel-stacks |   8 ++-
 arch/x86/include/asm/thread_info.h     |  13 +++-
 arch/x86/include/asm/traps.h           |   6 ++
 arch/x86/kernel/cpu/mcheck/mce.c       |   5 ++
 arch/x86/kernel/cpu/mcheck/p5.c        |   6 ++
 arch/x86/kernel/cpu/mcheck/winchip.c   |   5 ++
 arch/x86/kernel/entry_64.S             |  86 ++++++++++++++------------
 arch/x86/kernel/irq_32.c               |  13 +---
 arch/x86/kernel/traps.c                | 110 +++++++++++++++++++++++++--------
 kernel/events/uprobes.c                |   1 -
 11 files changed, 183 insertions(+), 88 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  2014-11-21 21:26 [PATCH v4 0/5] x86: Rework IST interrupts Andy Lutomirski
@ 2014-11-21 21:26 ` Andy Lutomirski
  2014-11-22 16:55   ` Borislav Petkov
  2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:26 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set
but not on non-paranoid returns.  I suspect that this is a mistake
and that the code only works because int3 is paranoid.

Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
workaround for the x86 bug.  With that bug fixed, we can remove
_TIF_NOTIFY_RESUME from the uprobes code.

Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/thread_info.h | 2 +-
 kernel/events/uprobes.c            | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 854053889d4d..547e344a6dc6 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -141,7 +141,7 @@ struct thread_info {
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
 	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
-	 _TIF_USER_RETURN_NOTIFY)
+	 _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW							\
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a2c646..ed8f2cde34c5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1640,7 +1640,6 @@ bool uprobe_deny_signal(void)
 		if (__fatal_signal_pending(t) || arch_uprobe_xol_was_trapped(t)) {
 			utask->state = UTASK_SSTEP_TRAPPED;
 			set_tsk_thread_flag(t, TIF_UPROBE);
-			set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
 		}
 	}
 
-- 
1.9.3



* [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 21:26 [PATCH v4 0/5] x86: Rework IST interrupts Andy Lutomirski
  2014-11-21 21:26 ` [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME Andy Lutomirski
@ 2014-11-21 21:26 ` Andy Lutomirski
  2014-11-21 21:32   ` Andy Lutomirski
                     ` (3 more replies)
  2014-11-21 21:26 ` [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry " Andy Lutomirski
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:26 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski, Paul E. McKenney, Josh Triplett,
	Frédéric Weisbecker

We currently pretend that IST context is like standard exception
context, but this is incorrect.  IST entries from userspace are like
standard exceptions except that they use per-cpu stacks, so they are
atomic.  IST entries from kernel space are like NMIs from RCU's
perspective -- they are not quiescent states even if they
interrupted the kernel during a quiescent state.

Add and use ist_enter and ist_exit to track IST context.  Even
though x86_32 has no IST stacks, we track these interrupts the same
way.

This fixes two issues:

 - Scheduling from an IST interrupt handler will now warn.  It would
   previously appear to work as long as we got lucky and nothing
   overwrote the stack frame.  (I don't know of any bugs in this
   that would trigger the warning, but it's good to be on the safe
   side.)

 - RCU handling in IST context was dangerous.  As far as I know,
   only machine checks were likely to trigger this, but it's good to
   be on the safe side.

Note that the machine check handlers appear to have been missing
any context tracking at all before this patch.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/traps.h         |  4 +++
 arch/x86/kernel/cpu/mcheck/mce.c     |  5 ++++
 arch/x86/kernel/cpu/mcheck/p5.c      |  6 +++++
 arch/x86/kernel/cpu/mcheck/winchip.c |  5 ++++
 arch/x86/kernel/traps.c              | 49 ++++++++++++++++++++++++++++++------
 5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index bc8352e7010a..eb16a61bfd06 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_TRAPS_H
 #define _ASM_X86_TRAPS_H
 
+#include <linux/context_tracking_state.h>
 #include <linux/kprobes.h>
 
 #include <asm/debugreg.h>
@@ -109,6 +110,9 @@ asmlinkage void smp_thermal_interrupt(void);
 asmlinkage void mce_threshold_interrupt(void);
 #endif
 
+extern enum ctx_state ist_enter(struct pt_regs *regs);
+extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+
 /* Interrupts/Exceptions */
 enum {
 	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668cebfd..b72509d77337 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -43,6 +43,7 @@
 #include <linux/export.h>
 
 #include <asm/processor.h>
+#include <asm/traps.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
 
@@ -1016,6 +1017,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 {
 	struct mca_config *cfg = &mca_cfg;
 	struct mce m, *final;
+	enum ctx_state prev_state;
 	int i;
 	int worst = 0;
 	int severity;
@@ -1038,6 +1040,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	char *msg = "Unknown";
 
+	prev_state = ist_enter(regs);
+
 	this_cpu_inc(mce_exception_count);
 
 	if (!cfg->banks)
@@ -1168,6 +1172,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
 	sync_core();
+	ist_exit(regs, prev_state);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
index a3042989398c..ec2663a708e4 100644
--- a/arch/x86/kernel/cpu/mcheck/p5.c
+++ b/arch/x86/kernel/cpu/mcheck/p5.c
@@ -8,6 +8,7 @@
 #include <linux/smp.h>
 
 #include <asm/processor.h>
+#include <asm/traps.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
 
@@ -17,8 +18,11 @@ int mce_p5_enabled __read_mostly;
 /* Machine check handler for Pentium class Intel CPUs: */
 static void pentium_machine_check(struct pt_regs *regs, long error_code)
 {
+	enum ctx_state prev_state;
 	u32 loaddr, hi, lotype;
 
+	prev_state = ist_enter(regs);
+
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
 	rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
 
@@ -33,6 +37,8 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
 	}
 
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
+	ist_exit(regs, prev_state);
 }
 
 /* Set up machine check reporting for processors with Intel style MCE: */
diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
index 7dc5564d0cdf..bd5d46a32210 100644
--- a/arch/x86/kernel/cpu/mcheck/winchip.c
+++ b/arch/x86/kernel/cpu/mcheck/winchip.c
@@ -7,14 +7,19 @@
 #include <linux/types.h>
 
 #include <asm/processor.h>
+#include <asm/traps.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
 
 /* Machine check handler for WinChip C6: */
 static void winchip_machine_check(struct pt_regs *regs, long error_code)
 {
+	enum ctx_state prev_state = ist_enter(regs);
+
 	printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
+	ist_exit(regs, prev_state);
 }
 
 /* Set up machine check reporting on the Winchip C6 series */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 0d0e922fafc1..f5c4b8813774 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
 	preempt_count_dec();
 }
 
+enum ctx_state ist_enter(struct pt_regs *regs)
+{
+	/*
+	 * We are atomic because we're on the IST stack (or we're on x86_32,
+	 * in which case we still shouldn't schedule).
+	 */
+	preempt_count_add(HARDIRQ_OFFSET);
+
+	if (user_mode_vm(regs)) {
+		/* Other than that, we're just an exception. */
+		return exception_enter();
+	} else {
+		/*
+		 * We might have interrupted pretty much anything.  In
+		 * fact, if we're a machine check, we can even interrupt
+		 * NMI processing.  We don't want in_nmi() to return true,
+		 * but we need to notify RCU.
+		 */
+		rcu_nmi_enter();
+		return IN_KERNEL;  /* the value is irrelevant. */
+	}
+}
+
+void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
+{
+	preempt_count_sub(HARDIRQ_OFFSET);
+
+	if (user_mode_vm(regs))
+		return exception_exit(prev_state);
+	else
+		rcu_nmi_exit();
+}
+
 static nokprobe_inline int
 do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
 		  struct pt_regs *regs,	long error_code)
@@ -244,14 +277,14 @@ dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
 {
 	enum ctx_state prev_state;
 
-	prev_state = exception_enter();
+	prev_state = ist_enter(regs);
 	if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
 		       X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
 		preempt_conditional_sti(regs);
 		do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs, error_code, NULL);
 		preempt_conditional_cli(regs);
 	}
-	exception_exit(prev_state);
+	ist_exit(regs, prev_state);
 }
 
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
@@ -259,8 +292,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	static const char str[] = "double fault";
 	struct task_struct *tsk = current;
 
-	exception_enter();
-	/* Return not checked because double check cannot be ignored */
+	ist_enter(regs);
+	/* Return not checked because double fault cannot be ignored */
 	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
 	tsk->thread.error_code = error_code;
@@ -343,7 +376,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	if (poke_int3_handler(regs))
 		return;
 
-	prev_state = exception_enter();
+	prev_state = ist_enter(regs);
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
 				SIGTRAP) == NOTIFY_STOP)
@@ -369,7 +402,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	preempt_conditional_cli(regs);
 	debug_stack_usage_dec();
 exit:
-	exception_exit(prev_state);
+	ist_exit(regs, prev_state);
 }
 NOKPROBE_SYMBOL(do_int3);
 
@@ -433,7 +466,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	unsigned long dr6;
 	int si_code;
 
-	prev_state = exception_enter();
+	prev_state = ist_enter(regs);
 
 	get_debugreg(dr6, 6);
 
@@ -508,7 +541,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	exception_exit(prev_state);
+	ist_exit(regs, prev_state);
 }
 NOKPROBE_SYMBOL(do_debug);
 
-- 
1.9.3



* [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry from userspace
  2014-11-21 21:26 [PATCH v4 0/5] x86: Rework IST interrupts Andy Lutomirski
  2014-11-21 21:26 ` [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME Andy Lutomirski
  2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
@ 2014-11-21 21:26 ` " Andy Lutomirski
  2014-11-24 15:55   ` Borislav Petkov
  2014-11-21 21:26 ` [PATCH v4 4/5] x86: Clean up current_stack_pointer Andy Lutomirski
  2014-11-21 21:26 ` [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic Andy Lutomirski
  4 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:26 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

This causes all non-NMI, non-double-fault kernel entries from
userspace to run on the normal kernel stack.  Double-fault is
exempt to minimize confusion if we double-fault directly from
userspace due to a bad kernel stack.

This is, surprisingly, simpler and shorter than the current code.  It
removes the IMO rather frightening paranoid_userspace path, and it
makes sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel
stack that we switch to is empty.

This will also enable us to create non-atomic sections within
machine checks from userspace, which will simplify memory failure
handling.  It will also allow the upcoming fsgsbase code to be
simplified, because it doesn't need to worry about usergs when
scheduling in paranoid_exit, as that code no longer exists.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 Documentation/x86/entry_64.txt         | 18 ++++---
 Documentation/x86/x86_64/kernel-stacks |  8 ++--
 arch/x86/kernel/entry_64.S             | 86 ++++++++++++++++++----------------
 arch/x86/kernel/traps.c                | 23 ++-------
 4 files changed, 67 insertions(+), 68 deletions(-)

diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
index bc7226ef5055..8e53217e6d40 100644
--- a/Documentation/x86/entry_64.txt
+++ b/Documentation/x86/entry_64.txt
@@ -75,9 +75,6 @@ The expensive (paranoid) way is to read back the MSR_GS_BASE value
 	xorl %ebx,%ebx
 1:	ret
 
-and the whole paranoid non-paranoid macro complexity is about whether
-to suffer that RDMSR cost.
-
 If we are at an interrupt or user-trap/gate-alike boundary then we can
 use the faster check: the stack will be a reliable indicator of
 whether SWAPGS was already done: if we see that we are a secondary
@@ -90,6 +87,15 @@ which might have triggered right after a normal entry wrote CS to the
 stack but before we executed SWAPGS, then the only safe way to check
 for GS is the slower method: the RDMSR.
 
-So we try only to mark those entry methods 'paranoid' that absolutely
-need the more expensive check for the GS base - and we generate all
-'normal' entry points with the regular (faster) entry macros.
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly.  This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
index a01eec5d1d0b..e3c8a49d1a2f 100644
--- a/Documentation/x86/x86_64/kernel-stacks
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -40,9 +40,11 @@ An IST is selected by a non-zero value in the IST field of an
 interrupt-gate descriptor.  When an interrupt occurs and the hardware
 loads such a descriptor, the hardware automatically sets the new stack
 pointer based on the IST value, then invokes the interrupt handler.  If
-software wants to allow nested IST interrupts then the handler must
-adjust the IST values on entry to and exit from the interrupt handler.
-(This is occasionally done, e.g. for debug exceptions.)
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack.  If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler.  (This is occasionally
+done, e.g. for debug exceptions.)
 
 Events with different IST codes (i.e. with different stacks) can be
 nested.  For example, a debug interrupt can safely be interrupted by an
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb03fb3..de24b2eac6b2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1064,6 +1064,11 @@ ENTRY(\sym)
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 
 	.if \paranoid
+	.if \paranoid == 1
+	CFI_REMEMBER_STATE
+	testl $3, CS(%rsp)		/* If coming from userspace, switch */
+	jnz 1f				/* stacks. */
+	.endif
 	call save_paranoid
 	.else
 	call error_entry
@@ -1104,6 +1109,36 @@ ENTRY(\sym)
 	jmp error_exit			/* %ebx: no swapgs flag */
 	.endif
 
+	.if \paranoid == 1
+	CFI_RESTORE_STATE
+	/*
+	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * as a normal entry.  This means that paranoid handlers
+	 * run in real process context if user_mode(regs).
+	 */
+1:
+	call error_entry
+
+	DEFAULT_FRAME 0
+
+	movq %rsp,%rdi			/* pt_regs pointer */
+	call sync_regs
+	movq %rax,%rsp			/* switch stack */
+
+	movq %rsp,%rdi			/* pt_regs pointer */
+
+	.if \has_error_code
+	movq ORIG_RAX(%rsp),%rsi	/* get error code */
+	movq $-1,ORIG_RAX(%rsp)		/* no syscall to restart */
+	.else
+	xorl %esi,%esi			/* no error code */
+	.endif
+
+	call \do_sym
+
+	jmp error_exit			/* %ebx: no swapgs flag */
+	.endif
+
 	CFI_ENDPROC
 END(\sym)
 .endm
@@ -1124,7 +1159,7 @@ idtentry overflow do_overflow has_error_code=0
 idtentry bounds do_bounds has_error_code=0
 idtentry invalid_op do_invalid_op has_error_code=0
 idtentry device_not_available do_device_not_available has_error_code=0
-idtentry double_fault __do_double_fault has_error_code=1 paranoid=1
+idtentry double_fault __do_double_fault has_error_code=1 paranoid=2
 idtentry coprocessor_segment_overrun do_coprocessor_segment_overrun has_error_code=0
 idtentry invalid_TSS do_invalid_TSS has_error_code=1
 idtentry segment_not_present do_segment_not_present has_error_code=1
@@ -1305,16 +1340,14 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
 #endif
 
 	/*
-	 * "Paranoid" exit path from exception stack.
-	 * Paranoid because this is used by NMIs and cannot take
-	 * any kernel state for granted.
-	 * We don't do kernel preemption checks here, because only
-	 * NMI should be common and it does not enable IRQs and
-	 * cannot get reschedule ticks.
+	 * "Paranoid" exit path from exception stack.  This is invoked
+	 * only on return from non-NMI IST interrupts that came
+	 * from kernel space.
 	 *
-	 * "trace" is 0 for the NMI handler only, because irq-tracing
-	 * is fundamentally NMI-unsafe. (we cannot change the soft and
-	 * hard flags at once, atomically)
+	 * We may be returning to very strange contexts (e.g. very early
+	 * in syscall entry), so checking for preemption here would
+	 * be complicated.  Fortunately, there's no good reason
+	 * to try to handle preemption here.
 	 */
 
 	/* ebx:	no swapgs flag */
@@ -1324,43 +1357,14 @@ ENTRY(paranoid_exit)
 	TRACE_IRQS_OFF_DEBUG
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz paranoid_restore
-	testl $3,CS(%rsp)
-	jnz   paranoid_userspace
-paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
 	RESTORE_ALL 8
-	jmp irq_return
+	INTERRUPT_RETURN
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
 	RESTORE_ALL 8
-	jmp irq_return
-paranoid_userspace:
-	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz paranoid_swapgs
-	movq %rsp,%rdi			/* &pt_regs */
-	call sync_regs
-	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
-	jnz paranoid_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	xorl %esi,%esi 			/* arg2: oldset */
-	movq %rsp,%rdi 			/* arg1: &pt_regs */
-	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp paranoid_userspace
-paranoid_schedule:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	SCHEDULE_USER
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	TRACE_IRQS_OFF
-	jmp paranoid_userspace
+	INTERRUPT_RETURN
 	CFI_ENDPROC
 END(paranoid_exit)
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f5c4b8813774..6a02760df7b4 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -408,27 +408,14 @@ NOKPROBE_SYMBOL(do_int3);
 
 #ifdef CONFIG_X86_64
 /*
- * Help handler running on IST stack to switch back to user stack
- * for scheduling or signal handling. The actual stack switch is done in
- * entry.S
+ * Help handler running on IST stack to switch off the IST stack if the
+ * interrupted code was in user mode. The actual stack switch is done in
+ * entry_64.S
  */
 asmlinkage __visible struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = eregs;
-	/* Did already sync */
-	if (eregs == (struct pt_regs *)eregs->sp)
-		;
-	/* Exception from user space */
-	else if (user_mode(eregs))
-		regs = task_pt_regs(current);
-	/*
-	 * Exception from kernel and interrupts are enabled. Move to
-	 * kernel process stack.
-	 */
-	else if (eregs->flags & X86_EFLAGS_IF)
-		regs = (struct pt_regs *)(eregs->sp -= sizeof(struct pt_regs));
-	if (eregs != regs)
-		*regs = *eregs;
+	struct pt_regs *regs = task_pt_regs(current);
+	*regs = *eregs;
 	return regs;
 }
 NOKPROBE_SYMBOL(sync_regs);
-- 
1.9.3



* [PATCH v4 4/5] x86: Clean up current_stack_pointer
  2014-11-21 21:26 [PATCH v4 0/5] x86: Rework IST interrupts Andy Lutomirski
                   ` (2 preceding siblings ...)
  2014-11-21 21:26 ` [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry " Andy Lutomirski
@ 2014-11-21 21:26 ` Andy Lutomirski
  2014-11-24 11:39   ` Borislav Petkov
  2014-11-21 21:26 ` [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic Andy Lutomirski
  4 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:26 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

There's no good reason for it to be a macro, and x86_64 will want to
use it, so it should be in a header.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/thread_info.h | 11 +++++++++++
 arch/x86/kernel/irq_32.c           | 13 +++----------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 547e344a6dc6..8b13b0fbda8e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -170,6 +170,17 @@ static inline struct thread_info *current_thread_info(void)
 	return ti;
 }
 
+static inline unsigned long current_stack_pointer(void)
+{
+	unsigned long sp;
+#ifdef CONFIG_X86_64
+	asm("mov %%rsp,%0" : "=g" (sp));
+#else
+	asm("mov %%esp,%0" : "=g" (sp));
+#endif
+	return sp;
+}
+
 #else /* !__ASSEMBLY__ */
 
 /* how to get the thread information struct from ASM */
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 63ce838e5a54..28d28f5eb8f4 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -69,16 +69,9 @@ static void call_on_stack(void *func, void *stack)
 		     : "memory", "cc", "edx", "ecx", "eax");
 }
 
-/* how to get the current stack pointer from C */
-#define current_stack_pointer ({		\
-	unsigned long sp;			\
-	asm("mov %%esp,%0" : "=g" (sp));	\
-	sp;					\
-})
-
 static inline void *current_stack(void)
 {
-	return (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+	return (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
 }
 
 static inline int
@@ -103,7 +96,7 @@ execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq)
 
 	/* Save the next esp at the bottom of the stack */
 	prev_esp = (u32 *)irqstk;
-	*prev_esp = current_stack_pointer;
+	*prev_esp = current_stack_pointer();
 
 	if (unlikely(overflow))
 		call_on_stack(print_stack_overflow, isp);
@@ -156,7 +149,7 @@ void do_softirq_own_stack(void)
 
 	/* Push the previous esp onto the stack */
 	prev_esp = (u32 *)irqstk;
-	*prev_esp = current_stack_pointer;
+	*prev_esp = current_stack_pointer();
 
 	call_on_stack(__do_softirq, isp);
 }
-- 
1.9.3



* [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic
  2014-11-21 21:26 [PATCH v4 0/5] x86: Rework IST interrupts Andy Lutomirski
                   ` (3 preceding siblings ...)
  2014-11-21 21:26 ` [PATCH v4 4/5] x86: Clean up current_stack_pointer Andy Lutomirski
@ 2014-11-21 21:26 ` Andy Lutomirski
  2014-11-24 15:54   ` Borislav Petkov
  4 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:26 UTC (permalink / raw)
  To: Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski

In some IST handlers, if the interrupt came from user mode,
we can safely enable preemption.  Add helpers to do it safely.

This is intended to be used by the memory failure code in
do_machine_check.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/include/asm/traps.h |  2 ++
 arch/x86/kernel/traps.c      | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index eb16a61bfd06..04ba537fc721 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -112,6 +112,8 @@ asmlinkage void mce_threshold_interrupt(void);
 
 extern enum ctx_state ist_enter(struct pt_regs *regs);
 extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
+extern void ist_begin_non_atomic(struct pt_regs *regs);
+extern void ist_end_non_atomic(void);
 
 /* Interrupts/Exceptions */
 enum {
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6a02760df7b4..2b5f2e038e3f 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -140,6 +140,44 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
 		rcu_nmi_exit();
 }
 
+/**
+ * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
+ * @regs:	regs passed to the IST exception handler
+ *
+ * IST exception handlers normally cannot schedule.  As a special
+ * exception, if the exception interrupted userspace code (i.e.
+ * user_mode_vm(regs) would return true) and the exception was not
+ * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
+ * begins a non-atomic section within an ist_enter()/ist_exit() region.
+ * Callers are responsible for enabling interrupts themselves inside
+ * the non-atomic section, and callers must call ist_end_non_atomic()
+ * before ist_exit().
+ */
+void ist_begin_non_atomic(struct pt_regs *regs)
+{
+	BUG_ON(!user_mode_vm(regs));
+
+	/*
+	 * Sanity check: we need to be on the normal thread stack.  This
+	 * will catch asm bugs and any attempt to use ist_begin_non_atomic()
+	 * from double_fault.
+	 */
+	BUG_ON(((current_stack_pointer() ^ this_cpu_read_stable(kernel_stack))
+		& ~(THREAD_SIZE - 1)) != 0);
+
+	preempt_count_sub(HARDIRQ_OFFSET);
+}
+
+/**
+ * ist_end_non_atomic() - end a non-atomic section in an IST exception
+ *
+ * Ends a non-atomic section started with ist_begin_non_atomic().
+ */
+void ist_end_non_atomic(void)
+{
+	preempt_count_add(HARDIRQ_OFFSET);
+}
+
 static nokprobe_inline int
 do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
 		  struct pt_regs *regs,	long error_code)
-- 
1.9.3



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
@ 2014-11-21 21:32   ` Andy Lutomirski
  2014-11-21 22:07     ` Paul E. McKenney
  2014-11-21 22:00   ` Paul E. McKenney
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 21:32 UTC (permalink / raw)
  To: Borislav Petkov, X86 ML, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Andy Lutomirski, Paul E. McKenney, Josh Triplett,
	Frédéric Weisbecker

On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect.  IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic.  IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
>
> Add and use ist_enter and ist_exit to track IST context.  Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.

I should add:

I have no idea why RCU read-side critical sections are safe inside
__do_page_fault today.  It's guarded by exception_enter(), but that
doesn't do anything if context tracking is off, and context tracking
is usually off. What am I missing here?

--Andy

>
> This fixes two issues:
>
>  - Scheduling from an IST interrupt handler will now warn.  It would
>    previously appear to work as long as we got lucky and nothing
>    overwrote the stack frame.  (I don't know of any bugs in this
>    that would trigger the warning, but it's good to be on the safe
>    side.)
>
>  - RCU handling in IST context was dangerous.  As far as I know,
>    only machine checks were likely to trigger this, but it's good to
>    be on the safe side.
>
> Note that the machine check handlers appear to have been missing
> any context tracking at all before this patch.
>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Frédéric Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/include/asm/traps.h         |  4 +++
>  arch/x86/kernel/cpu/mcheck/mce.c     |  5 ++++
>  arch/x86/kernel/cpu/mcheck/p5.c      |  6 +++++
>  arch/x86/kernel/cpu/mcheck/winchip.c |  5 ++++
>  arch/x86/kernel/traps.c              | 49 ++++++++++++++++++++++++++++++------
>  5 files changed, 61 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index bc8352e7010a..eb16a61bfd06 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -1,6 +1,7 @@
>  #ifndef _ASM_X86_TRAPS_H
>  #define _ASM_X86_TRAPS_H
>
> +#include <linux/context_tracking_state.h>
>  #include <linux/kprobes.h>
>
>  #include <asm/debugreg.h>
> @@ -109,6 +110,9 @@ asmlinkage void smp_thermal_interrupt(void);
>  asmlinkage void mce_threshold_interrupt(void);
>  #endif
>
> +extern enum ctx_state ist_enter(struct pt_regs *regs);
> +extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
> +
>  /* Interrupts/Exceptions */
>  enum {
>         X86_TRAP_DE = 0,        /*  0, Divide-by-zero */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 61a9668cebfd..b72509d77337 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -43,6 +43,7 @@
>  #include <linux/export.h>
>
>  #include <asm/processor.h>
> +#include <asm/traps.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>
> @@ -1016,6 +1017,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  {
>         struct mca_config *cfg = &mca_cfg;
>         struct mce m, *final;
> +       enum ctx_state prev_state;
>         int i;
>         int worst = 0;
>         int severity;
> @@ -1038,6 +1040,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>         DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
>         char *msg = "Unknown";
>
> +       prev_state = ist_enter(regs);
> +
>         this_cpu_inc(mce_exception_count);
>
>         if (!cfg->banks)
> @@ -1168,6 +1172,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>         mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>  out:
>         sync_core();
> +       ist_exit(regs, prev_state);
>  }
>  EXPORT_SYMBOL_GPL(do_machine_check);
>
> diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
> index a3042989398c..ec2663a708e4 100644
> --- a/arch/x86/kernel/cpu/mcheck/p5.c
> +++ b/arch/x86/kernel/cpu/mcheck/p5.c
> @@ -8,6 +8,7 @@
>  #include <linux/smp.h>
>
>  #include <asm/processor.h>
> +#include <asm/traps.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>
> @@ -17,8 +18,11 @@ int mce_p5_enabled __read_mostly;
>  /* Machine check handler for Pentium class Intel CPUs: */
>  static void pentium_machine_check(struct pt_regs *regs, long error_code)
>  {
> +       enum ctx_state prev_state;
>         u32 loaddr, hi, lotype;
>
> +       prev_state = ist_enter(regs);
> +
>         rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
>         rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
>
> @@ -33,6 +37,8 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
>         }
>
>         add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +
> +       ist_exit(regs, prev_state);
>  }
>
>  /* Set up machine check reporting for processors with Intel style MCE: */
> diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
> index 7dc5564d0cdf..bd5d46a32210 100644
> --- a/arch/x86/kernel/cpu/mcheck/winchip.c
> +++ b/arch/x86/kernel/cpu/mcheck/winchip.c
> @@ -7,14 +7,19 @@
>  #include <linux/types.h>
>
>  #include <asm/processor.h>
> +#include <asm/traps.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>
>  /* Machine check handler for WinChip C6: */
>  static void winchip_machine_check(struct pt_regs *regs, long error_code)
>  {
> +       enum ctx_state prev_state = ist_enter(regs);
> +
>         printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
>         add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +
> +       ist_exit(regs, prev_state);
>  }
>
>  /* Set up machine check reporting on the Winchip C6 series */
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 0d0e922fafc1..f5c4b8813774 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
>         preempt_count_dec();
>  }
>
> +enum ctx_state ist_enter(struct pt_regs *regs)
> +{
> +       /*
> +        * We are atomic because we're on the IST stack (or we're on x86_32,
> +	 * in which case we still shouldn't schedule).
> +        */
> +       preempt_count_add(HARDIRQ_OFFSET);
> +
> +       if (user_mode_vm(regs)) {
> +               /* Other than that, we're just an exception. */
> +               return exception_enter();
> +       } else {
> +               /*
> +                * We might have interrupted pretty much anything.  In
> +                * fact, if we're a machine check, we can even interrupt
> +                * NMI processing.  We don't want in_nmi() to return true,
> +                * but we need to notify RCU.
> +                */
> +               rcu_nmi_enter();
> +               return IN_KERNEL;  /* the value is irrelevant. */
> +       }
> +}
> +
> +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
> +{
> +       preempt_count_sub(HARDIRQ_OFFSET);
> +
> +       if (user_mode_vm(regs))
> +               return exception_exit(prev_state);
> +       else
> +               rcu_nmi_exit();
> +}
> +
>  static nokprobe_inline int
>  do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
>                   struct pt_regs *regs, long error_code)
> @@ -244,14 +277,14 @@ dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
>  {
>         enum ctx_state prev_state;
>
> -       prev_state = exception_enter();
> +       prev_state = ist_enter(regs);
>         if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
>                        X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
>                 preempt_conditional_sti(regs);
>                 do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs, error_code, NULL);
>                 preempt_conditional_cli(regs);
>         }
> -       exception_exit(prev_state);
> +       ist_exit(regs, prev_state);
>  }
>
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> @@ -259,8 +292,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>         static const char str[] = "double fault";
>         struct task_struct *tsk = current;
>
> -       exception_enter();
> -       /* Return not checked because double check cannot be ignored */
> +       ist_enter(regs);
> +       /* Return not checked because double fault cannot be ignored */
>         notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
>
>         tsk->thread.error_code = error_code;
> @@ -343,7 +376,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>         if (poke_int3_handler(regs))
>                 return;
>
> -       prev_state = exception_enter();
> +       prev_state = ist_enter(regs);
>  #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
>         if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
>                                 SIGTRAP) == NOTIFY_STOP)
> @@ -369,7 +402,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>         preempt_conditional_cli(regs);
>         debug_stack_usage_dec();
>  exit:
> -       exception_exit(prev_state);
> +       ist_exit(regs, prev_state);
>  }
>  NOKPROBE_SYMBOL(do_int3);
>
> @@ -433,7 +466,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
>         unsigned long dr6;
>         int si_code;
>
> -       prev_state = exception_enter();
> +       prev_state = ist_enter(regs);
>
>         get_debugreg(dr6, 6);
>
> @@ -508,7 +541,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
>         debug_stack_usage_dec();
>
>  exit:
> -       exception_exit(prev_state);
> +       ist_exit(regs, prev_state);
>  }
>  NOKPROBE_SYMBOL(do_debug);
>
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
  2014-11-21 21:32   ` Andy Lutomirski
@ 2014-11-21 22:00   ` Paul E. McKenney
  2014-11-22 17:20   ` Borislav Petkov
  2015-01-22 21:52   ` Sasha Levin
  3 siblings, 0 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-21 22:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, x86, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 01:26:08PM -0800, Andy Lutomirski wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect.  IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic.  IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
> 
> Add and use ist_enter and ist_exit to track IST context.  Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.
> 
> This fixes two issues:
> 
>  - Scheduling from an IST interrupt handler will now warn.  It would
>    previously appear to work as long as we got lucky and nothing
>    overwrote the stack frame.  (I don't know of any bugs in this
>    that would trigger the warning, but it's good to be on the safe
>    side.)
> 
>  - RCU handling in IST context was dangerous.  As far as I know,
>    only machine checks were likely to trigger this, but it's good to
>    be on the safe side.
> 
> Note that the machine check handlers appear to have been missing
> any context tracking at all before this patch.
> 
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Frédéric Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Just out of curiosity...  Can an NMI occur in IST context?  If it can,
I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
nested calls.

							Thanx, Paul

> ---
>  arch/x86/include/asm/traps.h         |  4 +++
>  arch/x86/kernel/cpu/mcheck/mce.c     |  5 ++++
>  arch/x86/kernel/cpu/mcheck/p5.c      |  6 +++++
>  arch/x86/kernel/cpu/mcheck/winchip.c |  5 ++++
>  arch/x86/kernel/traps.c              | 49 ++++++++++++++++++++++++++++++------
>  5 files changed, 61 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index bc8352e7010a..eb16a61bfd06 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -1,6 +1,7 @@
>  #ifndef _ASM_X86_TRAPS_H
>  #define _ASM_X86_TRAPS_H
> 
> +#include <linux/context_tracking_state.h>
>  #include <linux/kprobes.h>
> 
>  #include <asm/debugreg.h>
> @@ -109,6 +110,9 @@ asmlinkage void smp_thermal_interrupt(void);
>  asmlinkage void mce_threshold_interrupt(void);
>  #endif
> 
> +extern enum ctx_state ist_enter(struct pt_regs *regs);
> +extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
> +
>  /* Interrupts/Exceptions */
>  enum {
>  	X86_TRAP_DE = 0,	/*  0, Divide-by-zero */
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 61a9668cebfd..b72509d77337 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -43,6 +43,7 @@
>  #include <linux/export.h>
> 
>  #include <asm/processor.h>
> +#include <asm/traps.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
> 
> @@ -1016,6 +1017,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  {
>  	struct mca_config *cfg = &mca_cfg;
>  	struct mce m, *final;
> +	enum ctx_state prev_state;
>  	int i;
>  	int worst = 0;
>  	int severity;
> @@ -1038,6 +1040,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
>  	char *msg = "Unknown";
> 
> +	prev_state = ist_enter(regs);
> +
>  	this_cpu_inc(mce_exception_count);
> 
>  	if (!cfg->banks)
> @@ -1168,6 +1172,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>  out:
>  	sync_core();
> +	ist_exit(regs, prev_state);
>  }
>  EXPORT_SYMBOL_GPL(do_machine_check);
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
> index a3042989398c..ec2663a708e4 100644
> --- a/arch/x86/kernel/cpu/mcheck/p5.c
> +++ b/arch/x86/kernel/cpu/mcheck/p5.c
> @@ -8,6 +8,7 @@
>  #include <linux/smp.h>
> 
>  #include <asm/processor.h>
> +#include <asm/traps.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
> 
> @@ -17,8 +18,11 @@ int mce_p5_enabled __read_mostly;
>  /* Machine check handler for Pentium class Intel CPUs: */
>  static void pentium_machine_check(struct pt_regs *regs, long error_code)
>  {
> +	enum ctx_state prev_state;
>  	u32 loaddr, hi, lotype;
> 
> +	prev_state = ist_enter(regs);
> +
>  	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
>  	rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
> 
> @@ -33,6 +37,8 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
>  	}
> 
>  	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +
> +	ist_exit(regs, prev_state);
>  }
> 
>  /* Set up machine check reporting for processors with Intel style MCE: */
> diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
> index 7dc5564d0cdf..bd5d46a32210 100644
> --- a/arch/x86/kernel/cpu/mcheck/winchip.c
> +++ b/arch/x86/kernel/cpu/mcheck/winchip.c
> @@ -7,14 +7,19 @@
>  #include <linux/types.h>
> 
>  #include <asm/processor.h>
> +#include <asm/traps.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
> 
>  /* Machine check handler for WinChip C6: */
>  static void winchip_machine_check(struct pt_regs *regs, long error_code)
>  {
> +	enum ctx_state prev_state = ist_enter(regs);
> +
>  	printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
>  	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> +
> +	ist_exit(regs, prev_state);
>  }
> 
>  /* Set up machine check reporting on the Winchip C6 series */
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 0d0e922fafc1..f5c4b8813774 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
>  	preempt_count_dec();
>  }
> 
> +enum ctx_state ist_enter(struct pt_regs *regs)
> +{
> +	/*
> +	 * We are atomic because we're on the IST stack (or we're on x86_32,
> +	 * in which case we still shouldn't schedule).
> +	 */
> +	preempt_count_add(HARDIRQ_OFFSET);
> +
> +	if (user_mode_vm(regs)) {
> +		/* Other than that, we're just an exception. */
> +		return exception_enter();
> +	} else {
> +		/*
> +		 * We might have interrupted pretty much anything.  In
> +		 * fact, if we're a machine check, we can even interrupt
> +		 * NMI processing.  We don't want in_nmi() to return true,
> +		 * but we need to notify RCU.
> +		 */
> +		rcu_nmi_enter();
> +		return IN_KERNEL;  /* the value is irrelevant. */
> +	}
> +}
> +
> +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
> +{
> +	preempt_count_sub(HARDIRQ_OFFSET);
> +
> +	if (user_mode_vm(regs))
> +		return exception_exit(prev_state);
> +	else
> +		rcu_nmi_exit();
> +}
> +
>  static nokprobe_inline int
>  do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
>  		  struct pt_regs *regs,	long error_code)
> @@ -244,14 +277,14 @@ dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
>  {
>  	enum ctx_state prev_state;
> 
> -	prev_state = exception_enter();
> +	prev_state = ist_enter(regs);
>  	if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
>  		       X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
>  		preempt_conditional_sti(regs);
>  		do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs, error_code, NULL);
>  		preempt_conditional_cli(regs);
>  	}
> -	exception_exit(prev_state);
> +	ist_exit(regs, prev_state);
>  }
> 
>  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> @@ -259,8 +292,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  	static const char str[] = "double fault";
>  	struct task_struct *tsk = current;
> 
> -	exception_enter();
> -	/* Return not checked because double check cannot be ignored */
> +	ist_enter(regs);
> +	/* Return not checked because double fault cannot be ignored */
>  	notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> 
>  	tsk->thread.error_code = error_code;
> @@ -343,7 +376,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>  	if (poke_int3_handler(regs))
>  		return;
> 
> -	prev_state = exception_enter();
> +	prev_state = ist_enter(regs);
>  #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
>  	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
>  				SIGTRAP) == NOTIFY_STOP)
> @@ -369,7 +402,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
>  	preempt_conditional_cli(regs);
>  	debug_stack_usage_dec();
>  exit:
> -	exception_exit(prev_state);
> +	ist_exit(regs, prev_state);
>  }
>  NOKPROBE_SYMBOL(do_int3);
> 
> @@ -433,7 +466,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
>  	unsigned long dr6;
>  	int si_code;
> 
> -	prev_state = exception_enter();
> +	prev_state = ist_enter(regs);
> 
>  	get_debugreg(dr6, 6);
> 
> @@ -508,7 +541,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
>  	debug_stack_usage_dec();
> 
>  exit:
> -	exception_exit(prev_state);
> +	ist_exit(regs, prev_state);
>  }
>  NOKPROBE_SYMBOL(do_debug);
> 
> -- 
> 1.9.3
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 21:32   ` Andy Lutomirski
@ 2014-11-21 22:07     ` Paul E. McKenney
  2014-11-21 22:19       ` Andy Lutomirski
  2014-11-21 22:20       ` Frederic Weisbecker
  0 siblings, 2 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-21 22:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > We currently pretend that IST context is like standard exception
> > context, but this is incorrect.  IST entries from userspace are like
> > standard exceptions except that they use per-cpu stacks, so they are
> > atomic.  IST entries from kernel space are like NMIs from RCU's
> > perspective -- they are not quiescent states even if they
> > interrupted the kernel during a quiescent state.
> >
> > Add and use ist_enter and ist_exit to track IST context.  Even
> > though x86_32 has no IST stacks, we track these interrupts the same
> > way.
> 
> I should add:
> 
> I have no idea why RCU read-side critical sections are safe inside
> __do_page_fault today.  It's guarded by exception_enter(), but that
> doesn't do anything if context tracking is off, and context tracking
> is usually off. What am I missing here?

Ah!  There are three cases:

1.	Context tracking is off on a non-idle CPU.  In this case, RCU is
	still paying attention to CPUs running in both userspace and in
	the kernel.  So if a page fault happens, RCU will be set up to
	notice any RCU read-side critical sections.

2.	Context tracking is on on a non-idle CPU.  In this case, RCU
	might well be ignoring userspace execution: NO_HZ_FULL and
	all that.  However, as you pointed out, in this case the
	context-tracking code lets RCU know that we have entered the
	kernel, which means that RCU will again be paying attention to
	RCU read-side critical sections.

3.	The CPU is idle.  In this case, RCU is ignoring the CPU, so
	if we take a page fault when context tracking is off, life
	will be hard.  But the kernel is not supposed to take page
	faults in the idle loop, so this is not a problem.

Make sense?

							Thanx, Paul

> --Andy
> 
> >
> > This fixes two issues:
> >
> >  - Scheduling from an IST interrupt handler will now warn.  It would
> >    previously appear to work as long as we got lucky and nothing
> >    overwrote the stack frame.  (I don't know of any bugs in this
> >    that would trigger the warning, but it's good to be on the safe
> >    side.)
> >
> >  - RCU handling in IST context was dangerous.  As far as I know,
> >    only machine checks were likely to trigger this, but it's good to
> >    be on the safe side.
> >
> > Note that the machine check handlers appear to have been missing
> > any context tracking at all before this patch.
> >
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Josh Triplett <josh@joshtriplett.org>
> > Cc: Frédéric Weisbecker <fweisbec@gmail.com>
> > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > ---
> >  arch/x86/include/asm/traps.h         |  4 +++
> >  arch/x86/kernel/cpu/mcheck/mce.c     |  5 ++++
> >  arch/x86/kernel/cpu/mcheck/p5.c      |  6 +++++
> >  arch/x86/kernel/cpu/mcheck/winchip.c |  5 ++++
> >  arch/x86/kernel/traps.c              | 49 ++++++++++++++++++++++++++++++------
> >  5 files changed, 61 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> > index bc8352e7010a..eb16a61bfd06 100644
> > --- a/arch/x86/include/asm/traps.h
> > +++ b/arch/x86/include/asm/traps.h
> > @@ -1,6 +1,7 @@
> >  #ifndef _ASM_X86_TRAPS_H
> >  #define _ASM_X86_TRAPS_H
> >
> > +#include <linux/context_tracking_state.h>
> >  #include <linux/kprobes.h>
> >
> >  #include <asm/debugreg.h>
> > @@ -109,6 +110,9 @@ asmlinkage void smp_thermal_interrupt(void);
> >  asmlinkage void mce_threshold_interrupt(void);
> >  #endif
> >
> > +extern enum ctx_state ist_enter(struct pt_regs *regs);
> > +extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
> > +
> >  /* Interrupts/Exceptions */
> >  enum {
> >         X86_TRAP_DE = 0,        /*  0, Divide-by-zero */
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > index 61a9668cebfd..b72509d77337 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -43,6 +43,7 @@
> >  #include <linux/export.h>
> >
> >  #include <asm/processor.h>
> > +#include <asm/traps.h>
> >  #include <asm/mce.h>
> >  #include <asm/msr.h>
> >
> > @@ -1016,6 +1017,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> >  {
> >         struct mca_config *cfg = &mca_cfg;
> >         struct mce m, *final;
> > +       enum ctx_state prev_state;
> >         int i;
> >         int worst = 0;
> >         int severity;
> > @@ -1038,6 +1040,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> >         DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
> >         char *msg = "Unknown";
> >
> > +       prev_state = ist_enter(regs);
> > +
> >         this_cpu_inc(mce_exception_count);
> >
> >         if (!cfg->banks)
> > @@ -1168,6 +1172,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> >         mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> >  out:
> >         sync_core();
> > +       ist_exit(regs, prev_state);
> >  }
> >  EXPORT_SYMBOL_GPL(do_machine_check);
> >
> > diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mcheck/p5.c
> > index a3042989398c..ec2663a708e4 100644
> > --- a/arch/x86/kernel/cpu/mcheck/p5.c
> > +++ b/arch/x86/kernel/cpu/mcheck/p5.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/smp.h>
> >
> >  #include <asm/processor.h>
> > +#include <asm/traps.h>
> >  #include <asm/mce.h>
> >  #include <asm/msr.h>
> >
> > @@ -17,8 +18,11 @@ int mce_p5_enabled __read_mostly;
> >  /* Machine check handler for Pentium class Intel CPUs: */
> >  static void pentium_machine_check(struct pt_regs *regs, long error_code)
> >  {
> > +       enum ctx_state prev_state;
> >         u32 loaddr, hi, lotype;
> >
> > +       prev_state = ist_enter(regs);
> > +
> >         rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
> >         rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
> >
> > @@ -33,6 +37,8 @@ static void pentium_machine_check(struct pt_regs *regs, long error_code)
> >         }
> >
> >         add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> > +
> > +       ist_exit(regs, prev_state);
> >  }
> >
> >  /* Set up machine check reporting for processors with Intel style MCE: */
> > diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mcheck/winchip.c
> > index 7dc5564d0cdf..bd5d46a32210 100644
> > --- a/arch/x86/kernel/cpu/mcheck/winchip.c
> > +++ b/arch/x86/kernel/cpu/mcheck/winchip.c
> > @@ -7,14 +7,19 @@
> >  #include <linux/types.h>
> >
> >  #include <asm/processor.h>
> > +#include <asm/traps.h>
> >  #include <asm/mce.h>
> >  #include <asm/msr.h>
> >
> >  /* Machine check handler for WinChip C6: */
> >  static void winchip_machine_check(struct pt_regs *regs, long error_code)
> >  {
> > +       enum ctx_state prev_state = ist_enter(regs);
> > +
> >         printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
> >         add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
> > +
> > +       ist_exit(regs, prev_state);
> >  }
> >
> >  /* Set up machine check reporting on the Winchip C6 series */
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index 0d0e922fafc1..f5c4b8813774 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
> >         preempt_count_dec();
> >  }
> >
> > +enum ctx_state ist_enter(struct pt_regs *regs)
> > +{
> > +       /*
> > +        * We are atomic because we're on the IST stack (or we're on x86_32,
> > +        * in which case we still shouldn't schedule).
> > +        */
> > +       preempt_count_add(HARDIRQ_OFFSET);
> > +
> > +       if (user_mode_vm(regs)) {
> > +               /* Other than that, we're just an exception. */
> > +               return exception_enter();
> > +       } else {
> > +               /*
> > +                * We might have interrupted pretty much anything.  In
> > +                * fact, if we're a machine check, we can even interrupt
> > +                * NMI processing.  We don't want in_nmi() to return true,
> > +                * but we need to notify RCU.
> > +                */
> > +               rcu_nmi_enter();
> > +               return IN_KERNEL;  /* the value is irrelevant. */
> > +       }
> > +}
> > +
> > +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
> > +{
> > +       preempt_count_sub(HARDIRQ_OFFSET);
> > +
> > +       if (user_mode_vm(regs))
> > +               return exception_exit(prev_state);
> > +       else
> > +               rcu_nmi_exit();
> > +}
> > +
> >  static nokprobe_inline int
> >  do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
> >                   struct pt_regs *regs, long error_code)
> > @@ -244,14 +277,14 @@ dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
> >  {
> >         enum ctx_state prev_state;
> >
> > -       prev_state = exception_enter();
> > +       prev_state = ist_enter(regs);
> >         if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
> >                        X86_TRAP_SS, SIGBUS) != NOTIFY_STOP) {
> >                 preempt_conditional_sti(regs);
> >                 do_trap(X86_TRAP_SS, SIGBUS, "stack segment", regs, error_code, NULL);
> >                 preempt_conditional_cli(regs);
> >         }
> > -       exception_exit(prev_state);
> > +       ist_exit(regs, prev_state);
> >  }
> >
> >  dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> > @@ -259,8 +292,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> >         static const char str[] = "double fault";
> >         struct task_struct *tsk = current;
> >
> > -       exception_enter();
> > -       /* Return not checked because double check cannot be ignored */
> > +       ist_enter(regs);
> > +       /* Return not checked because double fault cannot be ignored */
> >         notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
> >
> >         tsk->thread.error_code = error_code;
> > @@ -343,7 +376,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >         if (poke_int3_handler(regs))
> >                 return;
> >
> > -       prev_state = exception_enter();
> > +       prev_state = ist_enter(regs);
> >  #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
> >         if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
> >                                 SIGTRAP) == NOTIFY_STOP)
> > @@ -369,7 +402,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
> >         preempt_conditional_cli(regs);
> >         debug_stack_usage_dec();
> >  exit:
> > -       exception_exit(prev_state);
> > +       ist_exit(regs, prev_state);
> >  }
> >  NOKPROBE_SYMBOL(do_int3);
> >
> > @@ -433,7 +466,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
> >         unsigned long dr6;
> >         int si_code;
> >
> > -       prev_state = exception_enter();
> > +       prev_state = ist_enter(regs);
> >
> >         get_debugreg(dr6, 6);
> >
> > @@ -508,7 +541,7 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
> >         debug_stack_usage_dec();
> >
> >  exit:
> > -       exception_exit(prev_state);
> > +       ist_exit(regs, prev_state);
> >  }
> >  NOKPROBE_SYMBOL(do_debug);
> >
> > --
> > 1.9.3
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 22:07     ` Paul E. McKenney
@ 2014-11-21 22:19       ` Andy Lutomirski
  2014-11-21 22:55         ` Paul E. McKenney
  2014-11-21 22:20       ` Frederic Weisbecker
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 22:19 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > We currently pretend that IST context is like standard exception
>> > context, but this is incorrect.  IST entries from userspace are like
>> > standard exceptions except that they use per-cpu stacks, so they are
>> > atomic.  IST entries from kernel space are like NMIs from RCU's
>> > perspective -- they are not quiescent states even if they
>> > interrupted the kernel during a quiescent state.
>> >
>> > Add and use ist_enter and ist_exit to track IST context.  Even
>> > though x86_32 has no IST stacks, we track these interrupts the same
>> > way.
>>
>> I should add:
>>
>> I have no idea why RCU read-side critical sections are safe inside
>> __do_page_fault today.  It's guarded by exception_enter(), but that
>> doesn't do anything if context tracking is off, and context tracking
>> is usually off. What am I missing here?
>
> Ah!  There are three cases:
>
> 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
>         still paying attention to CPUs running in both userspace and in
>         the kernel.  So if a page fault happens, RCU will be set up to
>         notice any RCU read-side critical sections.
>
> 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
>         might well be ignoring userspace execution: NO_HZ_FULL and
>         all that.  However, as you pointed out, in this case the
>         context-tracking code lets RCU know that we have entered the
>         kernel, which means that RCU will again be paying attention to
>         RCU read-side critical sections.
>
> 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
>         if we take a page fault when context tracking is off, life
>         will be hard.  But the kernel is not supposed to take page
>         faults in the idle loop, so this is not a problem.
>

I guess so, as long as there are really no page faults in the idle loop.

There are, however, machine checks in the idle loop, and maybe kprobes
(haven't checked), so I think this patch might fix real bugs.

> Just out of curiosity...  Can an NMI occur in IST context?  If it can,
> I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> nested calls.

Yes, and vice versa.  That code looked like it handled nesting
correctly, but I wasn't entirely sure.

Also, just to make sure: are we okay if rcu_nmi_enter() is called
before exception_enter if context tracking is on and we came directly
from userspace?

--Andy
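
Paul's three cases boil down to one predicate: does RCU notice read-side
critical sections in a page fault taken from a given context?  A toy
sketch (illustrative only -- the enum and function names here are
invented, this is not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

enum cpu_ctx { CTX_KERNEL, CTX_USER, CTX_IDLE };

/* Will RCU be paying attention during a page fault taken from ctx? */
static bool fault_rcu_watched(enum cpu_ctx ctx, bool context_tracking_on)
{
	if (ctx == CTX_IDLE)
		return false;  /* case 3: RCU ignores idle CPUs entirely */
	if (ctx == CTX_USER && context_tracking_on)
		return true;   /* case 2: exception_enter() wakes RCU up */
	return true;           /* case 1: tracking off, RCU watches anyway */
}
```

The only losing combination is an entry from idle, which is exactly why
machine checks (and possibly kprobes) in the idle loop are the worrying
case.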


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 22:07     ` Paul E. McKenney
  2014-11-21 22:19       ` Andy Lutomirski
@ 2014-11-21 22:20       ` Frederic Weisbecker
  1 sibling, 0 replies; 72+ messages in thread
From: Frederic Weisbecker @ 2014-11-21 22:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett

On Fri, Nov 21, 2014 at 02:07:04PM -0800, Paul E. McKenney wrote:
> On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> > On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > > We currently pretend that IST context is like standard exception
> > > context, but this is incorrect.  IST entries from userspace are like
> > > standard exceptions except that they use per-cpu stacks, so they are
> > > atomic.  IST entries from kernel space are like NMIs from RCU's
> > > perspective -- they are not quiescent states even if they
> > > interrupted the kernel during a quiescent state.
> > >
> > > Add and use ist_enter and ist_exit to track IST context.  Even
> > > though x86_32 has no IST stacks, we track these interrupts the same
> > > way.
> > 
> > I should add:
> > 
> > I have no idea why RCU read-side critical sections are safe inside
> > __do_page_fault today.  It's guarded by exception_enter(), but that
> > doesn't do anything if context tracking is off, and context tracking
> > is usually off. What am I missing here?
> 
> Ah!  There are three cases:
> 
> 1.	Context tracking is off on a non-idle CPU.  In this case, RCU is
> 	still paying attention to CPUs running in both userspace and in
> 	the kernel.  So if a page fault happens, RCU will be set up to
> 	notice any RCU read-side critical sections.
> 
> 2.	Context tracking is on on a non-idle CPU.  In this case, RCU
> 	might well be ignoring userspace execution: NO_HZ_FULL and
> 	all that.  However, as you pointed out, in this case the
> 	context-tracking code lets RCU know that we have entered the
> 	kernel, which means that RCU will again be paying attention to
> 	RCU read-side critical sections.
> 
> 3.	The CPU is idle.  In this case, RCU is ignoring the CPU, so
> 	if we take a page fault when context tracking is off, life
> 	will be hard.  But the kernel is not supposed to take page
> 	faults in the idle loop, so this is not a problem.
> 
> Make sense?

To zoom out the picture for Andy: context tracking is not used in 99%
of all workloads.  It's only used for NO_HZ_FULL.  RCU normally relies
on the tick to poll for RCU activity.  But when we run in userspace,
RCU isn't used and thus doesn't need the tick, which we can then stop.
So context tracking is there to tell RCU about CPUs crossing the
user/kernel boundary.

These hooks also account the cputime spent in userspace and in
kernelspace in the absence of a tick.
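
The division of labor described above can be sketched as a toy model
(the struct and hook names are invented and grossly simplified -- in
particular, real tick restart is demand-driven, not unconditional):

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_state {
	bool rcu_watching;  /* is RCU paying attention to this CPU? */
	bool tick_on;       /* is the periodic tick running? */
};

/* Crossing into userspace on a NO_HZ_FULL CPU: RCU stops watching,
 * so the tick that RCU needed can be stopped too. */
static void ct_user_enter(struct cpu_state *c)
{
	c->rcu_watching = false;
	c->tick_on = false;
}

/* Crossing back into the kernel (syscall, exception_enter(), ...):
 * tell RCU we're back before any rcu_read_lock() can run. */
static void ct_user_exit(struct cpu_state *c)
{
	c->rcu_watching = true;
	c->tick_on = true;
}
```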


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 22:19       ` Andy Lutomirski
@ 2014-11-21 22:55         ` Paul E. McKenney
  2014-11-21 23:06           ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-21 22:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> > We currently pretend that IST context is like standard exception
> >> > context, but this is incorrect.  IST entries from userspace are like
> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
> >> > perspective -- they are not quiescent states even if they
> >> > interrupted the kernel during a quiescent state.
> >> >
> >> > Add and use ist_enter and ist_exit to track IST context.  Even
> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> > way.
> >>
> >> I should add:
> >>
> >> I have no idea why RCU read-side critical sections are safe inside
> >> __do_page_fault today.  It's guarded by exception_enter(), but that
> >> doesn't do anything if context tracking is off, and context tracking
> >> is usually off. What am I missing here?
> >
> > Ah!  There are three cases:
> >
> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
> >         still paying attention to CPUs running in both userspace and in
> >         the kernel.  So if a page fault happens, RCU will be set up to
> >         notice any RCU read-side critical sections.
> >
> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
> >         might well be ignoring userspace execution: NO_HZ_FULL and
> >         all that.  However, as you pointed out, in this case the
> >         context-tracking code lets RCU know that we have entered the
> >         kernel, which means that RCU will again be paying attention to
> >         RCU read-side critical sections.
> >
> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
> >         if we take a page fault when context tracking is off, life
> >         will be hard.  But the kernel is not supposed to take page
> >         faults in the idle loop, so this is not a problem.
> 
> I guess so, as long as there are really no page faults in the idle loop.

As far as I know, there are not.  If there are, someone needs to let
me know!  ;-)

> There are, however, machine checks in the idle loop, and maybe kprobes
> (haven't checked), so I think this patch might fix real bugs.

If you can get ISTs from the idle loop, then the patch is needed.

> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> > nested calls.
> 
> Yes, and vice versa.  That code looked like it handled nesting
> correctly, but I wasn't entirely sure.

It currently does not, please see below patch.  Are you able to test
nesting?  It would be really cool if you could do so -- I have no
way to test this patch.

> Also, just to make sure: are we okay if rcu_nmi_enter() is called
> before exception_enter if context tracking is on and we came directly
> from userspace?

If I understand correctly, this will result in context tracking invoking
rcu_user_enter(), which will result in the rcu_dynticks counter having an
odd value.  In that case, rcu_nmi_enter() will notice that RCU is already
paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
& 0x1), and will thus just return.  The matching rcu_nmi_exit() will
notice that the nesting count is zero, and will also just return.

Thus, everything works in that case.

In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
would see that RCU is not paying attention to this CPU and that the
NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
It would then atomically increment rdtp->dynticks, forcing RCU to start
paying attention to this CPU.  The matching rcu_nmi_exit() will see
that the nesting count was non-zero, but became zero when decremented.
This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
which will tell RCU to stop paying attention to this CPU.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_nmi_enter() handle nesting

Andy Lutomirski is introducing ISTs into x86, which from RCU's
viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter()
and rcu_nmi_exit() must now correctly handle nesting.  As luck would
have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
This patch therefore makes rcu_nmi_enter() handle nesting.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8749f43f3f05..875421aff6e3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
 	if (rdtp->dynticks_nmi_nesting == 0 &&
 	    (atomic_read(&rdtp->dynticks) & 0x1))
 		return;
-	rdtp->dynticks_nmi_nesting++;
+	if (rdtp->dynticks_nmi_nesting++ != 0)
+		return; /* Nested NMI/IST/whatever. */
 	smp_mb__before_atomic();  /* Force delay from prior write. */
 	atomic_inc(&rdtp->dynticks);
 	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
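
The counter logic above can be checked with a single-CPU toy model (a
sketch only: barriers, atomics, and per-CPU plumbing are omitted, and
the exit side is reconstructed from the description rather than quoted
from the tree):

```c
#include <assert.h>

static int dynticks;             /* odd: RCU watching this CPU; even: not */
static int dynticks_nmi_nesting; /* NMI/IST nesting depth */

/* rcu_nmi_enter() with the nesting fix applied */
static void model_rcu_nmi_enter(void)
{
	if (dynticks_nmi_nesting == 0 && (dynticks & 0x1))
		return;          /* RCU already watching, nothing to do */
	if (dynticks_nmi_nesting++ != 0)
		return;          /* nested NMI/IST: just count it */
	dynticks++;              /* even -> odd: start watching */
}

static void model_rcu_nmi_exit(void)
{
	if (dynticks_nmi_nesting == 0 ||
	    --dynticks_nmi_nesting != 0)
		return;          /* wasn't ours to undo, or still nested */
	dynticks++;              /* odd -> even: stop watching */
}
```

With the fix, a nested enter/exit pair only adjusts the nesting count,
so dynticks flips exactly once per outermost entry; from a context where
RCU is already watching (dynticks odd, nesting zero), both calls are
no-ops.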



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 22:55         ` Paul E. McKenney
@ 2014-11-21 23:06           ` Andy Lutomirski
  2014-11-21 23:38             ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-21 23:06 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> > We currently pretend that IST context is like standard exception
>> >> > context, but this is incorrect.  IST entries from userspace are like
>> >> > standard exceptions except that they use per-cpu stacks, so they are
>> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
>> >> > perspective -- they are not quiescent states even if they
>> >> > interrupted the kernel during a quiescent state.
>> >> >
>> >> > Add and use ist_enter and ist_exit to track IST context.  Even
>> >> > though x86_32 has no IST stacks, we track these interrupts the same
>> >> > way.
>> >>
>> >> I should add:
>> >>
>> >> I have no idea why RCU read-side critical sections are safe inside
>> >> __do_page_fault today.  It's guarded by exception_enter(), but that
>> >> doesn't do anything if context tracking is off, and context tracking
>> >> is usually off. What am I missing here?
>> >
>> > Ah!  There are three cases:
>> >
>> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
>> >         still paying attention to CPUs running in both userspace and in
>> >         the kernel.  So if a page fault happens, RCU will be set up to
>> >         notice any RCU read-side critical sections.
>> >
>> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
>> >         might well be ignoring userspace execution: NO_HZ_FULL and
>> >         all that.  However, as you pointed out, in this case the
>> >         context-tracking code lets RCU know that we have entered the
>> >         kernel, which means that RCU will again be paying attention to
>> >         RCU read-side critical sections.
>> >
>> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
>> >         if we take a page fault when context tracking is off, life
>> >         will be hard.  But the kernel is not supposed to take page
>> >         faults in the idle loop, so this is not a problem.
>>
>> I guess so, as long as there are really no page faults in the idle loop.
>
> As far as I know, there are not.  If there are, someone needs to let
> me know!  ;-)
>
>> There are, however, machine checks in the idle loop, and maybe kprobes
>> (haven't checked), so I think this patch might fix real bugs.
>
> If you can get ISTs from the idle loop, then the patch is needed.
>
>> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
>> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
>> > nested calls.
>>
>> Yes, and vice versa.  That code looked like it handled nesting
>> correctly, but I wasn't entirely sure.
>
> It currently does not, please see below patch.  Are you able to test
> nesting?  It would be really cool if you could do so -- I have no
> way to test this patch.

I can try.  It's sort of easy -- I'll put an int3 into do_nmi and add
a fixup to avoid crashing.

What should I look for?  Should I try to force full nohz on and assert
something?  I don't really know how to make full nohz work.

>
>> Also, just to make sure: are we okay if rcu_nmi_enter() is called
>> before exception_enter if context tracking is on and we came directly
>> from userspace?
>
> If I understand correctly, this will result in context tracking invoking
> rcu_user_enter(), which will result in the rcu_dynticks counter having an
> odd value.  In that case, rcu_nmi_enter() will notice that RCU is already
> paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
> & 0x1), and will thus just return.  The matching rcu_nmi_exit() will
> notice that the nesting count is zero, and will also just return.
>
> Thus, everything works in that case.
>
> In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
> would see that RCU is not paying attention to this CPU and that the
> NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> It would then atomically increment rdtp->dynticks, forcing RCU to start
> paying attention to this CPU.  The matching rcu_nmi_exit() will see
> that the nesting count was non-zero, but became zero when decremented.
> This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
> which will tell RCU to stop paying attention to this CPU.
>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_nmi_enter() handle nesting
>
> Andy Lutomirski is introducing ISTs into x86, which from RCU's
> viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter()
> and rcu_nmi_exit() must now correctly handle nesting.  As luck would
> have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
> This patch therefore makes rcu_nmi_enter() handle nesting.

Thanks.  Should I add this to v5 of my series?

--Andy

>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..875421aff6e3 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
>         if (rdtp->dynticks_nmi_nesting == 0 &&
>             (atomic_read(&rdtp->dynticks) & 0x1))
>                 return;
> -       rdtp->dynticks_nmi_nesting++;
> +       if (rdtp->dynticks_nmi_nesting++ != 0)
> +               return; /* Nested NMI/IST/whatever. */
>         smp_mb__before_atomic();  /* Force delay from prior write. */
>         atomic_inc(&rdtp->dynticks);
>         /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 23:06           ` Andy Lutomirski
@ 2014-11-21 23:38             ` Paul E. McKenney
  2014-11-22  2:00               ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-21 23:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> > We currently pretend that IST context is like standard exception
> >> >> > context, but this is incorrect.  IST entries from userspace are like
> >> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
> >> >> > perspective -- they are not quiescent states even if they
> >> >> > interrupted the kernel during a quiescent state.
> >> >> >
> >> >> > Add and use ist_enter and ist_exit to track IST context.  Even
> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> >> > way.
> >> >>
> >> >> I should add:
> >> >>
> >> >> I have no idea why RCU read-side critical sections are safe inside
> >> >> __do_page_fault today.  It's guarded by exception_enter(), but that
> >> >> doesn't do anything if context tracking is off, and context tracking
> >> >> is usually off. What am I missing here?
> >> >
> >> > Ah!  There are three cases:
> >> >
> >> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
> >> >         still paying attention to CPUs running in both userspace and in
> >> >         the kernel.  So if a page fault happens, RCU will be set up to
> >> >         notice any RCU read-side critical sections.
> >> >
> >> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
> >> >         might well be ignoring userspace execution: NO_HZ_FULL and
> >> >         all that.  However, as you pointed out, in this case the
> >> >         context-tracking code lets RCU know that we have entered the
> >> >         kernel, which means that RCU will again be paying attention to
> >> >         RCU read-side critical sections.
> >> >
> >> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
> >> >         if we take a page fault when context tracking is off, life
> >> >         will be hard.  But the kernel is not supposed to take page
> >> >         faults in the idle loop, so this is not a problem.
> >>
> >> I guess so, as long as there are really no page faults in the idle loop.
> >
> > As far as I know, there are not.  If there are, someone needs to let
> > me know!  ;-)
> >
> >> There are, however, machine checks in the idle loop, and maybe kprobes
> >> (haven't checked), so I think this patch might fix real bugs.
> >
> > If you can get ISTs from the idle loop, then the patch is needed.
> >
> >> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> >> > nested calls.
> >>
> >> Yes, and vice versa.  That code looked like it handled nesting
> >> correctly, but I wasn't entirely sure.
> >
> > It currently does not, please see below patch.  Are you able to test
> > nesting?  It would be really cool if you could do so -- I have no
> > way to test this patch.
> 
> I can try.  It's sort of easy -- I'll put an int3 into do_nmi and add
> a fixup to avoid crashing.
> 
> What should I look for?  Should I try to force full nohz on and assert
> something?  I don't really know how to make full nohz work.

You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
rcu_nmi_exit() to fire.

							Thanx, Paul

> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
> >> before exception_enter if context tracking is on and we came directly
> >> from userspace?
> >
> > If I understand correctly, this will result in context tracking invoking
> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
> > odd value.  In that case, rcu_nmi_enter() will notice that RCU is already
> > paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
> > & 0x1), and will thus just return.  The matching rcu_nmi_exit() will
> > notice that the nesting count is zero, and will also just return.
> >
> > Thus, everything works in that case.
> >
> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
> > would see that RCU is not paying attention to this CPU and that the
> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> > It would then atomically increment rdtp->dynticks, forcing RCU to start
> > paying attention to this CPU.  The matching rcu_nmi_exit() will see
> > that the nesting count was non-zero, but became zero when decremented.
> > This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
> > which will tell RCU to stop paying attention to this CPU.
> >
> >                                                         Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > rcu: Make rcu_nmi_enter() handle nesting
> >
> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
> > viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter()
> > and rcu_nmi_exit() must now correctly handle nesting.  As luck would
> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
> > This patch therefore makes rcu_nmi_enter() handle nesting.
> 
> Thanks.  Should I add this to v5 of my series?
> 
> --Andy
> 
> >
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8749f43f3f05..875421aff6e3 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
> >         if (rdtp->dynticks_nmi_nesting == 0 &&
> >             (atomic_read(&rdtp->dynticks) & 0x1))
> >                 return;
> > -       rdtp->dynticks_nmi_nesting++;
> > +       if (rdtp->dynticks_nmi_nesting++ != 0)
> > +               return; /* Nested NMI/IST/whatever. */
> >         smp_mb__before_atomic();  /* Force delay from prior write. */
> >         atomic_inc(&rdtp->dynticks);
> >         /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 23:38             ` Paul E. McKenney
@ 2014-11-22  2:00               ` Andy Lutomirski
  2014-11-22  4:20                 ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-22  2:00 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> > We currently pretend that IST context is like standard exception
>> >> >> > context, but this is incorrect.  IST entries from userspace are like
>> >> >> > standard exceptions except that they use per-cpu stacks, so they are
>> >> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
>> >> >> > perspective -- they are not quiescent states even if they
>> >> >> > interrupted the kernel during a quiescent state.
>> >> >> >
>> >> >> > Add and use ist_enter and ist_exit to track IST context.  Even
>> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
>> >> >> > way.
>> >> >>
>> >> >> I should add:
>> >> >>
>> >> >> I have no idea why RCU read-side critical sections are safe inside
>> >> >> __do_page_fault today.  It's guarded by exception_enter(), but that
>> >> >> doesn't do anything if context tracking is off, and context tracking
>> >> >> is usually off. What am I missing here?
>> >> >
>> >> > Ah!  There are three cases:
>> >> >
>> >> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
>> >> >         still paying attention to CPUs running in both userspace and in
>> >> >         the kernel.  So if a page fault happens, RCU will be set up to
>> >> >         notice any RCU read-side critical sections.
>> >> >
>> >> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
>> >> >         might well be ignoring userspace execution: NO_HZ_FULL and
>> >> >         all that.  However, as you pointed out, in this case the
>> >> >         context-tracking code lets RCU know that we have entered the
>> >> >         kernel, which means that RCU will again be paying attention to
>> >> >         RCU read-side critical sections.
>> >> >
>> >> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
>> >> >         if we take a page fault when context tracking is off, life
>> >> >         will be hard.  But the kernel is not supposed to take page
>> >> >         faults in the idle loop, so this is not a problem.
>> >>
>> >> I guess so, as long as there are really no page faults in the idle loop.
>> >
>> > As far as I know, there are not.  If there are, someone needs to let
>> > me know!  ;-)
>> >
>> >> There are, however, machine checks in the idle loop, and maybe kprobes
>> >> (haven't checked), so I think this patch might fix real bugs.
>> >
>> > If you can get ISTs from the idle loop, then the patch is needed.
>> >
>> >> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
>> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
>> >> > nested calls.
>> >>
>> >> Yes, and vice versa.  That code looked like it handled nesting
>> >> correctly, but I wasn't entirely sure.
>> >
>> > It currently does not, please see below patch.  Are you able to test
>> > nesting?  It would be really cool if you could do so -- I have no
>> > way to test this patch.
>>
>> I can try.  It's sort of easy -- I'll put an int3 into do_nmi and add
>> a fixup to avoid crashing.
>>
>> What should I look for?  Should I try to force full nohz on and assert
>> something?  I don't really know how to make full nohz work.
>
> You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
> rcu_nmi_exit() to fire.

No warning with or without your patch, maybe because all of those
returns skip the labels.

Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit.  Is
that okay?  Should those dynticks_nmi_nesting++ things be local_inc
and local_dec_and_test?

That dynticks_nmi_nesting thing seems scary to me.  Shouldn't the code
unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
unconditionally decrement it in rcu_nmi_exit?

--Andy

>
>                                                         Thanx, Paul
>
>> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
>> >> before exception_enter if context tracking is on and we came directly
>> >> from userspace?
>> >
>> > If I understand correctly, this will result in context tracking invoking
>> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
>> > odd value.  In that case, rcu_nmi_enter() will notice that RCU is already
>> > paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
>> > & 0x1), and will thus just return.  The matching rcu_nmi_exit() will
>> > notice that the nesting count is zero, and will also just return.
>> >
>> > Thus, everything works in that case.
>> >
>> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
>> > would see that RCU is not paying attention to this CPU and that the
>> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
>> > It would then atomically increment rdtp->dynticks, forcing RCU to start
>> > paying attention to this CPU.  The matching rcu_nmi_exit() will see
>> > that the nesting count was non-zero, but became zero when decremented.
>> > This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
>> > which will tell RCU to stop paying attention to this CPU.
>> >
>> >                                                         Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > rcu: Make rcu_nmi_enter() handle nesting
>> >
>> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
>> > viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter()
>> > and rcu_nmi_exit() must now correctly handle nesting.  As luck would
>> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
>> > This patch therefore makes rcu_nmi_enter() handle nesting.
>>
>> Thanks.  Should I add this to v5 of my series?
>>
>> --Andy
>>
>> >
>> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >
>> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> > index 8749f43f3f05..875421aff6e3 100644
>> > --- a/kernel/rcu/tree.c
>> > +++ b/kernel/rcu/tree.c
>> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
>> >         if (rdtp->dynticks_nmi_nesting == 0 &&
>> >             (atomic_read(&rdtp->dynticks) & 0x1))
>> >                 return;
>> > -       rdtp->dynticks_nmi_nesting++;
>> > +       if (rdtp->dynticks_nmi_nesting++ != 0)
>> > +               return; /* Nested NMI/IST/whatever. */
>> >         smp_mb__before_atomic();  /* Force delay from prior write. */
>> >         atomic_inc(&rdtp->dynticks);
>> >         /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
>> >
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-22  2:00               ` Andy Lutomirski
@ 2014-11-22  4:20                 ` Paul E. McKenney
  2014-11-22  5:53                   ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-22  4:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> >> > We currently pretend that IST context is like standard exception
> >> >> >> > context, but this is incorrect.  IST entries from userspace are like
> >> >> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> >> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
> >> >> >> > perspective -- they are not quiescent states even if they
> >> >> >> > interrupted the kernel during a quiescent state.
> >> >> >> >
> >> >> >> > Add and use ist_enter and ist_exit to track IST context.  Even
> >> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> >> >> > way.
> >> >> >>
> >> >> >> I should add:
> >> >> >>
> >> >> >> I have no idea why RCU read-side critical sections are safe inside
> >> >> >> __do_page_fault today.  It's guarded by exception_enter(), but that
> >> >> >> doesn't do anything if context tracking is off, and context tracking
> >> >> >> is usually off. What am I missing here?
> >> >> >
> >> >> > Ah!  There are three cases:
> >> >> >
> >> >> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
> >> >> >         still paying attention to CPUs running in both userspace and in
> >> >> >         the kernel.  So if a page fault happens, RCU will be set up to
> >> >> >         notice any RCU read-side critical sections.
> >> >> >
> >> >> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
> >> >> >         might well be ignoring userspace execution: NO_HZ_FULL and
> >> >> >         all that.  However, as you pointed out, in this case the
> >> >> >         context-tracking code lets RCU know that we have entered the
> >> >> >         kernel, which means that RCU will again be paying attention to
> >> >> >         RCU read-side critical sections.
> >> >> >
> >> >> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
> >> >> >         if we take a page fault when context tracking is off, life
> >> >> >         will be hard.  But the kernel is not supposed to take page
> >> >> >         faults in the idle loop, so this is not a problem.
> >> >>
> >> >> I guess so, as long as there are really no page faults in the idle loop.
> >> >
> >> > As far as I know, there are not.  If there are, someone needs to let
> >> > me know!  ;-)
> >> >
> >> >> There are, however, machine checks in the idle loop, and maybe kprobes
> >> >> (haven't checked), so I think this patch might fix real bugs.
> >> >
> >> > If you can get ISTs from the idle loop, then the patch is needed.
> >> >
> >> >> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
> >> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> >> >> > nested calls.
> >> >>
> >> >> Yes, and vice versa.  That code looked like it handled nesting
> >> >> correctly, but I wasn't entirely sure.
> >> >
> >> > It currently does not, please see below patch.  Are you able to test
> >> > nesting?  It would be really cool if you could do so -- I have no
> >> > way to test this patch.
> >>
> >> I can try.  It's sort of easy -- I'll put an int3 into do_nmi and add
> >> a fixup to avoid crashing.
> >>
> >> What should I look for?  Should I try to force full nohz on and assert
> >> something?  I don't really know how to make full nohz work.
> >
> > You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
> > rcu_nmi_exit() to fire.
> 
> No warning with or without your patch, maybe because all of those
> returns skip the labels.

I will be guardedly optimistic and take this as a good sign.  ;-)

> Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit.  Is
> that okay?  Should those dynticks_nmi_nesting++ things be local_inc
> and local_dec_and_test?

Yep, it is OK during rcu_nmi_enter() or rcu_nmi_exit().  The nested
NMI will put the dynticks_nmi_nesting counter back where it was, so
no chance of confusion.

> That dynticks_nmi_nesting thing seems scary to me.  Shouldn't the code
> unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
> unconditionally decrement it in rcu_nmi_exit?

You might be able to get that to work, but the reason it is not done
that way is because we might get an NMI while not in dyntick-idle
state.  In that case, it would be very bad to atomically increment
rcu_dynticks, because that would tell RCU to ignore the CPU while it
was in the NMI handler, which is the opposite of what we want.

But what did you have in mind?
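The point about non-idle NMIs can be shown in a tiny model (not from the thread; plain ints stand in for the kernel's atomic_t, and the names are purely illustrative): if dynticks is already odd because the CPU is in process context, an unconditional increment would flip it even and tell RCU to ignore the CPU for the duration of the handler.

```c
#include <assert.h>

/* Illustrative sketch only: "dynticks" models rdtp->dynticks.
 * Odd value => RCU is watching this CPU; even => RCU ignores it. */
static int dynticks = 1;	/* odd: non-idle, RCU watching */

static int rcu_watching(void)
{
	return dynticks & 1;
}

/* The naive scheme Paul is warning against: always increment. */
static void bad_unconditional_nmi_enter(void)
{
	dynticks++;	/* flips odd -> even: RCU now ignores the CPU! */
}
```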

							Thanx, Paul

> --Andy
> 
> >
> >                                                         Thanx, Paul
> >
> >> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
> >> >> before exception_enter if context tracking is on and we came directly
> >> >> from userspace?
> >> >
> >> > If I understand correctly, this will result in context tracking invoking
> >> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
> >> > odd value.  In that case, rcu_nmi_enter() will notice that RCU is already
> >> > paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
> >> > & 0x1), and will thus just return.  The matching rcu_nmi_exit() will
> >> > notice that the nesting count is zero, and will also just return.
> >> >
> >> > Thus, everything works in that case.
> >> >
> >> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
> >> > would see that RCU is not paying attention to this CPU and that the
> >> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> >> > It would then atomically increment rdtp->dynticks, forcing RCU to start
> >> > paying attention to this CPU.  The matching rcu_nmi_exit() will see
> >> > that the nesting count was non-zero, but became zero when decremented.
> >> > This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
> >> > which will tell RCU to stop paying attention to this CPU.
> >> >
> >> >                                                         Thanx, Paul
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > rcu: Make rcu_nmi_enter() handle nesting
> >> >
> >> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
> >> > viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter()
> >> > and rcu_nmi_exit() must now correctly handle nesting.  As luck would
> >> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
> >> > This patch therefore makes rcu_nmi_enter() handle nesting.
> >>
> >> Thanks.  Should I add this to v5 of my series?
> >>
> >> --Andy
> >>
> >> >
> >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >> >
> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> > index 8749f43f3f05..875421aff6e3 100644
> >> > --- a/kernel/rcu/tree.c
> >> > +++ b/kernel/rcu/tree.c
> >> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
> >> >         if (rdtp->dynticks_nmi_nesting == 0 &&
> >> >             (atomic_read(&rdtp->dynticks) & 0x1))
> >> >                 return;
> >> > -       rdtp->dynticks_nmi_nesting++;
> >> > +       if (rdtp->dynticks_nmi_nesting++ != 0)
> >> > +               return; /* Nested NMI/IST/whatever. */
> >> >         smp_mb__before_atomic();  /* Force delay from prior write. */
> >> >         atomic_inc(&rdtp->dynticks);
> >> >         /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> >> >
> >>
> >>
> >>
> >> --
> >> Andy Lutomirski
> >> AMA Capital Management, LLC
> >>
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-22  4:20                 ` Paul E. McKenney
@ 2014-11-22  5:53                   ` Andy Lutomirski
  2014-11-22 23:41                     ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-22  5:53 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
>> >> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
>> >> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
>> >> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> >> > We currently pretend that IST context is like standard exception
>> >> >> >> > context, but this is incorrect.  IST entries from userspace are like
>> >> >> >> > standard exceptions except that they use per-cpu stacks, so they are
>> >> >> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
>> >> >> >> > perspective -- they are not quiescent states even if they
>> >> >> >> > interrupted the kernel during a quiescent state.
>> >> >> >> >
>> >> >> >> > Add and use ist_enter and ist_exit to track IST context.  Even
>> >> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
>> >> >> >> > way.
>> >> >> >>
>> >> >> >> I should add:
>> >> >> >>
>> >> >> >> I have no idea why RCU read-side critical sections are safe inside
>> >> >> >> __do_page_fault today.  It's guarded by exception_enter(), but that
>> >> >> >> doesn't do anything if context tracking is off, and context tracking
>> >> >> >> is usually off. What am I missing here?
>> >> >> >
>> >> >> > Ah!  There are three cases:
>> >> >> >
>> >> >> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
>> >> >> >         still paying attention to CPUs running in both userspace and in
>> >> >> >         the kernel.  So if a page fault happens, RCU will be set up to
>> >> >> >         notice any RCU read-side critical sections.
>> >> >> >
>> >> >> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
>> >> >> >         might well be ignoring userspace execution: NO_HZ_FULL and
>> >> >> >         all that.  However, as you pointed out, in this case the
>> >> >> >         context-tracking code lets RCU know that we have entered the
>> >> >> >         kernel, which means that RCU will again be paying attention to
>> >> >> >         RCU read-side critical sections.
>> >> >> >
>> >> >> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
>> >> >> >         if we take a page fault when context tracking is off, life
>> >> >> >         will be hard.  But the kernel is not supposed to take page
>> >> >> >         faults in the idle loop, so this is not a problem.
>> >> >>
>> >> >> I guess so, as long as there are really no page faults in the idle loop.
>> >> >
>> >> > As far as I know, there are not.  If there are, someone needs to let
>> >> > me know!  ;-)
>> >> >
>> >> >> There are, however, machine checks in the idle loop, and maybe kprobes
>> >> >> (haven't checked), so I think this patch might fix real bugs.
>> >> >
>> >> > If you can get ISTs from the idle loop, then the patch is needed.
>> >> >
>> >> >> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
>> >> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
>> >> >> > nested calls.
>> >> >>
>> >> >> Yes, and vice versa.  That code looked like it handled nesting
>> >> >> correctly, but I wasn't entirely sure.
>> >> >
>> >> > It currently does not, please see below patch.  Are you able to test
>> >> > nesting?  It would be really cool if you could do so -- I have no
>> >> > way to test this patch.
>> >>
>> >> I can try.  It's sort of easy -- I'll put an int3 into do_nmi and add
>> >> a fixup to avoid crashing.
>> >>
>> >> What should I look for?  Should I try to force full nohz on and assert
>> >> something?  I don't really know how to make full nohz work.
>> >
>> > You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
>> > rcu_nmi_exit() to fire.
>>
>> No warning with or without your patch, maybe because all of those
>> returns skip the labels.
>
> I will be guardedly optimistic and take this as a good sign.  ;-)
>
>> Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit.  Is
>> that okay?  Should those dynticks_nmi_nesting++ things be local_inc
>> and local_dec_and_test?
>
> Yep, it is OK during rcu_nmi_enter() or rcu_nmi_exit().  The nested
> NMI will put the dynticks_nmi_nesting counter back where it was, so
> no chance of confusion.
>

That sounds like it's making a scary assumption about the code
generated by the ++ operator.

>> That dynticks_nmi_nesting thing seems scary to me.  Shouldn't the code
>> unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
>> unconditionally decrement it in rcu_nmi_exit?
>
> You might be able to get that to work, but the reason it is not done
> that way is because we might get an NMI while not in dyntick-idle
> state.  In that case, it would be very bad to atomically increment
> rcu_dynticks, because that would tell RCU to ignore the CPU while it
> was in the NMI handler, which is the opposite of what we want.
>
> But what did you have in mind?

If you're willing to have rcu_nmi_enter return a value and pass it
back to rcu_nmi_exit, then it could be like:

int rcu_nmi_enter()
{
  if (rdtp->dynticks & 1) {
    return 0;
  } else {
    rdtp->dynticks++;
    return 1;
  }
}

but wrapped in a cmpxchg loop to make it atomic, and

void rcu_nmi_exit(int state)
{
  if (state)
   rdtp->dynticks++;
}
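A self-contained user-space sketch of that cmpxchg loop, with C11 atomics standing in for the kernel's atomic_t (the `_model` names are made up for illustration, not the real kernel API):

```c
#include <assert.h>
#include <stdatomic.h>

/* "dynticks" models rdtp->dynticks; odd => RCU is watching this CPU. */
static atomic_int dynticks;

/* Returns 1 if we made dynticks odd (caller must undo), 0 otherwise. */
static int rcu_nmi_enter_model(void)
{
	int old, new;

	do {
		old = atomic_load(&dynticks);
		if (old & 1)
			return 0;	/* RCU already watching; nothing to undo */
		new = old + 1;		/* make it odd: start watching */
	} while (!atomic_compare_exchange_weak(&dynticks, &old, new));
	return 1;
}

static void rcu_nmi_exit_model(int state)
{
	if (state)
		atomic_fetch_add(&dynticks, 1);	/* back to even: stop watching */
}
```

Nesting then falls out naturally: an inner enter sees dynticks odd, returns 0, and its matching exit is a no-op.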

Otherwise, I think that there may need to be enough state somewhere so
that the outermost nested rcu_nmi_enter knows whether to increment
dynticks.  For example, dynticks_nmi_nesting could store the nesting
count * 2 - (1 if the outermost nested user needs to increment
dynticks).  Something like:

void rcu_nmi_enter(void)
{
  /* Be very careful -- this function may be called reentrantly on the
same CPU. */
  atomically: increment dynticks if it's even.

  /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
   * the state. */

  local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));

  WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
}

void rcu_nmi_exit(void)
{
  WARN_ON(!(dynticks & 1));
  locally atomically: dynticks_nmi_nesting -= 2, unless
dynticks_nmi_nesting == 1, in which case set it to zero

  if (dynticks_nmi_nesting was 1)
    atomic_inc(&dynticks);
}

The invariant here is that, for a single unnested enter/exit, if
dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
!= 0 *or* dynticks is odd will have no net effect, so the invariant,
in fact, holds for all invocations, nested or otherwise.

At least one of those conditions is true at all times during the
execution of outermost pair, starting with the first atomic operation
and ending with the final atomic_inc.  So they nest properly no matter
what else happens (unless, of course, someone else pokes dynticks in
the middle).

Thoughts?
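For concreteness, the counting scheme above can be written out as a single-threaded model (plain ints replace the atomic/local ops discussed here, so this models only the logic, not the concurrency; all names are illustrative):

```c
#include <assert.h>

static int dynticks;			/* odd => RCU is watching this CPU */
static int dynticks_nmi_nesting;	/* nesting*2 - (1 if we made dynticks odd) */

static void model_nmi_enter(void)
{
	int incremented = 0;

	if (!(dynticks & 1)) {		/* atomically, in real code */
		dynticks++;		/* make odd: start watching */
		incremented = 1;
	}
	/* the local_inc in the sketch: +1 if we incremented, else +2 */
	dynticks_nmi_nesting += incremented ? 1 : 2;
	/* the WARN_ON: if we incremented, nesting must have been zero */
	assert(!(incremented && dynticks_nmi_nesting != 1));
}

static void model_nmi_exit(void)
{
	assert(dynticks & 1);		/* invariant: watching while nested */
	if (dynticks_nmi_nesting == 1) {
		dynticks_nmi_nesting = 0;
		dynticks++;		/* even again: stop watching */
	} else {
		dynticks_nmi_nesting -= 2;
	}
}
```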

>
>                                                         Thanx, Paul
>
>> --Andy
>>
>> >
>> >                                                         Thanx, Paul
>> >
>> >> >> Also, just to make sure: are we okay if rcu_nmi_enter() is called
>> >> >> before exception_enter if context tracking is on and we came directly
>> >> >> from userspace?
>> >> >
>> >> > If I understand correctly, this will result in context tracking invoking
>> >> > rcu_user_enter(), which will result in the rcu_dynticks counter having an
>> >> > odd value.  In that case, rcu_nmi_enter() will notice that RCU is already
> >> >> > paying attention to this CPU via its check of (atomic_read(&rdtp->dynticks)
>> >> > & 0x1), and will thus just return.  The matching rcu_nmi_exit() will
>> >> > notice that the nesting count is zero, and will also just return.
>> >> >
>> >> > Thus, everything works in that case.
>> >> >
>> >> > In contrast, if rcu_nmi_enter() was invoked from the idle loop, it
>> >> > would see that RCU is not paying attention to this CPU and that the
>> >> > NMI nesting depth (which rcu_nmi_enter() increments) used to be zero.
> >> >> > It would then atomically increment rdtp->dynticks, forcing RCU to start
>> >> > paying attention to this CPU.  The matching rcu_nmi_exit() will see
>> >> > that the nesting count was non-zero, but became zero when decremented.
> >> >> > This will cause rcu_nmi_exit() to atomically increment rdtp->dynticks,
>> >> > which will tell RCU to stop paying attention to this CPU.
>> >> >
>> >> >                                                         Thanx, Paul
>> >> >
>> >> > ------------------------------------------------------------------------
>> >> >
>> >> > rcu: Make rcu_nmi_enter() handle nesting
>> >> >
>> >> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
>> >> > viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter()
>> >> > and rcu_nmi_exit() must now correctly handle nesting.  As luck would
>> >> > have it, rcu_nmi_exit() handles nesting but rcu_nmi_enter() does not.
>> >> > This patch therefore makes rcu_nmi_enter() handle nesting.
>> >>
>> >> Thanks.  Should I add this to v5 of my series?
>> >>
>> >> --Andy
>> >>
>> >> >
>> >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> >> >
>> >> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> >> > index 8749f43f3f05..875421aff6e3 100644
>> >> > --- a/kernel/rcu/tree.c
>> >> > +++ b/kernel/rcu/tree.c
>> >> > @@ -770,7 +770,8 @@ void rcu_nmi_enter(void)
>> >> >         if (rdtp->dynticks_nmi_nesting == 0 &&
>> >> >             (atomic_read(&rdtp->dynticks) & 0x1))
>> >> >                 return;
>> >> > -       rdtp->dynticks_nmi_nesting++;
>> >> > +       if (rdtp->dynticks_nmi_nesting++ != 0)
>> >> > +               return; /* Nested NMI/IST/whatever. */
>> >> >         smp_mb__before_atomic();  /* Force delay from prior write. */
>> >> >         atomic_inc(&rdtp->dynticks);
>> >> >         /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Andy Lutomirski
>> >> AMA Capital Management, LLC
>> >>
>> >
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC
>>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  2014-11-21 21:26 ` [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME Andy Lutomirski
@ 2014-11-22 16:55   ` Borislav Petkov
  2014-11-24 17:58     ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Borislav Petkov @ 2014-11-22 16:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linus Torvalds, linux-kernel, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Fri, Nov 21, 2014 at 01:26:07PM -0800, Andy Lutomirski wrote:
> x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set
> but not on non-paranoid returns.  I suspect that this is a mistake
> and that the code only works because int3 is paranoid.
> 
> Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
> workaround for the x86 bug.  With that bug fixed, we can remove
> _TIF_NOTIFY_RESUME from the uprobes code.
> 
> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Reported-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
  2014-11-21 21:32   ` Andy Lutomirski
  2014-11-21 22:00   ` Paul E. McKenney
@ 2014-11-22 17:20   ` Borislav Petkov
  2014-11-24 19:48     ` Andy Lutomirski
  2015-01-22 21:52   ` Sasha Levin
  3 siblings, 1 reply; 72+ messages in thread
From: Borislav Petkov @ 2014-11-22 17:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linus Torvalds, linux-kernel, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen, Paul E. McKenney, Josh Triplett,
	Frédéric Weisbecker

On Fri, Nov 21, 2014 at 01:26:08PM -0800, Andy Lutomirski wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect.  IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic.  IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
> 
> Add and use ist_enter and ist_exit to track IST context.  Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.
> 
> This fixes two issues:
> 
>  - Scheduling from an IST interrupt handler will now warn.  It would
>    previously appear to work as long as we got lucky and nothing
>    overwrote the stack frame.  (I don't know of any bugs in this
>    that would trigger the warning, but it's good to be on the safe
>    side.)
> 
>  - RCU handling in IST context was dangerous.  As far as I know,
>    only machine checks were likely to trigger this, but it's good to
>    be on the safe side.
> 
> Note that the machine check handlers appear to have been missing
> any context tracking at all before this patch.
> 
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Frédéric Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/include/asm/traps.h         |  4 +++
>  arch/x86/kernel/cpu/mcheck/mce.c     |  5 ++++
>  arch/x86/kernel/cpu/mcheck/p5.c      |  6 +++++
>  arch/x86/kernel/cpu/mcheck/winchip.c |  5 ++++
>  arch/x86/kernel/traps.c              | 49 ++++++++++++++++++++++++++++++------
>  5 files changed, 61 insertions(+), 8 deletions(-)

...

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 0d0e922fafc1..f5c4b8813774 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
>  	preempt_count_dec();
>  }
>  
> +enum ctx_state ist_enter(struct pt_regs *regs)
> +{
> +	/*
> +	 * We are atomic because we're on the IST stack (or we're on x86_32,
> +	 * in which case we still shouldn't schedule).
> +	 */
> +	preempt_count_add(HARDIRQ_OFFSET);
> +
> +	if (user_mode_vm(regs)) {
> +		/* Other than that, we're just an exception. */
> +		return exception_enter();
> +	} else {
> +		/*
> +		 * We might have interrupted pretty much anything.  In
> +		 * fact, if we're a machine check, we can even interrupt
> +		 * NMI processing.  We don't want in_nmi() to return true,
> +		 * but we need to notify RCU.
> +		 */
> +		rcu_nmi_enter();
> +		return IN_KERNEL;  /* the value is irrelevant. */
> +	}

I guess dropping the explicit else-branch could make it look a bit nicer
with the curly braces gone and all...

enum ctx_state ist_enter(struct pt_regs *regs)
{
        /*
         * We are atomic because we're on the IST stack (or we're on x86_32,
         * in which case we still shouldn't schedule).
         */
        preempt_count_add(HARDIRQ_OFFSET);

        if (user_mode_vm(regs))
                /* Other than that, we're just an exception. */
                return exception_enter();

        /*
         * We might have interrupted pretty much anything.  In fact, if we're a
         * machine check, we can even interrupt NMI processing.  We don't want
         * in_nmi() to return true, but we need to notify RCU.
         */
        rcu_nmi_enter();
        return IN_KERNEL;  /* the value is irrelevant. */
} 

> +}
> +
> +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
> +{
> +	preempt_count_sub(HARDIRQ_OFFSET);
> +
> +	if (user_mode_vm(regs))
> +		return exception_exit(prev_state);
> +	else
> +		rcu_nmi_exit();
> +}

Ditto here.

> +
>  static nokprobe_inline int
>  do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
>  		  struct pt_regs *regs,	long error_code)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-22  5:53                   ` Andy Lutomirski
@ 2014-11-22 23:41                     ` Paul E. McKenney
  2014-11-24 20:22                       ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-22 23:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, Nov 21, 2014 at 03:06:48PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 2:55 PM, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Fri, Nov 21, 2014 at 02:19:17PM -0800, Andy Lutomirski wrote:
> >> >> >> On Fri, Nov 21, 2014 at 2:07 PM, Paul E. McKenney
> >> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> >> > On Fri, Nov 21, 2014 at 01:32:50PM -0800, Andy Lutomirski wrote:
> >> >> >> >> On Fri, Nov 21, 2014 at 1:26 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >> >> >> > We currently pretend that IST context is like standard exception
> >> >> >> >> > context, but this is incorrect.  IST entries from userspace are like
> >> >> >> >> > standard exceptions except that they use per-cpu stacks, so they are
> >> >> >> >> > atomic.  IST entries from kernel space are like NMIs from RCU's
> >> >> >> >> > perspective -- they are not quiescent states even if they
> >> >> >> >> > interrupted the kernel during a quiescent state.
> >> >> >> >> >
> >> >> >> >> > Add and use ist_enter and ist_exit to track IST context.  Even
> >> >> >> >> > though x86_32 has no IST stacks, we track these interrupts the same
> >> >> >> >> > way.
> >> >> >> >>
> >> >> >> >> I should add:
> >> >> >> >>
> >> >> >> >> I have no idea why RCU read-side critical sections are safe inside
> >> >> >> >> __do_page_fault today.  It's guarded by exception_enter(), but that
> >> >> >> >> doesn't do anything if context tracking is off, and context tracking
> >> >> >> >> is usually off. What am I missing here?
> >> >> >> >
> >> >> >> > Ah!  There are three cases:
> >> >> >> >
> >> >> >> > 1.      Context tracking is off on a non-idle CPU.  In this case, RCU is
> >> >> >> >         still paying attention to CPUs running in both userspace and in
> >> >> >> >         the kernel.  So if a page fault happens, RCU will be set up to
> >> >> >> >         notice any RCU read-side critical sections.
> >> >> >> >
> >> >> >> > 2.      Context tracking is on on a non-idle CPU.  In this case, RCU
> >> >> >> >         might well be ignoring userspace execution: NO_HZ_FULL and
> >> >> >> >         all that.  However, as you pointed out, in this case the
> >> >> >> >         context-tracking code lets RCU know that we have entered the
> >> >> >> >         kernel, which means that RCU will again be paying attention to
> >> >> >> >         RCU read-side critical sections.
> >> >> >> >
> >> >> >> > 3.      The CPU is idle.  In this case, RCU is ignoring the CPU, so
> >> >> >> >         if we take a page fault when context tracking is off, life
> >> >> >> >         will be hard.  But the kernel is not supposed to take page
> >> >> >> >         faults in the idle loop, so this is not a problem.
> >> >> >>
> >> >> >> I guess so, as long as there are really no page faults in the idle loop.
> >> >> >
> >> >> > As far as I know, there are not.  If there are, someone needs to let
> >> >> > me know!  ;-)
> >> >> >
> >> >> >> There are, however, machine checks in the idle loop, and maybe kprobes
> >> >> >> (haven't checked), so I think this patch might fix real bugs.
> >> >> >
> >> >> > If you can get ISTs from the idle loop, then the patch is needed.
> >> >> >
> >> >> >> > Just out of curiosity...  Can an NMI occur in IST context?  If it can,
> >> >> >> > I need to make rcu_nmi_enter() and rcu_nmi_exit() deal properly with
> >> >> >> > nested calls.
> >> >> >>
> >> >> >> Yes, and vice versa.  That code looked like it handled nesting
> >> >> >> correctly, but I wasn't entirely sure.
> >> >> >
> >> >> > It currently does not, please see below patch.  Are you able to test
> >> >> > nesting?  It would be really cool if you could do so -- I have no
> >> >> > way to test this patch.
> >> >>
> >> >> I can try.  It's sort of easy -- I'll put an int3 into do_nmi and add
> >> >> a fixup to avoid crashing.
> >> >>
> >> >> What should I look for?  Should I try to force full nohz on and assert
> >> >> something?  I don't really know how to make full nohz work.
> >> >
> >> > You should look for the WARN_ON_ONCE() calls in rcu_nmi_enter() and
> >> > rcu_nmi_exit() to fire.
> >>
> >> No warning with or without your patch, maybe because all of those
> >> returns skip the labels.
> >
> > I will be guardedly optimistic and take this as a good sign.  ;-)
> >
> >> Also, an NMI can happen *during* rcu_nmi_enter or rcu_nmi_exit.  Is
> >> that okay?  Should those dynticks_nmi_nesting++ things be local_inc
> >> and local_dec_and_test?
> >
> > Yep, it is OK during rcu_nmi_enter() or rcu_nmi_exit().  The nested
> > NMI will put the dynticks_nmi_nesting counter back where it was, so
> > no chance of confusion.
> 
> That sounds like it's making a scary assumption about the code
> generated by the ++ operator.

Yeah, well, local_inc becomes the ++ operator on many systems.  ;-)

I could perhaps be convinced to use ACCESS_ONCE(), will give it some
thought.

> >> That dynticks_nmi_nesting thing seems scary to me.  Shouldn't the code
> >> unconditionally increment dynticks_nmi_nesting in rcu_nmi_enter and
> >> unconditionally decrement it in rcu_nmi_exit?
> >
> > You might be able to get that to work, but the reason it is not done
> > that way is because we might get an NMI while not in dyntick-idle
> > state.  In that case, it would be very bad to atomically increment
> > rcu_dynticks, because that would tell RCU to ignore the CPU while it
> > was in the NMI handler, which is the opposite of what we want.
> >
> > But what did you have in mind?
> 
> If you're willing to have rcu_nmi_enter return a value and pass it
> back to rcu_nmi_exit, then it could be like:
> 
> int rcu_nmi_enter()
> {
>   if (rdtp->dynticks & 1) {
>     return 0;
>   } else {
>     rdtp->dynticks++;
>     return 1;
>   }
> }
> 
> but wrapped in a cmpxchg loop to make it atomic, and
> 
> void rcu_nmi_exit(int state)
> {
>   if (state)
>    rdtp->dynticks++;
> }

Returning state sounds like a bad idea, if we can reasonably avoid it.

And I think I finally see what you are pointing out about my code: If
another NMI comes in between the time I increment ->dynticks_nmi_nesting
and the time I atomically increment ->dynticks, the nested NMI handler
will incorrectly believe that RCU is already paying attention to this CPU.
Which would indeed not be at all good, so good catch!!!

> Otherwise, I think that there may need to be enough state somewhere so
> that the outermost nested rcu_nmi_enter knows whether to increment
> dynticks.  For example, dynticks_nmi_nesting could store the nesting
> count * 2 - (1 if the outermost nested user needs to increment
> dynticks).  Something like:
> 
> void rcu_nmi_enter(void)
> {
>   /* Be very careful -- this function may be called reentrantly on the
> same CPU. */
>   atomically: increment dynticks if it's even.
> 
>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>    * the state. */
> 
>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> 
>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> }
> 
> void rcu_nmi_exit(void)
> {
>   WARN_ON(!(dynticks & 1));
>   locally atomically: dynticks_nmi_nesting -= 2, unless
> dynticks_nmi_nesting == 1, in which case set it to zero
> 
>   if (dynticks_nmi_nesting was 1)
>     atomic_inc(&dynticks);
> }
> 
> The invariant here is that, for a single unnested enter/exit, if
> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> in fact, holds for all invocations, nested or otherwise.
> 
> At least one of those conditions is true at all times during the
> execution of outermost pair, starting with the first atomic operation
> and ending with the final atomic_inc.  So they nest properly no matter
> what else happens (unless, of course, someone else pokes dynticks in
> the middle).
> 
> Thoughts?

Let's see...  The evenness of ->dynticks should be preserved by nested NMI
handlers, so the check and increment need not be atomic.  We don't have
any way (other than atomic operations) to do local atomic modifications
on all architectures, because we cannot mask NMIs.  (Yes, it can work
on x86, but this is common code that needs to work everywhere.)  On the
other hand, presumably NMIs are rare, so atomic modification of the NMI
nesting counter should be OK, at least if it proves absolutely necessary.
And I am thinking that a mechanical proof will be needed here.  :-/

But first, let me try generating the code and informally evaluating it:

	 1   struct rcu_dynticks {
	 2     long long dynticks_nesting;
	 3     int dynticks_nmi_nesting;
	 4     atomic_t dynticks;
	 5   };
	 6 
	 7   void rcu_nmi_enter(void)
	 8   {
	 9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
	10     int incby = 2;
	11 
	12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
	13       smp_mb__before_atomic();
	14       atomic_inc(&rdtp->dynticks);
	15       smp_mb__after_atomic();
	16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
	17       incby = 1;
	18     }
	19     rdtp->dynticks_nmi_nesting += incby;
	20     barrier();
	21   }
	22 
	23   void rcu_nmi_exit(void)
	24   {
	25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
	26 
	27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
	28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
	29     if (rdtp->dynticks_nmi_nesting != 1) {
	30       rdtp->dynticks_nmi_nesting -= 2;
	31       return;
	32     }
	33     rdtp->dynticks_nmi_nesting = 0;
	34     smp_mb__before_atomic();
	35     atomic_inc(&rdtp->dynticks);
	36     smp_mb__after_atomic();
	37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
	38   }

Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
assumes that we don't need to increment ->dynticks.

Line 12 checks to see if ->dynticks is even.  Note that this check is
stable: If there are nested NMIs, they will increment ->dynticks twice
or not at all, and either way preserves the evenness (to be proven, of
course, but that is the plan).  If ->dynticks is even, lines 13-15
atomically increment it, line 16 complains if still even, and line 17
says we will increment ->dynticks_nmi_nesting by only 1.

Either way, line 19 increments ->dynticks_nmi_nesting as needed and
line 20 keeps the compiler from getting too cute.

For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
structure.  Lines 27 and 28 complain bitterly if invariants are violated.
If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.

Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
atomically increment ->dynticks with full ordering, and line 37
complains bitterly if ->dynticks is not even.

So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
not nested, which is the correct thing for them to think in that case.
They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
to zero (adding and then subtracting 1).  If the NMI happens after the
atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
to zero (adding and subtracting two again).  If the NMI happens after
the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
time to one (again adding and subtracting two).

In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
one means we need to atomically increment ->dynticks, and other values
mean that we are partially or fully nested.  Reasoning proceeds as for
rcu_nmi_enter(), but in the opposite direction.

Whew!  That might even work.

But how about taking a different approach.  Assuming that there can
never be more than (say) 14 nesting NMI-like things, use the lower
four bits of ->dynticks to represent the NMI nesting and the upper
28 bits as the counter.  This of course requires modifying lots of
places in RCU that check the counter, but it is probably time to
abstract the check anyway.

This would allow my earlier attempted logic to work and (maybe) simplify
the reasoning a bit (and yes, the "magic" constants need macros):

	void rcu_nmi_enter(void)
	{
		struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
		int nesting = atomic_read(&rdtp->dynticks) & 0xf;
		int incby = 0x01;

		WARN_ON_ONCE(nesting == 0xf);
		if (nesting == 0) {
			if (atomic_read(&rdtp->dynticks) & 0x10)
				return;
			incby = 0x11;
		}
		smp_mb__before_atomic();
		atomic_add(incby, &rdtp->dynticks);
		smp_mb__after_atomic();
		WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
	}

	void rcu_nmi_exit(void)
	{
		struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
		int nesting = atomic_read(&rdtp->dynticks) & 0xf;
		int incby = 0x0f;

		if (nesting == 0)
			return;
		if (nesting > 1)
			incby = -1;
		smp_mb__before_atomic();
		atomic_add(incby, &rdtp->dynticks);
		smp_mb__after_atomic();
		WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
	}

Over to you!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 4/5] x86: Clean up current_stack_pointer
  2014-11-21 21:26 ` [PATCH v4 4/5] x86: Clean up current_stack_pointer Andy Lutomirski
@ 2014-11-24 11:39   ` Borislav Petkov
  0 siblings, 0 replies; 72+ messages in thread
From: Borislav Petkov @ 2014-11-24 11:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linus Torvalds, linux-kernel, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Fri, Nov 21, 2014 at 01:26:10PM -0800, Andy Lutomirski wrote:
> There's no good reason for it to be a macro, and x86_64 will want to
> use it, so it should be in a header.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic
  2014-11-21 21:26 ` [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic Andy Lutomirski
@ 2014-11-24 15:54   ` Borislav Petkov
  2014-11-24 19:52     ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Borislav Petkov @ 2014-11-24 15:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linus Torvalds, linux-kernel, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Fri, Nov 21, 2014 at 01:26:11PM -0800, Andy Lutomirski wrote:
> In some IST handlers, if the interrupt came from user mode,
> we can safely enable preemption.  Add helpers to do it safely.
> 
> This is intended to be used by the memory failure code in
> do_machine_check.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/include/asm/traps.h |  2 ++
>  arch/x86/kernel/traps.c      | 38 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index eb16a61bfd06..04ba537fc721 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -112,6 +112,8 @@ asmlinkage void mce_threshold_interrupt(void);
>  
>  extern enum ctx_state ist_enter(struct pt_regs *regs);
>  extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
> +extern void ist_begin_non_atomic(struct pt_regs *regs);
> +extern void ist_end_non_atomic(void);
>  
>  /* Interrupts/Exceptions */
>  enum {
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 6a02760df7b4..2b5f2e038e3f 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -140,6 +140,44 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
>  		rcu_nmi_exit();
>  }
>  
> +/**
> + * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
> + * @regs:	regs passed to the IST exception handler
> + *
> + * IST exception handlers normally cannot schedule.  As a special
> + * exception, if the exception interrupted userspace code (i.e.
> + * user_mode_vm(regs) would return true) and the exception was not
> + * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
> + * begins a non-atomic section within an ist_enter()/ist_exit() region.
> + * Callers are responsible for enabling interrupts themselves inside
> + * the non-atomic section, and callers must call ist_end_non_atomic()
> + * before ist_exit().
> + */

Ok, I guess this is fine, albeit a bit circumstantial:

I need to do

	ist_enter()
	ist_begin_non_atomic()

In here, I still need to do

	if (user_mode_vm()) {
		# do non-atomic stuff
	}

AFAICT, right?

	ist_end_non_atomic()
	ist_exit()

and this whole fun for the context tracking exception_enter/_exit()
calls.

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry from userspace
  2014-11-21 21:26 ` [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry " Andy Lutomirski
@ 2014-11-24 15:55   ` Borislav Petkov
  0 siblings, 0 replies; 72+ messages in thread
From: Borislav Petkov @ 2014-11-24 15:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, Linus Torvalds, linux-kernel, Peter Zijlstra, Oleg Nesterov,
	Tony Luck, Andi Kleen

On Fri, Nov 21, 2014 at 01:26:09PM -0800, Andy Lutomirski wrote:
> This causes all non-NMI, non-double-fault kernel entries from
> userspace to run on the normal kernel stack.  Double-fault is
> exempt to minimize confusion if we double-fault directly from
> userspace due to a bad kernel stack.
> 
> This is, surprisingly, simpler and shorter than the current code.  It
> removes the IMO rather frightening paranoid_userspace path, and it
> makes sync_regs much simpler.
> 
> There is no risk of stack overflow due to this change -- the kernel
> stack that we switch to is empty.
> 
> This will also enable us to create non-atomic sections within
> machine checks from userspace, which will simplify memory failure
> handling.  It will also allow the upcoming fsgsbase code to be
> simplified, because it doesn't need to worry about usergs when
> scheduling in paranoid_exit, as that code no longer exists.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  2014-11-22 16:55   ` Borislav Petkov
@ 2014-11-24 17:58     ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 17:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Oleg Nesterov, Tony Luck, Andi Kleen

On Sat, Nov 22, 2014 at 8:55 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 21, 2014 at 01:26:07PM -0800, Andy Lutomirski wrote:
>> x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set
>> but not on non-paranoid returns.  I suspect that this is a mistake
>> and that the code only works because int3 is paranoid.
>>
>> Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a
>> workaround for the x86 bug.  With that bug fixed, we can remove
>> _TIF_NOTIFY_RESUME from the uprobes code.
>>
>> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Reported-by: Oleg Nesterov <oleg@redhat.com>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>
> Acked-by: Borislav Petkov <bp@suse.de>
>

This one is now upstream, so I'm dropping it.

--Andy

> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-22 17:20   ` Borislav Petkov
@ 2014-11-24 19:48     ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 19:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Oleg Nesterov, Tony Luck, Andi Kleen, Paul E. McKenney,
	Josh Triplett, Frédéric Weisbecker

On Sat, Nov 22, 2014 at 9:20 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 21, 2014 at 01:26:08PM -0800, Andy Lutomirski wrote:
>> We currently pretend that IST context is like standard exception
>> context, but this is incorrect.  IST entries from userspace are like
>> standard exceptions except that they use per-cpu stacks, so they are
>> atomic.  IST entries from kernel space are like NMIs from RCU's
>> perspective -- they are not quiescent states even if they
>> interrupted the kernel during a quiescent state.
>>
>> Add and use ist_enter and ist_exit to track IST context.  Even
>> though x86_32 has no IST stacks, we track these interrupts the same
>> way.
>>
>> This fixes two issues:
>>
>>  - Scheduling from an IST interrupt handler will now warn.  It would
>>    previously appear to work as long as we got lucky and nothing
>>    overwrote the stack frame.  (I don't know of any bugs in this
>>    that would trigger the warning, but it's good to be on the safe
>>    side.)
>>
>>  - RCU handling in IST context was dangerous.  As far as I know,
>>    only machine checks were likely to trigger this, but it's good to
>>    be on the safe side.
>>
>> Note that the machine check handlers appear to have been missing
>> any context tracking at all before this patch.
>>
>> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
>> Cc: Josh Triplett <josh@joshtriplett.org>
>> Cc: Frédéric Weisbecker <fweisbec@gmail.com>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/include/asm/traps.h         |  4 +++
>>  arch/x86/kernel/cpu/mcheck/mce.c     |  5 ++++
>>  arch/x86/kernel/cpu/mcheck/p5.c      |  6 +++++
>>  arch/x86/kernel/cpu/mcheck/winchip.c |  5 ++++
>>  arch/x86/kernel/traps.c              | 49 ++++++++++++++++++++++++++++++------
>>  5 files changed, 61 insertions(+), 8 deletions(-)
>
> ...
>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 0d0e922fafc1..f5c4b8813774 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -107,6 +107,39 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
>>       preempt_count_dec();
>>  }
>>
>> +enum ctx_state ist_enter(struct pt_regs *regs)
>> +{
>> +     /*
>> +      * We are atomic because we're on the IST stack (or we're on x86_32,
>> +      * in which case we still shouldn't schedule).
>> +      */
>> +     preempt_count_add(HARDIRQ_OFFSET);
>> +
>> +     if (user_mode_vm(regs)) {
>> +             /* Other than that, we're just an exception. */
>> +             return exception_enter();
>> +     } else {
>> +             /*
>> +              * We might have interrupted pretty much anything.  In
>> +              * fact, if we're a machine check, we can even interrupt
>> +              * NMI processing.  We don't want in_nmi() to return true,
>> +              * but we need to notify RCU.
>> +              */
>> +             rcu_nmi_enter();
>> +             return IN_KERNEL;  /* the value is irrelevant. */
>> +     }
>
> I guess dropping the explicit else-branch could make it look a bit nicer
> with the curly braces gone and all...
>
> enum ctx_state ist_enter(struct pt_regs *regs)
> {
>         /*
>          * We are atomic because we're on the IST stack (or we're on x86_32,
>          * in which case we still shouldn't schedule).
>          */
>         preempt_count_add(HARDIRQ_OFFSET);
>
>         if (user_mode_vm(regs))
>                 /* Other than that, we're just an exception. */
>                 return exception_enter();
>

Two indented lines w/o curly braces make me think of goto fail; :-/

TBH, when there are clearly two options, I tend to prefer the braces
that make it very obvious what's going on.  I had some memorable bugs
several years ago that would have been impossible if I had used braces
more liberally.

--Andy

>         /*
>          * We might have interrupted pretty much anything.  In fact, if we're a
>          * machine check, we can even interrupt NMI processing.  We don't want
>          * in_nmi() to return true, but we need to notify RCU.
>          */
>         rcu_nmi_enter();
>         return IN_KERNEL;  /* the value is irrelevant. */
> }
>
>> +}
>> +
>> +void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
>> +{
>> +     preempt_count_sub(HARDIRQ_OFFSET);
>> +
>> +     if (user_mode_vm(regs))
>> +             return exception_exit(prev_state);
>> +     else
>> +             rcu_nmi_exit();
>> +}
>
> Ditto here.
>
>> +
>>  static nokprobe_inline int
>>  do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
>>                 struct pt_regs *regs, long error_code)
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic
  2014-11-24 15:54   ` Borislav Petkov
@ 2014-11-24 19:52     ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 19:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Oleg Nesterov, Tony Luck, Andi Kleen

On Mon, Nov 24, 2014 at 7:54 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 21, 2014 at 01:26:11PM -0800, Andy Lutomirski wrote:
>> In some IST handlers, if the interrupt came from user mode,
>> we can safely enable preemption.  Add helpers to do it safely.
>>
>> This is intended to be used by the memory failure code in
>> do_machine_check.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/include/asm/traps.h |  2 ++
>>  arch/x86/kernel/traps.c      | 38 ++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 40 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
>> index eb16a61bfd06..04ba537fc721 100644
>> --- a/arch/x86/include/asm/traps.h
>> +++ b/arch/x86/include/asm/traps.h
>> @@ -112,6 +112,8 @@ asmlinkage void mce_threshold_interrupt(void);
>>
>>  extern enum ctx_state ist_enter(struct pt_regs *regs);
>>  extern void ist_exit(struct pt_regs *regs, enum ctx_state prev_state);
>> +extern void ist_begin_non_atomic(struct pt_regs *regs);
>> +extern void ist_end_non_atomic(void);
>>
>>  /* Interrupts/Exceptions */
>>  enum {
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 6a02760df7b4..2b5f2e038e3f 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -140,6 +140,44 @@ void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
>>               rcu_nmi_exit();
>>  }
>>
>> +/**
>> + * ist_begin_non_atomic() - begin a non-atomic section in an IST exception
>> + * @regs:    regs passed to the IST exception handler
>> + *
>> + * IST exception handlers normally cannot schedule.  As a special
>> + * exception, if the exception interrupted userspace code (i.e.
>> + * user_mode_vm(regs) would return true) and the exception was not
>> + * a double fault, it can be safe to schedule.  ist_begin_non_atomic()
>> + * begins a non-atomic section within an ist_enter()/ist_exit() region.
>> + * Callers are responsible for enabling interrupts themselves inside
>> + * the non-atomic section, and callers must call ist_end_non_atomic()
>> + * before ist_exit().
>> + */
>
> Ok, I guess this is fine, albeit a bit circumstantial:
>
> I need to do
>
>         ist_enter()
>         ist_begin_non_atomic()
>
> In here, I still need to do
>
>         if (user_mode_vm()) {
>                 # do non-atomic stuff
>         }
>
> AFAICT, right?
>
>         ist_end_non_atomic()
>         ist_exit()
>
> and this whole fun for the context tracking exception_enter/_exit()
> calls.

Not quite.  This fun stuff is to satisfy Linus' request to make the
atomicness of the IST entries unconditional instead of depending on
the weird user_mode_vm thing and to make the escape to non-atomicness
be clearly a special case.

So your code above should OOPS instead of working.  The code should be:

ist_enter()

if (user_mode_vm()) {
  ist_begin_non_atomic();
  local_irq_enable();
  # do non-atomic stuff
  local_irq_disable();
  ist_end_non_atomic();
}

ist_exit().

Although that user_mode_vm() check might be subsumed by the mce_severity stuff.

NB: This series is going to look a bit different once I rebase it.
This part is the same, but Linus merged some patches that change some
of the same code over the weekend.

--Andy

>
> Acked-by: Borislav Petkov <bp@suse.de>
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-22 23:41                     ` Paul E. McKenney
@ 2014-11-24 20:22                       ` Andy Lutomirski
  2014-11-24 20:54                         ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 20:22 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:

> Returning state sounds like a bad idea, if we can reasonably avoid it.

I agree, except that we already do it for exception_enter(), etc.  But
yes, changing fewer things is nice.

>
> And I think I finally see what you are pointing out about my code: If
> another NMI comes in between the time I increment ->dynticks_nmi_nesting
> and the time I atomically increment ->dynticks, the nested NMI handler
> will incorrectly believe that RCU is already paying attention to this CPU.
> Which would indeed not be at all good, so good catch!!!
>
>> Otherwise, I think that there may need to be enough state somewhere so
>> that the outermost nested rcu_nmi_enter knows whether to increment
>> dynticks.  For example, dynticks_nmi_nesting could store the nesting
>> count * 2 - (1 if the outermost nested user needs to increment
>> dynticks).  Something like:
>>
>> void rcu_nmi_enter(void)
>> {
>>   /* Be very careful -- this function may be called reentrantly on the
>> same CPU. */
>>   atomically: increment dynticks if it's even.
>>
>>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>>    * the state. */
>>
>>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>>
>>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
>> }
>>
>> void rcu_nmi_exit(void)
>> {
>>   WARN_ON(!(dynticks & 1));
>>   locally atomically: dynticks_nmi_nesting -= 2, unless
>> dynticks_nmi_nesting == 1, in which case set it to zero
>>
>>   if (dynticks_nmi_nesting was 1)
>>     atomic_inc(&dynticks);
>> }
>>
>> The invariant here is that, for a single unnested enter/exit, if
>> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
>> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
>> != 0 *or* dynticks is odd will have no net effect, so the invariant,
>> in fact, holds for all invocations, nested or otherwise.
>>
>> At least one of those conditions is true at all times during the
>> execution of outermost pair, starting with the first atomic operation
>> and ending with the final atomic_inc.  So they nest properly no matter
>> what else happens (unless, of course, someone else pokes dynticks in
>> the middle).
>>
>> Thoughts?
>
> Let's see...  The evenness of ->dynticks should be preserved by nested NMI
> handlers, so the check and increment need not be atomic.  We don't have
> any way (other than atomic operations) to do local atomic modifications
> on all architectures, because we cannot mask NMIs.  (Yes, it can work
> on x86, but this is common code that needs to work everywhere.)  On the
> other hand, presumably NMIs are rare, so atomic modification of the NMI
> nesting counter should be OK, at least if it proves absolutely necessary.
> And I am thinking that a mechanical proof will be needed here.  :-/
>
> But first, let me try generating the code and informally evaluating it:
>
>          1   struct rcu_dynticks {
>          2     long long dynticks_nesting;
>          3     int dynticks_nmi_nesting;
>          4     atomic_t dynticks;
>          5   };
>          6
>          7   void rcu_nmi_enter(void)
>          8   {
>          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>         10     int incby = 2;
>         11
>         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
>         13       smp_mb__before_atomic();
>         14       atomic_inc(&rdtp->dynticks);
>         15       smp_mb__after_atomic();
>         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>         17       incby = 1;

WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?

>         18     }
>         19     rdtp->dynticks_nmi_nesting += incby;

Oh, I see why you don't need local_add -- it's because an nmi in the
middle of this increment won't have any effect on the interrupted
code, so even a software RMW will be okay.

>         20     barrier();
>         21   }
>         22
>         23   void rcu_nmi_exit(void)
>         24   {
>         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>         26
>         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
>         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>         29     if (rdtp->dynticks_nmi_nesting != 1) {

WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?

>         30       rdtp->dynticks_nmi_nesting -= 2;
>         31       return;
>         32     }
>         33     rdtp->dynticks_nmi_nesting = 0;
>         34     smp_mb__before_atomic();

This implies barrier(), right?

>         35     atomic_inc(&rdtp->dynticks);
>         36     smp_mb__after_atomic();
>         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>         38   }
>
> Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> assumes that we don't need to increment ->dynticks.
>
> Line 12 checks to see if ->dynticks is even.  Note that this check is
> stable: If there are nested NMIs, they will increment ->dynticks twice
> or not at all, and either way preserves the evenness (to be proven, of
> course, but that is the plan).  If ->dynticks is even, lines 13-15
> atomically increment it, line 16 complains if still even, and line 17
> says we will increment ->dynticks_nmi_nesting by only 1.
>
> Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> line 20 keeps the compiler from getting too cute.
>
> For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> structure.  Lines 27 and 28 complain bitterly if invariants are violated.
> If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
>
> Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> atomically increment ->dynticks with full ordering, and line 37
> complains bitterly if ->dynticks is not even.
>
> So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> not nested, which is the correct thing for them to think in that case.
> They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> to zero (adding and then subtracting 1).  If the NMI happens after the
> atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> to zero (adding and subtracting two again).  If the NMI happens after
> the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> time to one (again adding and subtracting two).
>
> In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> one means we need to atomically increment ->dynticks, and other values
> mean that we are partially or fully nested.  Reasoning proceeds as for
> rcu_nmi_enter(), but in the opposite direction.
>
> Whew!  That might even work.

I think I like this, with the warnings above.

>
> But how about taking a different approach.  Assuming that there can
> never be more than (say) 14 nesting NMI-like things, use the lower
> four bits of ->dynticks to represent the NMI nesting and the upper
> 28 bits as the counter.  This of course requires modifying lots of
> places in RCU that check the counter, but it is probably time to
> abstract the check anyway.
>
> This would allow my earlier attempted logic to work and (maybe) simplify
> the reasoning a bit (and yes, the "magic" constants need macros):
>
>         void rcu_nmi_enter(void)
>         {
>                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>                 int incby = 0x01;
>
>                 WARN_ON_ONCE(nesting == 0xf);
>                 if (nesting == 0) {
>                         if (atomic_read(&rdtp->dynticks) & 0x10)
>                                 return;
>                         incby = 0x11;
>                 }
>                 smp_mb__before_atomic();
>                 atomic_add(incby, &rdtp->dynticks);
>                 smp_mb__after_atomic();
>                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>         }
>
>         void rcu_nmi_exit(void)
>         {
>                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>                 int incby = 0x0f;
>
>                 if (nesting == 0)
>                         return;
>                 if (nesting > 1)
>                         incby = -1;
>                 smp_mb__before_atomic();
>                 atomic_add(incby, &rdtp->dynticks);
>                 smp_mb__after_atomic();
>                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>         }
>
> Over to you!  ;-)

This latter one is all you :)

--Andy

>
>                                                         Thanx, Paul
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 20:22                       ` Andy Lutomirski
@ 2014-11-24 20:54                         ` Paul E. McKenney
  2014-11-24 21:02                           ` Andy Lutomirski
  2014-11-24 21:27                           ` Paul E. McKenney
  0 siblings, 2 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 20:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Returning state sounds like a bad idea, if we can reasonably avoid it.
> 
> I agree, except that we already do it for exception_enter(), etc.  But
> yes, changing fewer things is nice.
> 
> >
> > And I think I finally see what you are pointing out about my code: If
> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> > and the time I atomically increment ->dynticks, the nested NMI handler
> > will incorrectly believe that RCU is already paying attention to this CPU.
> > Which would indeed not be at all good, so good catch!!!
> >
> >> Otherwise, I think that there may need to be enough state somewhere so
> >> that the outermost nested rcu_nmi_enter knows whether to increment
> >> dynticks.  For example, dynticks_nmi_nesting could store the nesting
> >> count * 2 - (1 if the outermost nested user needs to increment
> >> dynticks).  Something like:
> >>
> >> void rcu_nmi_enter(void)
> >> {
> >>   /* Be very careful -- this function may be called reentrantly on the
> >> same CPU. */
> >>   atomically: increment dynticks if it's even.
> >>
> >>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> >>    * the state. */
> >>
> >>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> >>
> >>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> >> }
> >>
> >> void rcu_nmi_exit(void)
> >> {
> >>   WARN_ON(!(dynticks & 1));
> >>   locally atomically: dynticks_nmi_nesting -= 2, unless
> >> dynticks_nmi_nesting == 1, in which case set it to zero
> >>
> >>   if (dynticks_nmi_nesting was 1)
> >>     atomic_inc(&dynticks);
> >> }
> >>
> >> The invariant here is that, for a single unnested enter/exit, if
> >> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> >> in fact, holds for all invocations, nested or otherwise.
> >>
> >> At least one of those conditions is true at all times during the
> >> execution of outermost pair, starting with the first atomic operation
> >> and ending with the final atomic_inc.  So they nest properly no matter
> >> what else happens (unless, of course, someone else pokes dynticks in
> >> the middle).
> >>
> >> Thoughts?
> >
> > Let's see...  The evenness of ->dynticks should be preserved by nested NMI
> > handlers, so the check and increment need not be atomic.  We don't have
> > any way (other than atomic operations) to do local atomic modifications
> > on all architectures, because we cannot mask NMIs.  (Yes, it can work
> > on x86, but this is common code that needs to work everywhere.)  On the
> > other hand, presumably NMIs are rare, so atomic modification of the NMI
> > nesting counter should be OK, at least if it proves absolutely necessary.
> > And I am thinking that a mechanical proof will be needed here.  :-/
> >
> > But first, let me try generating the code and informally evaluating it:
> >
> >          1   struct rcu_dynticks {
> >          2     long long dynticks_nesting;
> >          3     int dynticks_nmi_nesting;
> >          4     atomic_t dynticks;
> >          5   };
> >          6
> >          7   void rcu_nmi_enter(void)
> >          8   {
> >          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >         10     int incby = 2;
> >         11
> >         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> >         13       smp_mb__before_atomic();
> >         14       atomic_inc(&rdtp->dynticks);
> >         15       smp_mb__after_atomic();
> >         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >         17       incby = 1;
> 
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?

That would make sense.

> >         18     }
> >         19     rdtp->dynticks_nmi_nesting += incby;
> 
> Oh, I see why you don't need local_add -- it's because an nmi in the
> middle of this increment won't have any effect on the interrupted
> code, so even a software RMW will be okay.

Yep!  ;-)

> >         20     barrier();
> >         21   }
> >         22
> >         23   void rcu_nmi_exit(void)
> >         24   {
> >         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >         26
> >         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> >         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >         29     if (rdtp->dynticks_nmi_nesting != 1) {
> 
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?

This is already implied by the WARN_ON_ONCE() on line 27 and the check
on line 29.

> >         30       rdtp->dynticks_nmi_nesting -= 2;
> >         31       return;
> >         32     }
> >         33     rdtp->dynticks_nmi_nesting = 0;
> >         34     smp_mb__before_atomic();
> 
> This implies barrier(), right?

Yep!

> >         35     atomic_inc(&rdtp->dynticks);
> >         36     smp_mb__after_atomic();
> >         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> >         38   }
> >
> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> > assumes that we don't need to increment ->dynticks.
> >
> > Line 12 checks to see if ->dynticks is even.  Note that this check is
> > stable: If there are nested NMIs, they will increment ->dynticks twice
> > or not at all, and either way preserves the evenness (to be proven, of
> > course, but that is the plan).  If ->dynticks is even, lines 13-15
> > atomically increment it, line 16 complains if still even, and line 17
> > says we will increment ->dynticks_nmi_nesting by only 1.
> >
> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> > line 20 keeps the compiler from getting too cute.
> >
> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> > structure.  Lines 27 and 28 complain bitterly if invariants are violated.
> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> >
> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> > atomically increment ->dynticks with full ordering, and line 37
> > complains bitterly if ->dynticks is not even.
> >
> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> > not nested, which is the correct thing for them to think in that case.
> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> > to zero (adding and then subtracting 1).  If the NMI happens after the
> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> > to zero (adding and subtracting two again).  If the NMI happens after
> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> > time to one (again adding and subtracting two).
> >
> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> > one means we need to atomically increment ->dynticks, and other values
> > mean that we are partially or fully nested.  Reasoning proceeds as for
> > rcu_nmi_enter(), but in the opposite direction.
> >
> > Whew!  That might even work.
> 
> I think I like this, with the warnings above.

OK with dropping the one that I called out as redundant?

> > But how about taking a different approach.  Assuming that there can
> > never be more than (say) 14 nesting NMI-like things, use the lower
> > four bits of ->dynticks to represent the NMI nesting and the upper
> > 28 bits as the counter.  This of course requires modifying lots of
> > places in RCU that check the counter, but it is probably time to
> > abstract the check anyway.
> >
> > This would allow my earlier attempted logic to work and (maybe) simplify
> > the reasoning a bit (and yes, the "magic" constants need macros):
> >
> >         void rcu_nmi_enter(void)
> >         {
> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> >                 int incby = 0x01;
> >
> >                 WARN_ON_ONCE(nesting == 0xf);
> >                 if (nesting == 0) {
> >                         if (atomic_read(&rdtp->dynticks) & 0x10)
> >                                 return;
> >                         incby = 0x11;
> >                 }
> >                 smp_mb__before_atomic();
> >                 atomic_add(incby, &rdtp->dynticks);
> >                 smp_mb__after_atomic();
> >                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >         }
> >
> >         void rcu_nmi_exit(void)
> >         {
> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> >                 int incby = 0x0f;
> >
> >                 if (nesting == 0)
> >                         return;
> >                 if (nesting > 1)
> >                         incby = -1;
> >                 smp_mb__before_atomic();
> >                 atomic_add(incby, &rdtp->dynticks);
> >                 smp_mb__after_atomic();
> >                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> >         }
> >
> > Over to you!  ;-)
> 
> This latter one is all you :)

Well, let's see how I feel about it after trying a Promela model of
the first code sequence.  ;-)

							Thanx, Paul

> --Andy
> 
> >
> >                                                         Thanx, Paul
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 20:54                         ` Paul E. McKenney
@ 2014-11-24 21:02                           ` Andy Lutomirski
  2014-11-24 21:35                             ` Paul E. McKenney
  2014-11-24 21:27                           ` Paul E. McKenney
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 21:02 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
>> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
>> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> >> >> <paulmck@linux.vnet.ibm.com> wrote:
>>
>> > Returning state sounds like a bad idea, if we can reasonably avoid it.
>>
>> I agree, except that we already do it for exception_enter(), etc.  But
>> yes, changing fewer things is nice.
>>
>> >
>> > And I think I finally see what you are pointing out about my code: If
>> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
>> > and the time I atomically increment ->dynticks, the nested NMI handler
>> > will incorrectly believe that RCU is already paying attention to this CPU.
>> > Which would indeed not be at all good, so good catch!!!
>> >
>> >> Otherwise, I think that there may need to be enough state somewhere so
>> >> that the outermost nested rcu_nmi_enter knows whether to increment
>> >> dynticks.  For example, dynticks_nmi_nesting could store the nesting
>> >> count * 2 - (1 if the outermost nested user needs to increment
>> >> dynticks).  Something like:
>> >>
>> >> void rcu_nmi_enter(void)
>> >> {
>> >>   /* Be very careful -- this function may be called reentrantly on the
>> >> same CPU. */
>> >>   atomically: increment dynticks if it's even.
>> >>
>> >>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>> >>    * the state. */
>> >>
>> >>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>> >>
>> >>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
>> >> }
>> >>
>> >> void rcu_nmi_exit(void)
>> >> {
>> >>   WARN_ON(!(dynticks & 1));
>> >>   locally atomically: dynticks_nmi_nesting -= 2, unless
>> >> dynticks_nmi_nesting == 1, in which case set it to zero
>> >>
>> >>   if (dynticks_nmi_nesting was 1)
>> >>     atomic_inc(&dynticks);
>> >> }
>> >>
>> >> The invariant here is that, for a single unnested enter/exit, if
>> >> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
>> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
>> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
>> >> in fact, holds for all invocations, nested or otherwise.
>> >>
>> >> At least one of those conditions is true at all times during the
>> >> execution of outermost pair, starting with the first atomic operation
>> >> and ending with the final atomic_inc.  So they nest properly no matter
>> >> what else happens (unless, of course, someone else pokes dynticks in
>> >> the middle).
>> >>
>> >> Thoughts?
>> >
>> > Let's see...  The evenness of ->dynticks should be preserved by nested NMI
>> > handlers, so the check and increment need not be atomic.  We don't have
>> > any way (other than atomic operations) to do local atomic modifications
>> > on all architectures, because we cannot mask NMIs.  (Yes, it can work
>> > on x86, but this is common code that needs to work everywhere.)  On the
>> > other hand, presumably NMIs are rare, so atomic modification of the NMI
>> > nesting counter should be OK, at least if it proves absolutely necessary.
>> > And I am thinking that a mechanical proof will be needed here.  :-/
>> >
>> > But first, let me try generating the code and informally evaluating it:
>> >
>> >          1   struct rcu_dynticks {
>> >          2     long long dynticks_nesting;
>> >          3     int dynticks_nmi_nesting;
>> >          4     atomic_t dynticks;
>> >          5   };
>> >          6
>> >          7   void rcu_nmi_enter(void)
>> >          8   {
>> >          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> >         10     int incby = 2;
>> >         11
>> >         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
>> >         13       smp_mb__before_atomic();
>> >         14       atomic_inc(&rdtp->dynticks);
>> >         15       smp_mb__after_atomic();
>> >         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> >         17       incby = 1;
>>
>> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
>
> That would make sense.
>
>> >         18     }
>> >         19     rdtp->dynticks_nmi_nesting += incby;
>>
>> Oh, I see why you don't need local_add -- it's because an nmi in the
>> middle of this increment won't have any effect on the interrupted
>> code, so even a software RMW will be okay.
>
> Yep!  ;-)
>
>> >         20     barrier();
>> >         21   }
>> >         22
>> >         23   void rcu_nmi_exit(void)
>> >         24   {
>> >         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> >         26
>> >         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
>> >         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> >         29     if (rdtp->dynticks_nmi_nesting != 1) {
>>
>> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
>
> This is already implied by the WARN_ON_ONCE() on line 27 and the check
> on line 29.

I was worried about negative numbers.  Maybe change line 27 to
WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then?  (Or is it
unsigned?  If so, let's make it signed to catch this type of error.)

>
>> >         30       rdtp->dynticks_nmi_nesting -= 2;
>> >         31       return;
>> >         32     }
>> >         33     rdtp->dynticks_nmi_nesting = 0;
>> >         34     smp_mb__before_atomic();
>>
>> This implies barrier(), right?
>
> Yep!
>
>> >         35     atomic_inc(&rdtp->dynticks);
>> >         36     smp_mb__after_atomic();
>> >         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> >         38   }
>> >
>> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
>> > assumes that we don't need to increment ->dynticks.
>> >
>> > Line 12 checks to see if ->dynticks is even.  Note that this check is
>> > stable: If there are nested NMIs, they will increment ->dynticks twice
>> > or not at all, and either way preserves the evenness (to be proven, of
>> > course, but that is the plan).  If ->dynticks is even, lines 13-15
>> > atomically increment it, line 16 complains if still even, and line 17
>> > says we will increment ->dynticks_nmi_nesting by only 1.
>> >
>> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
>> > line 20 keeps the compiler from getting too cute.
>> >
>> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
>> > structure.  Lines 27 and 28 complain bitterly if invariants are violated.
>> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
>> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
>> >
>> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
>> > atomically increment ->dynticks with full ordering, and line 37
>> > complains bitterly if ->dynticks is not even.
>> >
>> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
>> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
>> > not nested, which is the correct thing for them to think in that case.
>> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
>> > to zero (adding and then subtracting 1).  If the NMI happens after the
>> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
>> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
>> > to zero (adding and subtracting two again).  If the NMI happens after
>> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
>> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
>> > time to one (again adding and subtracting two).
>> >
>> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
>> > one means we need to atomically increment ->dynticks, and other values
>> > mean that we are partially or fully nested.  Reasoning proceeds as for
>> > rcu_nmi_enter(), but in the opposite direction.
>> >
>> > Whew!  That might even work.
>>
>> I think I like this, with the warnings above.
>
> OK with dropping the one that I called out as redundant?

Sure, but see above.

>
>> > But how about taking a different approach.  Assuming that there can
>> > never be more than (say) 14 nesting NMI-like things, use the lower
>> > four bits of ->dynticks to represent the NMI nesting and the upper
>> > 28 bits as the counter.  This of course requires modifying lots of
>> > places in RCU that check the counter, but it is probably time to
>> > abstract the check anyway.
>> >
>> > This would allow my earlier attempted logic to work and (maybe) simplify
>> > the reasoning a bit (and yes, the "magic" constants need macros):
>> >
>> >         void rcu_nmi_enter(void)
>> >         {
>> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> >                 int incby = 0x01;
>> >
>> >                 WARN_ON_ONCE(nesting == 0xf);
>> >                 if (nesting == 0) {
>> >                         if (atomic_read(&rdtp->dynticks) & 0x10)
>> >                                 return;
>> >                         incby = 0x11;
>> >                 }
>> >                 smp_mb__before_atomic();
>> >                 atomic_add(incby, &rdtp->dynticks);
>> >                 smp_mb__after_atomic();
>> >                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> >         }
>> >
>> >         void rcu_nmi_exit(void)
>> >         {
>> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> >                 int incby = 0x0f;
>> >
>> >                 if (nesting == 0)
>> >                         return;
>> >                 if (nesting > 1)
>> >                         incby = -1;
>> >                 smp_mb__before_atomic();
>> >                 atomic_add(incby, &rdtp->dynticks);
>> >                 smp_mb__after_atomic();
>> >                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> >         }
>> >
>> > Over to you!  ;-)
>>
>> This latter one is all you :)
>
> Well, let's see how I feel about it after trying a Promela model of
> the first code sequence.  ;-)

:)

Does Promela understand the differences between this type of
reentrancy and real threading?

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 20:54                         ` Paul E. McKenney
  2014-11-24 21:02                           ` Andy Lutomirski
@ 2014-11-24 21:27                           ` Paul E. McKenney
  1 sibling, 0 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 21:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 12:54:41PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> > On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> > >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> > >> <paulmck@linux.vnet.ibm.com> wrote:
> > >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> > >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> > >> >> <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > > Returning state sounds like a bad idea, if we can reasonably avoid it.
> > 
> > I agree, except that we already do it for exception_enter(), etc.  But
> > yes, changing fewer things is nice.
> > 
> > >
> > > And I think I finally see what you are pointing out about my code: If
> > > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> > > and the time I atomically increment ->dynticks, the nested NMI handler
> > > will incorrectly believe that RCU is already paying attention to this CPU.
> > > Which would indeed not be at all good, so good catch!!!
> > >
> > >> Otherwise, I think that there may need to be enough state somewhere so
> > >> that the outermost nested rcu_nmi_enter knows whether to increment
> > >> dynticks.  For example, dynticks_nmi_nesting could store the nesting
> > >> count * 2 - (1 if the outermost nested user needs to increment
> > >> dynticks).  Something like:
> > >>
> > >> void rcu_nmi_enter(void)
> > >> {
> > >>   /* Be very careful -- this function may be called reentrantly on the
> > >> same CPU. */
> > >>   atomically: increment dynticks if it's even.
> > >>
> > >>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> > >>    * the state. */
> > >>
> > >>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> > >>
> > >>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> > >> }
> > >>
> > >> void rcu_nmi_exit(void)
> > >> {
> > >>   WARN_ON(!(dynticks & 1));
> > >>   locally atomically: dynticks_nmi_nesting -= 2, unless
> > >> dynticks_nmi_nesting == 1, in which case set it to zero
> > >>
> > >>   if (dynticks_nmi_nesting was 1)
> > >>     atomic_inc(&dynticks);
> > >> }
> > >>
> > >> The invariant here is that, for a single unnested enter/exit, if
> > >> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
> > >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> > >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> > >> in fact, holds for all invocations, nested or otherwise.
> > >>
> > >> At least one of those conditions is true at all times during the
> > >> execution of outermost pair, starting with the first atomic operation
> > >> and ending with the final atomic_inc.  So they nest properly no matter
> > >> what else happens (unless, of course, someone else pokes dynticks in
> > >> the middle).
> > >>
> > >> Thoughts?
> > >
> > > Let's see...  The evenness of ->dynticks should be preserved by nested NMI
> > > handlers, so the check and increment need not be atomic.  We don't have
> > > any way (other than atomic operations) to do local atomic modifications
> > > on all architectures, because we cannot mask NMIs.  (Yes, it can work
> > > on x86, but this is common code that needs to work everywhere.)  On the
> > > other hand, presumably NMIs are rare, so atomic modification of the NMI
> > > nesting counter should be OK, at least if it proves absolutely necessary.
> > > And I am thinking that a mechanical proof will be needed here.  :-/
> > >
> > > But first, let me try generating the code and informally evaluating it:
> > >
> > >          1   struct rcu_dynticks {
> > >          2     long long dynticks_nesting;
> > >          3     int dynticks_nmi_nesting;
> > >          4     atomic_t dynticks;
> > >          5   };
> > >          6
> > >          7   void rcu_nmi_enter(void)
> > >          8   {
> > >          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >         10     int incby = 2;
> > >         11
> > >         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > >         13       smp_mb__before_atomic();
> > >         14       atomic_inc(&rdtp->dynticks);
> > >         15       smp_mb__after_atomic();
> > >         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >         17       incby = 1;
> > 
> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
> 
> That would make sense.

I take it back.  We get here only if there is no lower-level NMI in effect.
I could do WARN_ON_ONCE(rdtp->dynticks_nmi_nesting > 0), though.

							Thanx, Paul

> > >         18     }
> > >         19     rdtp->dynticks_nmi_nesting += incby;
> > 
> > Oh, I see why you don't need local_add -- it's because an nmi in the
> > middle of this increment won't have any effect on the interrupted
> > code, so even a software RMW will be okay.
> 
> Yep!  ;-)
> 
> > >         20     barrier();
> > >         21   }
> > >         22
> > >         23   void rcu_nmi_exit(void)
> > >         24   {
> > >         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >         26
> > >         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> > >         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >         29     if (rdtp->dynticks_nmi_nesting != 1) {
> > 
> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
> 
> This is already implied by the WARN_ON_ONCE() on line 27 and the check
> on line 29.
> 
> > >         30       rdtp->dynticks_nmi_nesting -= 2;
> > >         31       return;
> > >         32     }
> > >         33     rdtp->dynticks_nmi_nesting = 0;
> > >         34     smp_mb__before_atomic();
> > 
> > This implies barrier(), right?
> 
> Yep!
> 
> > >         35     atomic_inc(&rdtp->dynticks);
> > >         36     smp_mb__after_atomic();
> > >         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > >         38   }
> > >
> > > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> > > assumes that we don't need to increment ->dynticks.
> > >
> > > Line 12 checks to see if ->dynticks is even.  Note that this check is
> > > stable: If there are nested NMIs, they will increment ->dynticks twice
> > > or not at all, and either way preserves the evenness (to be proven, of
> > > course, but that is the plan).  If ->dynticks is even, lines 13-15
> > > atomically increment it, line 16 complains if still even, and line 17
> > > says we will increment ->dynticks_nmi_nesting by only 1.
> > >
> > > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> > > line 20 keeps the compiler from getting too cute.
> > >
> > > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> > > structure.  Lines 27 and 28 complain bitterly if invariants are violated.
> > > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> > > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> > >
> > > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> > > atomically increment ->dynticks with full ordering, and line 37
> > > complains bitterly if ->dynticks is not even.
> > >
> > > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> > > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> > > not nested, which is the correct thing for them to think in that case.
> > > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> > > to zero (adding and then subtracting 1).  If the NMI happens after the
> > > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> > > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> > > to zero (adding and subtracting two again).  If the NMI happens after
> > > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> > > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> > > time to one (again adding and subtracting two).
> > >
> > > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> > > one means we need to atomically increment ->dynticks, and other values
> > > mean that we are partially or fully nested.  Reasoning proceeds as for
> > > rcu_nmi_enter(), but in the opposite direction.
> > >
> > > Whew!  That might even work.
> > 
> > I think I like this, with the warnings above.
> 
> OK with dropping the one that I called out as redundant?
> 
> > > But how about taking a different approach.  Assuming that there can
> > > never be more than (say) 14 nesting NMI-like things, use the lower
> > > four bits of ->dynticks to represent the NMI nesting and the upper
> > > 28 bits as the counter.  This of course requires modifying lots of
> > > places in RCU that check the counter, but it is probably time to
> > > abstract the check anyway.
> > >
> > > This would allow my earlier attempted logic to work and (maybe) simplify
> > > the reasoning a bit (and yes, the "magic" constants need macros):
> > >
> > >         void rcu_nmi_enter(void)
> > >         {
> > >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > >                 int incby = 0x01;
> > >
> > >                 WARN_ON_ONCE(nesting == 0xf);
> > >                 if (nesting == 0) {
> > >                         if (atomic_read(&rdtp->dynticks) & 0x10)
> > >                                 return;
> > >                         incby = 0x11;
> > >                 }
> > >                 smp_mb__before_atomic();
> > >                 atomic_add(incby, &rdtp->dynticks);
> > >                 smp_mb__after_atomic();
> > >                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >         }
> > >
> > >         void rcu_nmi_exit(void)
> > >         {
> > >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > >                 int incby = 0x0f;
> > >
> > >                 if (nesting == 0)
> > >                         return;
> > >                 if (nesting > 1)
> > >                         incby = -1;
> > >                 smp_mb__before_atomic();
> > >                 atomic_add(incby, &rdtp->dynticks);
> > >                 smp_mb__after_atomic();
> > >                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > >         }
> > >
> > > Over to you!  ;-)
> > 
> > This latter one is all you :)
> 
> Well, let's see how I feel about it after trying a Promela model of
> the first code sequence.  ;-)
> 
> 							Thanx, Paul
> 
> > --Andy
> > 
> > >
> > >                                                         Thanx, Paul
> > >
> > 
> > 
> > 
> > -- 
> > Andy Lutomirski
> > AMA Capital Management, LLC
> > 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 21:02                           ` Andy Lutomirski
@ 2014-11-24 21:35                             ` Paul E. McKenney
  2014-11-24 22:34                               ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 21:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 01:02:51PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> >> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> >> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> >> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> >> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >>
> >> > Returning state sounds like a bad idea, if we can reasonably avoid it.
> >>
> >> I agree, except that we already do it for exception_enter(), etc.  But
> >> yes, changing fewer things is nice.
> >>
> >> >
> >> > And I think I finally see what you are pointing out about my code: If
> >> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> >> > and the time I atomically increment ->dynticks, the nested NMI handler
> >> > will incorrectly believe that RCU is already paying attention to this CPU.
> >> > Which would indeed not be at all good, so good catch!!!
> >> >
> >> >> Otherwise, I think that there may need to be enough state somewhere so
> >> >> that the outermost nested rcu_nmi_enter knows whether to increment
> >> >> dynticks.  For example, dynticks_nmi_nesting could store the nesting
> >> >> count * 2 - (1 if the outermost nested user needs to increment
> >> >> dynticks).  Something like:
> >> >>
> >> >> void rcu_nmi_enter(void)
> >> >> {
> >> >>   /* Be very careful -- this function may be called reentrantly on the
> >> >> same CPU. */
> >> >>   atomically: increment dynticks if it's even.
> >> >>
> >> >>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> >> >>    * the state. */
> >> >>
> >> >>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> >> >>
> >> >>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> >> >> }
> >> >>
> >> >> void rcu_nmi_exit(void)
> >> >> {
> >> >>   WARN_ON(!(dynticks & 1));
> >> >>   locally atomically: dynticks_nmi_nesting -= 2, unless
> >> >> dynticks_nmi_nesting == 1, in which case set it to zero
> >> >>
> >> >>   if (dynticks_nmi_nesting was 1)
> >> >>     atomic_inc(&dynticks);
> >> >> }
> >> >>
> >> >> The invariant here is that, for a single unnested enter/exit, if
> >> >> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
> >> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> >> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> >> >> in fact, holds for all invocations, nested or otherwise.
> >> >>
> >> >> At least one of those conditions is true at all times during the
> >> >> execution of outermost pair, starting with the first atomic operation
> >> >> and ending with the final atomic_inc.  So they nest properly no matter
> >> >> what else happens (unless, of course, someone else pokes dynticks in
> >> >> the middle).
> >> >>
> >> >> Thoughts?
> >> >
> >> > Let's see...  The evenness of ->dynticks should be preserved by nested NMI
> >> > handlers, so the check and increment need not be atomic.  We don't have
> >> > any way (other than atomic operations) to do local atomic modifications
> >> > on all architectures, because we cannot mask NMIs.  (Yes, it can work
> >> > on x86, but this is common code that needs to work everywhere.)  On the
> >> > other hand, presumably NMIs are rare, so atomic modification of the NMI
> >> > nesting counter should be OK, at least if it proves absolutely necessary.
> >> > And I am thinking that a mechanical proof will be needed here.  :-/
> >> >
> >> > But first, let me try generating the code and informally evaluating it:
> >> >
> >> >          1   struct rcu_dynticks {
> >> >          2     long long dynticks_nesting;
> >> >          3     int dynticks_nmi_nesting;
> >> >          4     atomic_t dynticks;
> >> >          5   };
> >> >          6
> >> >          7   void rcu_nmi_enter(void)
> >> >          8   {
> >> >          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> >         10     int incby = 2;
> >> >         11
> >> >         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> >> >         13       smp_mb__before_atomic();
> >> >         14       atomic_inc(&rdtp->dynticks);
> >> >         15       smp_mb__after_atomic();
> >> >         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >> >         17       incby = 1;
> >>
> >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
> >
> > That would make sense.
> >
> >> >         18     }
> >> >         19     rdtp->dynticks_nmi_nesting += incby;
> >>
> >> Oh, I see why you don't need local_add -- it's because an nmi in the
> >> middle of this increment won't have any effect on the interrupted
> >> code, so even a software RMW will be okay.
> >
> > Yep!  ;-)
> >
> >> >         20     barrier();
> >> >         21   }
> >> >         22
> >> >         23   void rcu_nmi_exit(void)
> >> >         24   {
> >> >         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> >         26
> >> >         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> >> >         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >> >         29     if (rdtp->dynticks_nmi_nesting != 1) {
> >>
> >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
> >
> > This is already implied by the WARN_ON_ONCE() on line 27 and the check
> > on line 29.
> 
> I was worried about negative numbers.  Maybe change line 27 to
> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then?  (Or is it
> unsigned?  If so, let's make it signed to catch this type of error.)

Good point, they are signed, so your WARN_ON_ONCE() would work.

> >> >         30       rdtp->dynticks_nmi_nesting -= 2;
> >> >         31       return;
> >> >         32     }
> >> >         33     rdtp->dynticks_nmi_nesting = 0;
> >> >         34     smp_mb__before_atomic();
> >>
> >> This implies barrier(), right?
> >
> > Yep!
> >
> >> >         35     atomic_inc(&rdtp->dynticks);
> >> >         36     smp_mb__after_atomic();
> >> >         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> >> >         38   }
> >> >
> >> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> >> > assumes that we don't need to increment ->dynticks.
> >> >
> >> > Line 12 checks to see if ->dynticks is even.  Note that this check is
> >> > stable: If there are nested NMIs, they will increment ->dynticks twice
> >> > or not at all, and either way preserves the evenness (to be proven, of
> >> > course, but that is the plan).  If ->dynticks is even, lines 13-15
> >> > atomically increment it, line 16 complains if still even, and line 17
> >> > says we will increment ->dynticks_nmi_nesting by only 1.
> >> >
> >> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> >> > line 20 keeps the compiler from getting too cute.
> >> >
> >> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> >> > structure.  Lines 27 and 28 complain bitterly if invariants are violated.
> >> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> >> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> >> >
> >> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> >> > atomically increment ->dynticks with full ordering, and line 37
> >> > complains bitterly if ->dynticks is not even.
> >> >
> >> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> >> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> >> > not nested, which is the correct thing for them to think in that case.
> >> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> >> > to zero (adding and then subtracting 1).  If the NMI happens after the
> >> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> >> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> >> > to zero (adding and subtracting two again).  If the NMI happens after
> >> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> >> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> >> > time to one (again adding and subtracting two).
> >> >
> >> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> >> > one means we need to atomically increment ->dynticks, and other values
> >> > mean that we are partially or fully nested.  Reasoning proceeds as for
> >> > rcu_nmi_enter(), but in the opposite direction.
> >> >
> >> > Whew!  That might even work.
> >>
> >> I think I like this, with the warnings above.
> >
> > OK with dropping the one that I called out as redundant?
> 
> Sure, but see above.
> 
> >
> >> > But how about taking a different approach.  Assuming that there can
> >> > never be more than (say) 14 nesting NMI-like things, use the lower
> >> > four bits of ->dynticks to represent the NMI nesting and the upper
> >> > 28 bits as the counter.  This of course requires modifying lots of
> >> > places in RCU that check the counter, but it is probably time to
> >> > abstract the check anyway.
> >> >
> >> > This would allow my earlier attempted logic to work and (maybe) simplify
> >> > the reasoning a bit (and yes, the "magic" constants need macros):
> >> >
> >> >         void rcu_nmi_enter(void)
> >> >         {
> >> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> >> >                 int incby = 0x01;
> >> >
> >> >                 WARN_ON_ONCE(nesting == 0xf);
> >> >                 if (nesting == 0) {
> >> >                         if (atomic_read(&rdtp->dynticks) & 0x10)
> >> >                                 return;
> >> >                         incby = 0x11;
> >> >                 }
> >> >                 smp_mb__before_atomic();
> >> >                 atomic_add(incby, &rdtp->dynticks);
> >> >                 smp_mb__after_atomic();
> >> >                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> >> >         }
> >> >
> >> >         void rcu_nmi_exit(void)
> >> >         {
> >> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> >> >                 int incby = 0x0f;
> >> >
> >> >                 if (nesting == 0)
> >> >                         return;
> >> >                 if (nesting > 1)
> >> >                         incby = -1;
> >> >                 smp_mb__before_atomic();
> >> >                 atomic_add(incby, &rdtp->dynticks);
> >> >                 smp_mb__after_atomic();
> >> >                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> >> >         }
> >> >
> >> > Over to you!  ;-)
> >>
> >> This latter one is all you :)
> >
> > Well, let's see how I feel about it after trying a Promela model of
> > the first code sequence.  ;-)
> 
> :)
> 
> Does Promela understand the differences between this type of
> reentrancy and real threading?

Not as far as I know.  But it can be tricked into making this distinction.
One thread just has the Promela code as is, and the other thread has
the same Promela code entirely contained in an atomic block.  This means
that the entire second thread must execute at one point in the first
thread, just like an NMI would.

							Thanx, Paul



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 21:35                             ` Paul E. McKenney
@ 2014-11-24 22:34                               ` Paul E. McKenney
  2014-11-24 22:36                                 ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 22:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 01:02:51PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > > On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
> > >> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
> > >> <paulmck@linux.vnet.ibm.com> wrote:
> > >> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
> > >> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
> > >> >> <paulmck@linux.vnet.ibm.com> wrote:
> > >> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
> > >> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
> > >> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> > >>
> > >> > Returning state sounds like a bad idea, if we can reasonably avoid it.
> > >>
> > >> I agree, except that we already do it for exception_enter(), etc.  But
> > >> yes, changing fewer things is nice.
> > >>
> > >> >
> > >> > And I think I finally see what you are pointing out about my code: If
> > >> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
> > >> > and the time I atomically increment ->dynticks, the nested NMI handler
> > >> > will incorrectly believe that RCU is already paying attention to this CPU.
> > >> > Which would indeed not be at all good, so good catch!!!
> > >> >
> > >> >> Otherwise, I think that there may need to be enough state somewhere so
> > >> >> that the outermost nested rcu_nmi_enter knows whether to increment
> > >> >> dynticks.  For example, dynticks_nmi_nesting could store the nesting
> > >> >> count * 2 - (1 if the outermost nested user needs to increment
> > >> >> dynticks).  Something like:
> > >> >>
> > >> >> void rcu_nmi_enter(void)
> > >> >> {
> > >> >>   /* Be very careful -- this function may be called reentrantly on the
> > >> >> same CPU. */
> > >> >>   atomically: increment dynticks if it's even.
> > >> >>
> > >> >>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
> > >> >>    * the state. */
> > >> >>
> > >> >>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
> > >> >>
> > >> >>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
> > >> >> }
> > >> >>
> > >> >> void rcu_nmi_exit(void)
> > >> >> {
> > >> >>   WARN_ON(!(dynticks & 1));
> > >> >>   locally atomically: dynticks_nmi_nesting -= 2, unless
> > >> >> dynticks_nmi_nesting == 1, in which case set it to zero
> > >> >>
> > >> >>   if (dynticks_nmi_nesting was 1)
> > >> >>     atomic_inc(&dynticks);
> > >> >> }
> > >> >>
> > >> >> The invariant here is that, for a single unnested enter/exit, if
> > >> >> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
> > >> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
> > >> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
> > >> >> in fact, holds for all invocations, nested or otherwise.
> > >> >>
> > >> >> At least one of those conditions is true at all times during the
> > >> >> execution of outermost pair, starting with the first atomic operation
> > >> >> and ending with the final atomic_inc.  So they nest properly no matter
> > >> >> what else happens (unless, of course, someone else pokes dynticks in
> > >> >> the middle).
> > >> >>
> > >> >> Thoughts?
> > >> >
> > >> > Let's see...  The evenness of ->dynticks should be preserved by nested NMI
> > >> > handlers, so the check and increment need not be atomic.  We don't have
> > >> > any way (other than atomic operations) to do local atomic modifications
> > >> > on all architectures, because we cannot mask NMIs.  (Yes, it can work
> > >> > on x86, but this is common code that needs to work everywhere.)  On the
> > >> > other hand, presumably NMIs are rare, so atomic modification of the NMI
> > >> > nesting counter should be OK, at least if it proves absolutely necessary.
> > >> > And I am thinking that a mechanical proof will be needed here.  :-/
> > >> >
> > >> > But first, let me try generating the code and informally evaluating it:
> > >> >
> > >> >          1   struct rcu_dynticks {
> > >> >          2     long long dynticks_nesting;
> > >> >          3     int dynticks_nmi_nesting;
> > >> >          4     atomic_t dynticks;
> > >> >          5   };
> > >> >          6
> > >> >          7   void rcu_nmi_enter(void)
> > >> >          8   {
> > >> >          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> >         10     int incby = 2;
> > >> >         11
> > >> >         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > >> >         13       smp_mb__before_atomic();
> > >> >         14       atomic_inc(&rdtp->dynticks);
> > >> >         15       smp_mb__after_atomic();
> > >> >         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >> >         17       incby = 1;
> > >>
> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
> > >
> > > That would make sense.
> > >
> > >> >         18     }
> > >> >         19     rdtp->dynticks_nmi_nesting += incby;
> > >>
> > >> Oh, I see why you don't need local_add -- it's because an nmi in the
> > >> middle of this increment won't have any effect on the interrupted
> > >> code, so even a software RMW will be okay.
> > >
> > > Yep!  ;-)
> > >
> > >> >         20     barrier();
> > >> >         21   }
> > >> >         22
> > >> >         23   void rcu_nmi_exit(void)
> > >> >         24   {
> > >> >         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> >         26
> > >> >         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
> > >> >         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >> >         29     if (rdtp->dynticks_nmi_nesting != 1) {
> > >>
> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
> > >
> > > This is already implied by the WARN_ON_ONCE() on line 27 and the check
> > > on line 29.
> > 
> > I was worried about negative numbers.  Maybe change line 27 to
> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then?  (Or is it
> > unsigned?  If so, let's make it signed to catch this type of error.)
> 
> Good point, they are signed, so your WARN_ON_ONCE() would work.
> 
> > >> >         30       rdtp->dynticks_nmi_nesting -= 2;
> > >> >         31       return;
> > >> >         32     }
> > >> >         33     rdtp->dynticks_nmi_nesting = 0;
> > >> >         34     smp_mb__before_atomic();
> > >>
> > >> This implies barrier(), right?
> > >
> > > Yep!
> > >
> > >> >         35     atomic_inc(&rdtp->dynticks);
> > >> >         36     smp_mb__after_atomic();
> > >> >         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > >> >         38   }
> > >> >
> > >> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
> > >> > assumes that we don't need to increment ->dynticks.
> > >> >
> > >> > Line 12 checks to see if ->dynticks is even.  Note that this check is
> > >> > stable: If there are nested NMIs, they will increment ->dynticks twice
> > >> > or not at all, and either way preserves the evenness (to be proven, of
> > >> > course, but that is the plan).  If ->dynticks is even, lines 13-15
> > >> > atomically increment it, line 16 complains if still even, and line 17
> > >> > says we will increment ->dynticks_nmi_nesting by only 1.
> > >> >
> > >> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
> > >> > line 20 keeps the compiler from getting too cute.
> > >> >
> > >> > For rcu_nmi_exit(), line 25 again picks up this CPU's rcu_dynticks
> > >> > structure.  Lines 27 and 28 complain bitterly if invariants are violated.
> > >> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
> > >> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
> > >> >
> > >> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
> > >> > atomically increment ->dynticks with full ordering, and line 37
> > >> > complains bitterly if ->dynticks is not even.
> > >> >
> > >> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
> > >> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
> > >> > not nested, which is the correct thing for them to think in that case.
> > >> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
> > >> > to zero (adding and then subtracting 1).  If the NMI happens after the
> > >> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
> > >> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
> > >> > to zero (adding and subtracting two again).  If the NMI happens after
> > >> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
> > >> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
> > >> > time to one (again adding and subtracting two).
> > >> >
> > >> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
> > >> > one means we need to atomically increment ->dynticks, and other values
> > >> > mean that we are partially or fully nested.  Reasoning proceeds as for
> > >> > rcu_nmi_enter(), but in the opposite direction.
> > >> >
> > >> > Whew!  That might even work.
> > >>
> > >> I think I like this, with the warnings above.
> > >
> > > OK with dropping the one that I called out as redundant?
> > 
> > Sure, but see above.
> > 
> > >
> > >> > But how about taking a different approach.  Assuming that there can
> > >> > never be more than (say) 14 nesting NMI-like things, use the lower
> > >> > four bits of ->dynticks to represent the NMI nesting and the upper
> > >> > 28 bits as the counter.  This of course requires modifying lots of
> > >> > places in RCU that check the counter, but it is probably time to
> > >> > abstract the check anyway.
> > >> >
> > >> > This would allow my earlier attempted logic to work and (maybe) simplify
> > >> > the reasoning a bit (and yes, the "magic" constants need macros):
> > >> >
> > >> >         void rcu_nmi_enter(void)
> > >> >         {
> > >> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > >> >                 int incby = 0x01;
> > >> >
> > >> >                 WARN_ON_ONCE(nesting == 0xf);
> > >> >                 if (nesting == 0) {
> > >> >                         if (atomic_read(&rdtp->dynticks) & 0x10)
> > >> >                                 return;
> > >> >                         incby = 0x11;
> > >> >                 }
> > >> >                 smp_mb__before_atomic();
> > >> >                 atomic_add(incby, &rdtp->dynticks);
> > >> >                 smp_mb__after_atomic();
> > >> >                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > >> >         }
> > >> >
> > >> >         void rcu_nmi_exit(void)
> > >> >         {
> > >> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > >> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
> > >> >                 int incby = 0x0f;
> > >> >
> > >> >                 if (nesting == 0)
> > >> >                         return;
> > >> >                 if (nesting > 1)
> > >> >                         incby = -1;
> > >> >                 smp_mb__before_atomic();
> > >> >                 atomic_add(incby, &rdtp->dynticks);
> > >> >                 smp_mb__after_atomic();
> > >> >                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
> > >> >         }
> > >> >
> > >> > Over to you!  ;-)
> > >>
> > >> This latter one is all you :)
> > >
> > > Well, let's see how I feel about it after trying a Promela model of
> > > the first code sequence.  ;-)
> > 
> > :)
> > 
> > Does Promela understand the differences between this type of
> > reentrancy and real threading?
> 
> Not as far as I know.  But it can be tricked into making this distinction.
> One thread just has the Promela code as is, and the other thread has
> the same Promela code entirely contained in an atomic block.  This means
> that the entire second thread must execute at one point in the first
> thread, just like an NMI would.

And the following Promela model claims that your approach works.
Should I trust it?  ;-)

							Thanx, Paul

------------------------------------------------------------------------

/*
 * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
 * that allows nesting.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/gpl-2.0.html.
 *
 * Copyright IBM Corporation, 2014
 *
 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 */

byte dynticks_nesting;
byte dynticks_nmi_nesting;
byte dynticks;
byte busy;

/*
 * Promela version of rcu_nmi_enter().
 */
inline rcu_nmi_enter()
{
	assert(dynticks_nmi_nesting >= 0);
	if
	:: (dynticks & 1) == 0 ->
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 1);
		dynticks_nmi_nesting = dynticks_nmi_nesting + 1;
		assert(dynticks_nmi_nesting >= 1);
	:: else ->
		dynticks_nmi_nesting = dynticks_nmi_nesting + 2;
	fi;
}

/*
 * Promela version of rcu_nmi_exit().
 */
inline rcu_nmi_exit()
{
	assert(dynticks_nmi_nesting > 0);
	assert((dynticks & 1) != 0);
	if
	:: dynticks_nmi_nesting != 1 ->
		dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
	:: else ->
		dynticks_nmi_nesting = 0;
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 0);
	fi;
}

/*
 * Base-level NMI runs non-atomically.  Crudely emulates process-level
 * dynticks-idle entry/exit.
 */
proctype base_NMI()
{
	do
	::	if
		:: 1 ->	atomic {
				dynticks = dynticks + 1;
			}
			busy = 0;
		:: 1 ->	skip;
		fi;
		rcu_nmi_enter();
		assert((dynticks & 1) == 1);
		rcu_nmi_exit();
		if
		:: busy -> skip;
		:: !busy ->
			atomic {
				dynticks = dynticks + 1;
			}
			busy = 1;
		fi;
	od;
}

/*
 * Nested NMI runs atomically to emulate interrupting base_NMI().
 */
proctype nested_NMI()
{
	do
	::	atomic {
			rcu_nmi_enter();
			assert((dynticks & 1) == 1);
			rcu_nmi_exit();
		}
	od;
}

init {
	dynticks_nesting = 0;
	dynticks_nmi_nesting = 0;
	dynticks = 0;
	busy = 0;
	run base_NMI();
	run nested_NMI();
}


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 22:34                               ` Paul E. McKenney
@ 2014-11-24 22:36                                 ` Andy Lutomirski
  2014-11-24 22:57                                   ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 22:36 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>> On Mon, Nov 24, 2014 at 01:02:51PM -0800, Andy Lutomirski wrote:
>> > On Mon, Nov 24, 2014 at 12:54 PM, Paul E. McKenney
>> > <paulmck@linux.vnet.ibm.com> wrote:
>> > > On Mon, Nov 24, 2014 at 12:22:13PM -0800, Andy Lutomirski wrote:
>> > >> On Sat, Nov 22, 2014 at 3:41 PM, Paul E. McKenney
>> > >> <paulmck@linux.vnet.ibm.com> wrote:
>> > >> > On Fri, Nov 21, 2014 at 09:53:29PM -0800, Andy Lutomirski wrote:
>> > >> >> On Fri, Nov 21, 2014 at 8:20 PM, Paul E. McKenney
>> > >> >> <paulmck@linux.vnet.ibm.com> wrote:
>> > >> >> > On Fri, Nov 21, 2014 at 06:00:14PM -0800, Andy Lutomirski wrote:
>> > >> >> >> On Fri, Nov 21, 2014 at 3:38 PM, Paul E. McKenney
>> > >> >> >> <paulmck@linux.vnet.ibm.com> wrote:
>> > >>
>> > >> > Returning state sounds like a bad idea, if we can reasonably avoid it.
>> > >>
>> > >> I agree, except that we already do it for exception_enter(), etc.  But
>> > >> yes, changing fewer things is nice.
>> > >>
>> > >> >
>> > >> > And I think I finally see what you are pointing out about my code: If
>> > >> > another NMI comes in between the time I increment ->dynticks_nmi_nesting
>> > >> > and the time I atomically increment ->dynticks, the nested NMI handler
>> > >> > will incorrectly believe that RCU is already paying attention to this CPU.
>> > >> > Which would indeed not be at all good, so good catch!!!
>> > >> >
>> > >> >> Otherwise, I think that there may need to be enough state somewhere so
>> > >> >> that the outermost nested rcu_nmi_enter knows whether to increment
>> > >> >> dynticks.  For example, dynticks_nmi_nesting could store the nesting
>> > >> >> count * 2 - (1 if the outermost nested user needs to increment
>> > >> >> dynticks).  Something like:
>> > >> >>
>> > >> >> void rcu_nmi_enter(void)
>> > >> >> {
>> > >> >>   /* Be very careful -- this function may be called reentrantly on the
>> > >> >> same CPU. */
>> > >> >>   atomically: increment dynticks if it's even.
>> > >> >>
>> > >> >>   /* If an rcu_nmi_enter/rcu_nmi_exit pair happens here, then it will not change
>> > >> >>    * the state. */
>> > >> >>
>> > >> >>   local_inc(&dynticks_nmi_nesting, (we incremented dynticks ? 1 : 2));
>> > >> >>
>> > >> >>   WARN_ON(we incremented dynticks and dynticks_nmi_nesting was nonzero);
>> > >> >> }
>> > >> >>
>> > >> >> void rcu_nmi_exit(void)
>> > >> >> {
>> > >> >>   WARN_ON(!(dynticks & 1));
>> > >> >>   locally atomically: dynticks_nmi_nesting -= 2, unless
>> > >> >> dynticks_nmi_nesting == 1, in which case set it to zero
>> > >> >>
>> > >> >>   if (dynticks_nmi_nesting was 1)
>> > >> >>     atomic_inc(&dynticks);
>> > >> >> }
>> > >> >>
>> > >> >> The invariant here is that, for a single unnested enter/exit, if
>> > >> >> dynticks_nmi_nesting != 0, then dynticks is odd.  As a result, an
>> > >> >> rcu_nmi_enter/rcu_nmi_exit pair at any time when dynticks_nmi_nesting
>> > >> >> != 0 *or* dynticks is odd will have no net effect, so the invariant,
>> > >> >> in fact, holds for all invocations, nested or otherwise.
>> > >> >>
>> > >> >> At least one of those conditions is true at all times during the
>> > >> >> execution of outermost pair, starting with the first atomic operation
>> > >> >> and ending with the final atomic_inc.  So they nest properly no matter
>> > >> >> what else happens (unless, of course, someone else pokes dynticks in
>> > >> >> the middle).
>> > >> >>
>> > >> >> Thoughts?
>> > >> >
>> > >> > Let's see...  The evenness of ->dynticks should be preserved by nested NMI
>> > >> > handlers, so the check and increment need not be atomic.  We don't have
>> > >> > any way (other than atomic operations) to do local atomic modifications
>> > >> > on all architectures, because we cannot mask NMIs.  (Yes, it can work
>> > >> > on x86, but this is common code that needs to work everywhere.)  On the
>> > >> > other hand, presumably NMIs are rare, so atomic modification of the NMI
>> > >> > nesting counter should be OK, at least if it proves absolutely necessary.
>> > >> > And I am thinking that a mechanical proof will be needed here.  :-/
>> > >> >
>> > >> > But first, let me try generating the code and informally evaluating it:
>> > >> >
>> > >> >          1   struct rcu_dynticks {
>> > >> >          2     long long dynticks_nesting;
>> > >> >          3     int dynticks_nmi_nesting;
>> > >> >          4     atomic_t dynticks;
>> > >> >          5   };
>> > >> >          6
>> > >> >          7   void rcu_nmi_enter(void)
>> > >> >          8   {
>> > >> >          9     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> >         10     int incby = 2;
>> > >> >         11
>> > >> >         12     if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
>> > >> >         13       smp_mb__before_atomic();
>> > >> >         14       atomic_inc(&rdtp->dynticks);
>> > >> >         15       smp_mb__after_atomic();
>> > >> >         16       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > >> >         17       incby = 1;
>> > >>
>> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 1) here, perhaps?
>> > >
>> > > That would make sense.
>> > >
>> > >> >         18     }
>> > >> >         19     rdtp->dynticks_nmi_nesting += incby;
>> > >>
>> > >> Oh, I see why you don't need local_add -- it's because an nmi in the
>> > >> middle of this increment won't have any effect on the interrupted
>> > >> code, so even a software RMW will be okay.
>> > >
>> > > Yep!  ;-)
>> > >
>> > >> >         20     barrier();
>> > >> >         21   }
>> > >> >         22
>> > >> >         23   void rcu_nmi_exit(void)
>> > >> >         24   {
>> > >> >         25     struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> >         26
>> > >> >         27     WARN_ON_ONCE(!rdtp->dynticks_nmi_nesting);
>> > >> >         28     WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > >> >         29     if (rdtp->dynticks_nmi_nesting != 1) {
>> > >>
>> > >> WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 2), perhaps?
>> > >
>> > > This is already implied by the WARN_ON_ONCE() on line 27 and the check
>> > > on line 29.
>> >
>> > I was worried about negative numbers.  Maybe change line 27 to
>> > WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0), then?  (Or is it
>> > unsigned?  If so, let's make it signed to catch this type of error.)
>>
>> Good point, they are signed, so your WARN_ON_ONCE() would work.
>>
>> > >> >         30       rdtp->dynticks_nmi_nesting -= 2;
>> > >> >         31       return;
>> > >> >         32     }
>> > >> >         33     rdtp->dynticks_nmi_nesting = 0;
>> > >> >         34     smp_mb__before_atomic();
>> > >>
>> > >> This implies barrier(), right?
>> > >
>> > > Yep!
>> > >
>> > >> >         35     atomic_inc(&rdtp->dynticks);
>> > >> >         36     smp_mb__after_atomic();
>> > >> >         37     WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> > >> >         38   }
>> > >> >
>> > >> > Line 9 picks up a pointer to this CPU's rcu_dynticks structure and line 10
>> > >> > assumes that we don't need to increment ->dynticks.
>> > >> >
>> > >> > Line 12 checks to see if ->dynticks is even.  Note that this check is
>> > >> > stable: If there are nested NMIs, they will increment ->dynticks twice
>> > >> > or not at all, and either way preserves the evenness (to be proven, of
>> > >> > course, but that is the plan).  If ->dynticks is even, lines 13-15
>> > >> > atomically increment it, line 16 complains if still even, and line 17
>> > >> > says we will increment ->dynticks_nmi_nesting by only 1.
>> > >> >
>> > >> > Either way, line 19 increments ->dynticks_nmi_nesting as needed and
>> > >> > line 20 keeps the compiler from getting too cute.
>> > >> >
>> > >> > For rcu_nmi_exit(), line 25 again picks up this CPUs rcu_dynticks
>> > >> > structure.  Lines 27 and 28 complain bitterly if invariants are violated.
>> > >> > If line 29 finds that the value of ->dynticks_nmi_nesting is not 1,
>> > >> > then line 30 subtracts 2 from ->dynticks_nmi_nesting and line 31 returns.
>> > >> >
>> > >> > Otherwise, line 33 sets ->dynticks_nmi_nesting to zero, lines 34-36
>> > >> > atomically increment ->dynticks with full ordering, and line 37
>> > >> > complains bitterly if ->dynticks is not even.
>> > >> >
>> > >> > So, if an NMI occurs before rcu_nmi_enter's atomic increment, then the
>> > >> > nested NMI's rcu_nmi_enter() and rcu_nmi_exit() will think that they are
>> > >> > not nested, which is the correct thing for them to think in that case.
>> > >> > They will increment ->dynticks twice and restore ->dynticks_nmi_nesting
>> > >> > to zero (adding and then subtracting 1).  If the NMI happens after the
>> > >> > atomic increment, then the nested rcu_nmi_enter() and rcu_nmi_exit()
>> > >> > will leave ->dynticks alone, and will restore ->dynticks_nmi_nesting
>> > >> > to zero (adding and subtracting two again).  If the NMI happens after
>> > >> > the increment of ->dynticks_nmi_nesting, the nested NMI's rcu_nmi_enter()
>> > >> > and rcu_nmi_exit() will again restore ->dynticks_nmi_nesting, but this
>> > >> > time to one (again adding and subtracting two).
>> > >> >
>> > >> > In rcu_nmi_exit(), ->dynticks_nmi_nesting of zero had better not happen,
>> > >> > one means we need to atomically increment ->dynticks, and other values
>> > >> > mean that we are partially or fully nested.  Reasoning proceeds as for
>> > >> > rcu_nmi_enter(), but in the opposite direction.
>> > >> >
>> > >> > Whew!  That might even work.
>> > >>
>> > >> I think I like this, with the warnings above.
>> > >
>> > > OK with dropping the one that I called out as redundant?
>> >
>> > Sure, but see above.
>> >
>> > >
>> > >> > But how about taking a different approach.  Assuming that there can
>> > >> > never be more than (say) 14 nesting NMI-like things, use the lower
>> > >> > four bits of ->dynticks to represent the NMI nesting and the upper
>> > >> > 28 bits as the counter.  This of course requires modifying lots of
>> > >> > places in RCU that check the counter, but it is probably time to
>> > >> > abstract the check anyway.
>> > >> >
>> > >> > This would allow my earlier attempted logic to work and (maybe) simplify
>> > >> > the reasoning a bit (and yes, the "magic" constants need macros):
>> > >> >
>> > >> >         void rcu_nmi_enter(void)
>> > >> >         {
>> > >> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> > >> >                 int incby = 0x01;
>> > >> >
>> > >> >                 WARN_ON_ONCE(nesting == 0xf);
>> > >> >                 if (nesting == 0) {
>> > >> >                         if (atomic_read(&rdtp->dynticks) & 0x10)
>> > >> >                                 return;
>> > >> >                         incby = 0x11;
>> > >> >                 }
>> > >> >                 smp_mb__before_atomic();
>> > >> >                 atomic_add(incby, &rdtp->dynticks);
>> > >> >                 smp_mb__after_atomic();
>> > >> >                 WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
>> > >> >         }
>> > >> >
>> > >> >         void rcu_nmi_exit(void)
>> > >> >         {
>> > >> >                 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>> > >> >                 int nesting = atomic_read(&rdtp->dynticks) & 0xf;
>> > >> >                 int incby = 0x0f;
>> > >> >
>> > >> >                 if (nesting == 0)
>> > >> >                         return;
>> > >> >                 if (nesting > 1)
>> > >> >                         incby = -1;
>> > >> >                 smp_mb__before_atomic();
>> > >> >                 atomic_add(incby, &rdtp->dynticks);
>> > >> >                 smp_mb__after_atomic();
>> > >> >                 WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
>> > >> >         }
>> > >> >
>> > >> > Over to you!  ;-)
>> > >>
>> > >> This latter one is all you :)
>> > >
>> > > Well, let's see how I feel about it after trying a Promela model of
>> > > the first code sequence.  ;-)
>> >
>> > :)
>> >
>> > Does Promela understand the differences between this type of
>> > reentrancy and real threading?
>>
>> Not as far as I know.  But it can be tricked into making this distinction.
>> One thread just has the Promela code as is, and the other thread has
>> the same Promela code entirely contained in an atomic block.  This means
>> that the entire second thread must execute at one point in the first
>> thread, just like an NMI would.
>
> And the following Promela model claims that your approach works.
> Should I trust it?  ;-)
>

I think so.

Want to write a patch?  If so, whose tree should it go in?  I can add
it to my IST series, but that seems a bit odd.

--Andy

>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> /*
>  * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
>  * that allows nesting.
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License as published by
>  * the Free Software Foundation; either version 2 of the License, or
>  * (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, you can access it online at
>  * http://www.gnu.org/licenses/gpl-2.0.html.
>  *
>  * Copyright IBM Corporation, 2014
>  *
>  * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>  */
>
> byte dynticks_nesting;
> byte dynticks_nmi_nesting;
> byte dynticks;
> byte busy;
>
> /*
>  * Promela version of rcu_nmi_enter().
>  */
> inline rcu_nmi_enter()
> {
>         assert(dynticks_nmi_nesting >= 0);
>         if
>         :: (dynticks & 1) == 0 ->
>                 atomic {
>                         dynticks = dynticks + 1;
>                 }
>                 assert((dynticks & 1) == 1);
>                 dynticks_nmi_nesting = dynticks_nmi_nesting + 1;
>                 assert(dynticks_nmi_nesting >= 1);
>         :: else ->
>                 dynticks_nmi_nesting = dynticks_nmi_nesting + 2;
>         fi;
> }
>
> /*
>  * Promela version of rcu_nmi_exit().
>  */
> inline rcu_nmi_exit()
> {
>         assert(dynticks_nmi_nesting > 0);
>         assert((dynticks & 1) != 0);
>         if
>         :: dynticks_nmi_nesting != 1 ->
>                 dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
>         :: else ->
>                 dynticks_nmi_nesting = 0;
>                 atomic {
>                         dynticks = dynticks + 1;
>                 }
>                 assert((dynticks & 1) == 0);
>         fi;
> }
>
> /*
>  * Base-level NMI runs non-atomically.  Crudely emulates process-level
>  * dynticks-idle entry/exit.
>  */
> proctype base_NMI()
> {
>         do
>         ::      if
>                 :: 1 -> atomic {
>                                 dynticks = dynticks + 1;
>                         }
>                         busy = 0;
>                 :: 1 -> skip;
>                 fi;
>                 rcu_nmi_enter();
>                 assert((dynticks & 1) == 1);
>                 rcu_nmi_exit();
>                 if
>                 :: busy -> skip;
>                 :: !busy ->
>                         atomic {
>                                 dynticks = dynticks + 1;
>                         }
>                         busy = 1;
>                 fi;
>         od;
> }
>
> /*
>  * Nested NMI runs atomically to emulate interrupting base_NMI().
>  */
> proctype nested_NMI()
> {
>         do
>         ::      atomic {
>                         rcu_nmi_enter();
>                         assert((dynticks & 1) == 1);
>                         rcu_nmi_exit();
>                 }
>         od;
> }
>
> init {
>         dynticks_nesting = 0;
>         dynticks_nmi_nesting = 0;
>         dynticks = 0;
>         busy = 0;
>         run base_NMI();
>         run nested_NMI();
> }
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 22:36                                 ` Andy Lutomirski
@ 2014-11-24 22:57                                   ` Paul E. McKenney
  2014-11-24 23:31                                     ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 22:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:

[ . . . ]

> > And the following Promela model claims that your approach works.
> > Should I trust it?  ;-)
> >
> 
> I think so.
> 
> Want to write a patch?  If so, whose tree should it go in?  I can add
> it to my IST series, but that seems a bit odd.

Working on it.  ;-)

							Thanx, Paul

> --Andy
> 
> >                                                         Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > /*
> >  * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
> >  * that allows nesting.
> >  *
> >  * This program is free software; you can redistribute it and/or modify
> >  * it under the terms of the GNU General Public License as published by
> >  * the Free Software Foundation; either version 2 of the License, or
> >  * (at your option) any later version.
> >  *
> >  * This program is distributed in the hope that it will be useful,
> >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >  * GNU General Public License for more details.
> >  *
> >  * You should have received a copy of the GNU General Public License
> >  * along with this program; if not, you can access it online at
> >  * http://www.gnu.org/licenses/gpl-2.0.html.
> >  *
> >  * Copyright IBM Corporation, 2014
> >  *
> >  * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >  */
> >
> > byte dynticks_nesting;
> > byte dynticks_nmi_nesting;
> > byte dynticks;
> > byte busy;
> >
> > /*
> >  * Promela version of rcu_nmi_enter().
> >  */
> > inline rcu_nmi_enter()
> > {
> >         assert(dynticks_nmi_nesting >= 0);
> >         if
> >         :: (dynticks & 1) == 0 ->
> >                 atomic {
> >                         dynticks = dynticks + 1;
> >                 }
> >                 assert((dynticks & 1) == 1);
> >                 dynticks_nmi_nesting = dynticks_nmi_nesting + 1;
> >                 assert(dynticks_nmi_nesting >= 1);
> >         :: else ->
> >                 dynticks_nmi_nesting = dynticks_nmi_nesting + 2;
> >         fi;
> > }
> >
> > /*
> >  * Promela version of rcu_nmi_exit().
> >  */
> > inline rcu_nmi_exit()
> > {
> >         assert(dynticks_nmi_nesting > 0);
> >         assert((dynticks & 1) != 0);
> >         if
> >         :: dynticks_nmi_nesting != 1 ->
> >                 dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
> >         :: else ->
> >                 dynticks_nmi_nesting = 0;
> >                 atomic {
> >                         dynticks = dynticks + 1;
> >                 }
> >                 assert((dynticks & 1) == 0);
> >         fi;
> > }
> >
> > /*
> >  * Base-level NMI runs non-atomically.  Crudely emulates process-level
> >  * dynticks-idle entry/exit.
> >  */
> > proctype base_NMI()
> > {
> >         do
> >         ::      if
> >                 :: 1 -> atomic {
> >                                 dynticks = dynticks + 1;
> >                         }
> >                         busy = 0;
> >                 :: 1 -> skip;
> >                 fi;
> >                 rcu_nmi_enter();
> >                 assert((dynticks & 1) == 1);
> >                 rcu_nmi_exit();
> >                 if
> >                 :: busy -> skip;
> >                 :: !busy ->
> >                         atomic {
> >                                 dynticks = dynticks + 1;
> >                         }
> >                         busy = 1;
> >                 fi;
> >         od;
> > }
> >
> > /*
> >  * Nested NMI runs atomically to emulate interrupting base_NMI().
> >  */
> > proctype nested_NMI()
> > {
> >         do
> >         ::      atomic {
> >                         rcu_nmi_enter();
> >                         assert((dynticks & 1) == 1);
> >                         rcu_nmi_exit();
> >                 }
> >         od;
> > }
> >
> > init {
> >         dynticks_nesting = 0;
> >         dynticks_nmi_nesting = 0;
> >         dynticks = 0;
> >         busy = 0;
> >         run base_NMI();
> >         run nested_NMI();
> > }
> >
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 22:57                                   ` Paul E. McKenney
@ 2014-11-24 23:31                                     ` Paul E. McKenney
  2014-11-24 23:35                                       ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 23:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > And the following Promela model claims that your approach works.
> > > Should I trust it?  ;-)
> > >
> > 
> > I think so.
> > 
> > Want to write a patch?  If so, whose tree should it go in?  I can add
> > it to my IST series, but that seems a bit odd.
> 
> Working on it.  ;-)

And here is a sneak preview of the patch.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_nmi_enter() handle nesting

Andy Lutomirski is introducing ISTs into x86, which from RCU's
viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter() and
rcu_nmi_exit() must now correctly handle nesting.  This commit therefore
introduces nesting, using a clever NMI-coordination algorithm suggested
by Andy.  The trick is to atomically increment ->dynticks (if needed)
before manipulating ->dynticks_nmi_nesting on entry (and, accordingly,
after on exit).  In addition, ->dynticks_nmi_nesting is incremented by
one if ->dynticks was incremented and by two otherwise.  This means that
when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal to one, it knows
that ->dynticks must be atomically incremented.

This NMI-coordination algorithm has been validated by the following
Promela model, for whatever that might be worth:

/*
 * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
 * that allows nesting.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/gpl-2.0.html.
 *
 * Copyright IBM Corporation, 2014
 *
 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 */

byte dynticks_nesting = 0;
byte dynticks_nmi_nesting = 0;
byte dynticks = 0;

/*
 * Promela version of rcu_nmi_enter().
 */
inline rcu_nmi_enter()
{
	byte incby;

	incby = 2;
	assert(dynticks_nmi_nesting >= 0);
	if
	:: (dynticks & 1) == 0 ->
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 1);
		incby = 1;
	:: else ->
		skip;
	fi;
	dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
	assert(dynticks_nmi_nesting >= 1);
}

/*
 * Promela version of rcu_nmi_exit().
 */
inline rcu_nmi_exit()
{
	assert(dynticks_nmi_nesting > 0);
	assert((dynticks & 1) != 0);
	if
	:: dynticks_nmi_nesting != 1 ->
		dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
	:: else ->
		dynticks_nmi_nesting = 0;
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 0);
	fi;
}

/*
 * Base-level NMI runs non-atomically.  Crudely emulates process-level
 * dynticks-idle entry/exit.
 */
proctype base_NMI()
{
	byte busy;

	busy = 0;
	do
	::	if
		:: 1 ->	atomic {
				dynticks = dynticks + 1;
			}
			busy = 0;
		:: 1 ->	skip;
		fi;
		rcu_nmi_enter();
		assert((dynticks & 1) == 1);
		rcu_nmi_exit();
		if
		:: busy -> skip;
		:: !busy ->
			atomic {
				dynticks = dynticks + 1;
			}
			busy = 1;
		fi;
	od;
}

/*
 * Nested NMI runs atomically to emulate interrupting base_NMI().
 */
proctype nested_NMI()
{
	do
	::	atomic {
			rcu_nmi_enter();
			assert((dynticks & 1) == 1);
			rcu_nmi_exit();
		}
	od;
}

init {
	run base_NMI();
	run nested_NMI();
}

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8749f43f3f05..fc0236992655 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -759,39 +759,71 @@ void rcu_irq_enter(void)
 /**
  * rcu_nmi_enter - inform RCU of entry to NMI context
  *
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is active.
+ * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
+ * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
+ * that the CPU is active.  This implementation permits nested NMIs, as
+ * long as the nesting level does not overflow an int.  (You will probably
+ * run out of stack space first.)
  */
 void rcu_nmi_enter(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int incby = 2;
 
-	if (rdtp->dynticks_nmi_nesting == 0 &&
-	    (atomic_read(&rdtp->dynticks) & 0x1))
-		return;
-	rdtp->dynticks_nmi_nesting++;
-	smp_mb__before_atomic();  /* Force delay from prior write. */
-	atomic_inc(&rdtp->dynticks);
-	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
-	smp_mb__after_atomic();  /* See above. */
-	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+	/* Complain about underflow. */
+	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
+
+	/*
+	 * If idle from RCU viewpoint, atomically increment ->dynticks
+	 * to mark non-idle and increment ->dynticks_nmi_nesting by one.
+	 * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
+	 * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
+	 * to be in the outermost NMI handler that interrupted an RCU-idle
+	 * period (observation due to Andy Lutomirski).
+	 */
+	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
+		smp_mb__before_atomic();  /* Force delay from prior write. */
+		atomic_inc(&rdtp->dynticks);
+		/* atomic_inc() before later RCU read-side crit sects */
+		smp_mb__after_atomic();  /* See above. */
+		WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+		incby = 1;
+	}
+	rdtp->dynticks_nmi_nesting += incby;
+	barrier();
 }
 
 /**
  * rcu_nmi_exit - inform RCU of exit from NMI context
  *
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is no longer active.
+ * If we are returning from the outermost NMI handler that interrupted an
+ * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
+ * to let the RCU grace-period handling know that the CPU is back to
+ * being RCU-idle.
  */
 void rcu_nmi_exit(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 
-	if (rdtp->dynticks_nmi_nesting == 0 ||
-	    --rdtp->dynticks_nmi_nesting != 0)
+	/*
+	 * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
+	 * (We are exiting an NMI handler, so RCU better be paying attention
+	 * to us!)
+	 */
+	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
+	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+
+	/*
+	 * If the nesting level is not 1, the CPU wasn't RCU-idle, so
+	 * leave it in non-RCU-idle state.
+	 */
+	if (rdtp->dynticks_nmi_nesting != 1) {
+		rdtp->dynticks_nmi_nesting -= 2;
 		return;
+	}
+
+	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
+	rdtp->dynticks_nmi_nesting = 0;
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
 	smp_mb__before_atomic();  /* See above. */
 	atomic_inc(&rdtp->dynticks);



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 23:31                                     ` Paul E. McKenney
@ 2014-11-24 23:35                                       ` Andy Lutomirski
  2014-11-24 23:50                                         ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 23:35 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
>> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
>> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
>> > <paulmck@linux.vnet.ibm.com> wrote:
>> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>>
>> [ . . . ]
>>
>> > > And the following Promela model claims that your approach works.
>> > > Should I trust it?  ;-)
>> > >
>> >
>> > I think so.
>> >
>> > Want to write a patch?  If so, whose tree should it go in?  I can add
>> > it to my IST series, but that seems a bit odd.
>>
>> Working on it.  ;-)
>
> And here is a sneak preview of the patch.  Thoughts?
>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_nmi_enter() handle nesting
>
> Andy Lutomirski is introducing ISTs into x86, which from RCU's
> viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter() and
> rcu_nmi_exit() must now correctly handle nesting.

You must not be a frequent reader of entry_64.S and the Intel SDM :)
IOW, IST is just a stack switching mechanism, and these interrupts
have been around forever -- they're just buggy right now.

How about:

x86 has multiple types of NMI-like interrupts: real NMIs, machine
checks, and, for some values of NMI-like, debugging and breakpoint
interrupts.  These interrupts can nest inside each other.  Andy
Lutomirski is adding RCU support to these interrupts, so
rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.

Other than that, I like it.

>  This commit therefore
> introduces nesting, using a clever NMI-coordination algorithm suggested
> by Andy.  The trick is to atomically increment ->dynticks (if needed)
> before manipulating ->dynticks_nmi_nesting on entry (and, accordingly,
> after on exit).  In addition, ->dynticks_nmi_nesting is incremented by
> one if ->dynticks was incremented and by two otherwise.  This means that
> when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal to one, it knows
> that ->dynticks must be atomically incremented.
>
> This NMI-coordination algorithm has been validated by the following
> Promela model, for whatever that might be worth:
>
> /*
>  * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
>  * that allows nesting.
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License as published by
>  * the Free Software Foundation; either version 2 of the License, or
>  * (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, you can access it online at
>  * http://www.gnu.org/licenses/gpl-2.0.html.
>  *
>  * Copyright IBM Corporation, 2014
>  *
>  * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>  */
>
> byte dynticks_nesting = 0;
> byte dynticks_nmi_nesting = 0;
> byte dynticks = 0;
>
> /*
>  * Promela version of rcu_nmi_enter().
>  */
> inline rcu_nmi_enter()
> {
>         byte incby;
>
>         incby = 2;
>         assert(dynticks_nmi_nesting >= 0);
>         if
>         :: (dynticks & 1) == 0 ->
>                 atomic {
>                         dynticks = dynticks + 1;
>                 }
>                 assert((dynticks & 1) == 1);
>                 incby = 1;
>         :: else ->
>                 skip;
>         fi;
>         dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
>         assert(dynticks_nmi_nesting >= 1);
> }
>
> /*
>  * Promela version of rcu_nmi_exit().
>  */
> inline rcu_nmi_exit()
> {
>         assert(dynticks_nmi_nesting > 0);
>         assert((dynticks & 1) != 0);
>         if
>         :: dynticks_nmi_nesting != 1 ->
>                 dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
>         :: else ->
>                 dynticks_nmi_nesting = 0;
>                 atomic {
>                         dynticks = dynticks + 1;
>                 }
>                 assert((dynticks & 1) == 0);
>         fi;
> }
>
> /*
>  * Base-level NMI runs non-atomically.  Crudely emulates process-level
>  * dynticks-idle entry/exit.
>  */
> proctype base_NMI()
> {
>         byte busy;
>
>         busy = 0;
>         do
>         ::      if
>                 :: 1 -> atomic {
>                                 dynticks = dynticks + 1;
>                         }
>                         busy = 0;
>                 :: 1 -> skip;
>                 fi;
>                 rcu_nmi_enter();
>                 assert((dynticks & 1) == 1);
>                 rcu_nmi_exit();
>                 if
>                 :: busy -> skip;
>                 :: !busy ->
>                         atomic {
>                                 dynticks = dynticks + 1;
>                         }
>                         busy = 1;
>                 fi;
>         od;
> }
>
> /*
>  * Nested NMI runs atomically to emulate interrupting base_level().
>  */
> proctype nested_NMI()
> {
>         do
>         ::      atomic {
>                         rcu_nmi_enter();
>                         assert((dynticks & 1) == 1);
>                         rcu_nmi_exit();
>                 }
>         od;
> }
>
> init {
>         run base_NMI();
>         run nested_NMI();
> }
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..fc0236992655 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
>  /**
>   * rcu_nmi_enter - inform RCU of entry to NMI context
>   *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is active.
> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> + * that the CPU is active.  This implementation permits nested NMIs, as
> + * long as the nesting level does not overflow an int.  (You will probably
> + * run out of stack space first.)
>   */
>  void rcu_nmi_enter(void)
>  {
>         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> +       int incby = 2;
>
> -       if (rdtp->dynticks_nmi_nesting == 0 &&
> -           (atomic_read(&rdtp->dynticks) & 0x1))
> -               return;
> -       rdtp->dynticks_nmi_nesting++;
> -       smp_mb__before_atomic();  /* Force delay from prior write. */
> -       atomic_inc(&rdtp->dynticks);
> -       /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> -       smp_mb__after_atomic();  /* See above. */
> -       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +       /* Complain about underflow. */
> +       WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> +
> +       /*
> +        * If idle from RCU viewpoint, atomically increment ->dynticks
> +        * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> +        * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
> +        * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> +        * to be in the outermost NMI handler that interrupted an RCU-idle
> +        * period (observation due to Andy Lutomirski).
> +        */
> +       if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> +               smp_mb__before_atomic();  /* Force delay from prior write. */
> +               atomic_inc(&rdtp->dynticks);
> +               /* atomic_inc() before later RCU read-side crit sects */
> +               smp_mb__after_atomic();  /* See above. */
> +               WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +               incby = 1;
> +       }
> +       rdtp->dynticks_nmi_nesting += incby;
> +       barrier();
>  }
>
>  /**
>   * rcu_nmi_exit - inform RCU of exit from NMI context
>   *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is no longer active.
> + * If we are returning from the outermost NMI handler that interrupted an
> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> + * to let the RCU grace-period handling know that the CPU is back to
> + * being RCU-idle.
>   */
>  void rcu_nmi_exit(void)
>  {
>         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>
> -       if (rdtp->dynticks_nmi_nesting == 0 ||
> -           --rdtp->dynticks_nmi_nesting != 0)
> +       /*
> +        * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> +        * (We are exiting an NMI handler, so RCU better be paying attention
> +        * to us!)
> +        */
> +       WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> +       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +
> +       /*
> +        * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> +        * leave it in non-RCU-idle state.
> +        */
> +       if (rdtp->dynticks_nmi_nesting != 1) {
> +               rdtp->dynticks_nmi_nesting -= 2;
>                 return;
> +       }
> +
> +       /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> +       rdtp->dynticks_nmi_nesting = 0;
>         /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
>         smp_mb__before_atomic();  /* See above. */
>         atomic_inc(&rdtp->dynticks);
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 23:35                                       ` Andy Lutomirski
@ 2014-11-24 23:50                                         ` Paul E. McKenney
  2014-11-24 23:52                                           ` Andy Lutomirski
                                                             ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-24 23:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 03:35:55PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
> >> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> >> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> >> > <paulmck@linux.vnet.ibm.com> wrote:
> >> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> >>
> >> [ . . . ]
> >>
> >> > > And the following Promela model claims that your approach works.
> >> > > Should I trust it?  ;-)
> >> > >
> >> >
> >> > I think so.
> >> >
> >> > Want to write a patch?  If so, whose tree should it go in?  I can add
> >> > it to my IST series, but that seems a bit odd.
> >>
> >> Working on it.  ;-)
> >
> > And here is a sneak preview of the patch.  Thoughts?
> >
> >                                                         Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > rcu: Make rcu_nmi_enter() handle nesting
> >
> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
> > viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter() and
> > rcu_nmi_exit() must now correctly handle nesting.
> 
> You must not be a frequent reader of entry_64.S and the Intel SDM :)
> IOW, IST is just a stack switching mechanism, and these interrupts
> have been around forever -- they're just buggy right now.
> 
> How about:
> 
> x86 has multiple types of NMI-like interrupts: real NMIs, machine
> checks, and, for some values of NMI-like, debugging and breakpoint
> interrupts.  These interrupts can nest inside each other.  Andy
> Lutomirski is adding RCU support to these interrupts, so
> rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
> 
> Other than that, I like it.

And here is the updated version.  Left to my normal workflow, this would
go into 3.20.  Please let me know if you need it earlier, but in that
case, I will need you to test it fairly soon.  (I don't have any way
to test nested NMI-like things.)

							Thanx, Paul

------------------------------------------------------------------------

rcu: Make rcu_nmi_enter() handle nesting

The x86 architecture has multiple types of NMI-like interrupts: real
NMIs, machine checks, and, for some values of NMI-like, debugging
and breakpoint interrupts.  These interrupts can nest inside each
other.  Andy Lutomirski is adding RCU support to these interrupts,
so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.

This commit therefore introduces nesting, using a clever NMI-coordination
algorithm suggested by Andy.  The trick is to atomically increment
->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
(and, accordingly, after on exit).  In addition, ->dynticks_nmi_nesting
is incremented by one if ->dynticks was incremented and by two otherwise.
This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
to one, it knows that ->dynticks must be atomically incremented.

This NMI-coordination algorithm has been validated by the following
Promela model, for whatever that might be worth:

/*
 * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
 * that allows nesting.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/gpl-2.0.html.
 *
 * Copyright IBM Corporation, 2014
 *
 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 */

byte dynticks_nesting = 0;
byte dynticks_nmi_nesting = 0;
byte dynticks = 0;

/*
 * Promela version of rcu_nmi_enter().
 */
inline rcu_nmi_enter()
{
	byte incby;

	incby = 2;
	assert(dynticks_nmi_nesting >= 0);
	if
	:: (dynticks & 1) == 0 ->
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 1);
		incby = 1;
	:: else ->
		skip;
	fi;
	dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
	assert(dynticks_nmi_nesting >= 1);
}

/*
 * Promela version of rcu_nmi_exit().
 */
inline rcu_nmi_exit()
{
	assert(dynticks_nmi_nesting > 0);
	assert((dynticks & 1) != 0);
	if
	:: dynticks_nmi_nesting != 1 ->
		dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
	:: else ->
		dynticks_nmi_nesting = 0;
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 0);
	fi;
}

/*
 * Base-level NMI runs non-atomically.  Crudely emulates process-level
 * dynticks-idle entry/exit.
 */
proctype base_NMI()
{
	byte busy;

	busy = 0;
	do
	::	if
		:: 1 ->	atomic {
				dynticks = dynticks + 1;
			}
			busy = 0;
		:: 1 ->	skip;
		fi;
		rcu_nmi_enter();
		assert((dynticks & 1) == 1);
		rcu_nmi_exit();
		if
		:: busy -> skip;
		:: !busy ->
			atomic {
				dynticks = dynticks + 1;
			}
			busy = 1;
		fi;
	od;
}

/*
 * Nested NMI runs atomically to emulate interrupting base_level().
 */
proctype nested_NMI()
{
	do
	::	atomic {
			rcu_nmi_enter();
			assert((dynticks & 1) == 1);
			rcu_nmi_exit();
		}
	od;
}

init {
	run base_NMI();
	run nested_NMI();
}

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8749f43f3f05..fc0236992655 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -759,39 +759,71 @@ void rcu_irq_enter(void)
 /**
  * rcu_nmi_enter - inform RCU of entry to NMI context
  *
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is active.
+ * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
+ * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
+ * that the CPU is active.  This implementation permits nested NMIs, as
+ * long as the nesting level does not overflow an int.  (You will probably
+ * run out of stack space first.)
  */
 void rcu_nmi_enter(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+	int incby = 2;
 
-	if (rdtp->dynticks_nmi_nesting == 0 &&
-	    (atomic_read(&rdtp->dynticks) & 0x1))
-		return;
-	rdtp->dynticks_nmi_nesting++;
-	smp_mb__before_atomic();  /* Force delay from prior write. */
-	atomic_inc(&rdtp->dynticks);
-	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
-	smp_mb__after_atomic();  /* See above. */
-	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+	/* Complain about underflow. */
+	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
+
+	/*
+	 * If idle from RCU viewpoint, atomically increment ->dynticks
+	 * to mark non-idle and increment ->dynticks_nmi_nesting by one.
+	 * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
+	 * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
+	 * to be in the outermost NMI handler that interrupted an RCU-idle
+	 * period (observation due to Andy Lutomirski).
+	 */
+	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
+		smp_mb__before_atomic();  /* Force delay from prior write. */
+		atomic_inc(&rdtp->dynticks);
+		/* atomic_inc() before later RCU read-side crit sects */
+		smp_mb__after_atomic();  /* See above. */
+		WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+		incby = 1;
+	}
+	rdtp->dynticks_nmi_nesting += incby;
+	barrier();
 }
 
 /**
  * rcu_nmi_exit - inform RCU of exit from NMI context
  *
- * If the CPU was idle with dynamic ticks active, and there is no
- * irq handler running, this updates rdtp->dynticks_nmi to let the
- * RCU grace-period handling know that the CPU is no longer active.
+ * If we are returning from the outermost NMI handler that interrupted an
+ * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
+ * to let the RCU grace-period handling know that the CPU is back to
+ * being RCU-idle.
  */
 void rcu_nmi_exit(void)
 {
 	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
 
-	if (rdtp->dynticks_nmi_nesting == 0 ||
-	    --rdtp->dynticks_nmi_nesting != 0)
+	/*
+	 * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
+	 * (We are exiting an NMI handler, so RCU better be paying attention
+	 * to us!)
+	 */
+	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
+	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+
+	/*
+	 * If the nesting level is not 1, the CPU wasn't RCU-idle, so
+	 * leave it in non-RCU-idle state.
+	 */
+	if (rdtp->dynticks_nmi_nesting != 1) {
+		rdtp->dynticks_nmi_nesting -= 2;
 		return;
+	}
+
+	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
+	rdtp->dynticks_nmi_nesting = 0;
 	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
 	smp_mb__before_atomic();  /* See above. */
 	atomic_inc(&rdtp->dynticks);



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 23:50                                         ` Paul E. McKenney
@ 2014-11-24 23:52                                           ` Andy Lutomirski
  2014-11-25 18:58                                             ` Borislav Petkov
  2014-11-25 17:13                                           ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Paul E. McKenney
  2014-11-27  7:03                                           ` Lai Jiangshan
  2 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-11-24 23:52 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 3:50 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Nov 24, 2014 at 03:35:55PM -0800, Andy Lutomirski wrote:
>> On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
>> >> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
>> >> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
>> >> > <paulmck@linux.vnet.ibm.com> wrote:
>> >> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
>> >>
>> >> [ . . . ]
>> >>
>> >> > > And the following Promela model claims that your approach works.
>> >> > > Should I trust it?  ;-)
>> >> > >
>> >> >
>> >> > I think so.
>> >> >
>> >> > Want to write a patch?  If so, whose tree should it go in?  I can add
>> >> > it to my IST series, but that seems a bit odd.
>> >>
>> >> Working on it.  ;-)
>> >
>> > And here is a sneak preview of the patch.  Thoughts?
>> >
>> >                                                         Thanx, Paul
>> >
>> > ------------------------------------------------------------------------
>> >
>> > rcu: Make rcu_nmi_enter() handle nesting
>> >
>> > Andy Lutomirski is introducing ISTs into x86, which from RCU's
>> > viewpoint are NMIs.  Because ISTs and NMIs can nest, rcu_nmi_enter() and
>> > rcu_nmi_exit() must now correctly handle nesting.
>>
>> You must not be a frequent reader of entry_64.S and the Intel SDM :)
>> IOW, IST is just a stack switching mechanism, and these interrupts
>> have been around forever -- they're just buggy right now.
>>
>> How about:
>>
>> x86 has multiple types of NMI-like interrupts: real NMIs, machine
>> checks, and, for some values of NMI-like, debugging and breakpoint
>> interrupts.  These interrupts can nest inside each other.  Andy
>> Lutomirski is adding RCU support to these interrupts, so
>> rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
>>
>> Other than that, I like it.
>
> And here is the updated version.  Left to my normal workflow, this would
> go into 3.20.  Please let me know if you need it earlier, but in that
> case, I will need you to test it fairly soon.  (I don't have any way
> to test nested NMI-like things.)

Dunno.  Tony and Borislav -- when do you want the IST stack switching stuff?

--Andy

>
>                                                         Thanx, Paul
>
> ------------------------------------------------------------------------
>
> rcu: Make rcu_nmi_enter() handle nesting
>
> The x86 architecture has multiple types of NMI-like interrupts: real
> NMIs, machine checks, and, for some values of NMI-like, debugging
> and breakpoint interrupts.  These interrupts can nest inside each
> other.  Andy Lutomirski is adding RCU support to these interrupts,
> so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
>
> This commit therefore introduces nesting, using a clever NMI-coordination
> algorithm suggested by Andy.  The trick is to atomically increment
> ->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
> (and, accordingly, after on exit).  In addition, ->dynticks_nmi_nesting
> is incremented by one if ->dynticks was incremented and by two otherwise.
> This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
> to one, it knows that ->dynticks must be atomically incremented.
>
> This NMI-coordination algorithm has been validated by the following
> Promela model, for whatever that might be worth:
>
> /*
>  * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
>  * that allows nesting.
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License as published by
>  * the Free Software Foundation; either version 2 of the License, or
>  * (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, you can access it online at
>  * http://www.gnu.org/licenses/gpl-2.0.html.
>  *
>  * Copyright IBM Corporation, 2014
>  *
>  * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>  */
>
> byte dynticks_nesting = 0;
> byte dynticks_nmi_nesting = 0;
> byte dynticks = 0;
>
> /*
>  * Promela version of rcu_nmi_enter().
>  */
> inline rcu_nmi_enter()
> {
>         byte incby;
>
>         incby = 2;
>         assert(dynticks_nmi_nesting >= 0);
>         if
>         :: (dynticks & 1) == 0 ->
>                 atomic {
>                         dynticks = dynticks + 1;
>                 }
>                 assert((dynticks & 1) == 1);
>                 incby = 1;
>         :: else ->
>                 skip;
>         fi;
>         dynticks_nmi_nesting = dynticks_nmi_nesting + incby;
>         assert(dynticks_nmi_nesting >= 1);
> }
>
> /*
>  * Promela version of rcu_nmi_exit().
>  */
> inline rcu_nmi_exit()
> {
>         assert(dynticks_nmi_nesting > 0);
>         assert((dynticks & 1) != 0);
>         if
>         :: dynticks_nmi_nesting != 1 ->
>                 dynticks_nmi_nesting = dynticks_nmi_nesting - 2;
>         :: else ->
>                 dynticks_nmi_nesting = 0;
>                 atomic {
>                         dynticks = dynticks + 1;
>                 }
>                 assert((dynticks & 1) == 0);
>         fi;
> }
>
> /*
>  * Base-level NMI runs non-atomically.  Crudely emulates process-level
>  * dynticks-idle entry/exit.
>  */
> proctype base_NMI()
> {
>         byte busy;
>
>         busy = 0;
>         do
>         ::      if
>                 :: 1 -> atomic {
>                                 dynticks = dynticks + 1;
>                         }
>                         busy = 0;
>                 :: 1 -> skip;
>                 fi;
>                 rcu_nmi_enter();
>                 assert((dynticks & 1) == 1);
>                 rcu_nmi_exit();
>                 if
>                 :: busy -> skip;
>                 :: !busy ->
>                         atomic {
>                                 dynticks = dynticks + 1;
>                         }
>                         busy = 1;
>                 fi;
>         od;
> }
>
> /*
>  * Nested NMI runs atomically to emulate interrupting base_level().
>  */
> proctype nested_NMI()
> {
>         do
>         ::      atomic {
>                         rcu_nmi_enter();
>                         assert((dynticks & 1) == 1);
>                         rcu_nmi_exit();
>                 }
>         od;
> }
>
> init {
>         run base_NMI();
>         run nested_NMI();
> }
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..fc0236992655 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
>  /**
>   * rcu_nmi_enter - inform RCU of entry to NMI context
>   *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is active.
> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> + * that the CPU is active.  This implementation permits nested NMIs, as
> + * long as the nesting level does not overflow an int.  (You will probably
> + * run out of stack space first.)
>   */
>  void rcu_nmi_enter(void)
>  {
>         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> +       int incby = 2;
>
> -       if (rdtp->dynticks_nmi_nesting == 0 &&
> -           (atomic_read(&rdtp->dynticks) & 0x1))
> -               return;
> -       rdtp->dynticks_nmi_nesting++;
> -       smp_mb__before_atomic();  /* Force delay from prior write. */
> -       atomic_inc(&rdtp->dynticks);
> -       /* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> -       smp_mb__after_atomic();  /* See above. */
> -       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +       /* Complain about underflow. */
> +       WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> +
> +       /*
> +        * If idle from RCU viewpoint, atomically increment ->dynticks
> +        * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> +        * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
> +        * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> +        * to be in the outermost NMI handler that interrupted an RCU-idle
> +        * period (observation due to Andy Lutomirski).
> +        */
> +       if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> +               smp_mb__before_atomic();  /* Force delay from prior write. */
> +               atomic_inc(&rdtp->dynticks);
> +               /* atomic_inc() before later RCU read-side crit sects */
> +               smp_mb__after_atomic();  /* See above. */
> +               WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +               incby = 1;
> +       }
> +       rdtp->dynticks_nmi_nesting += incby;
> +       barrier();
>  }
>
>  /**
>   * rcu_nmi_exit - inform RCU of exit from NMI context
>   *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is no longer active.
> + * If we are returning from the outermost NMI handler that interrupted an
> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> + * to let the RCU grace-period handling know that the CPU is back to
> + * being RCU-idle.
>   */
>  void rcu_nmi_exit(void)
>  {
>         struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>
> -       if (rdtp->dynticks_nmi_nesting == 0 ||
> -           --rdtp->dynticks_nmi_nesting != 0)
> +       /*
> +        * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> +        * (We are exiting an NMI handler, so RCU better be paying attention
> +        * to us!)
> +        */
> +       WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> +       WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +
> +       /*
> +        * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> +        * leave it in non-RCU-idle state.
> +        */
> +       if (rdtp->dynticks_nmi_nesting != 1) {
> +               rdtp->dynticks_nmi_nesting -= 2;
>                 return;
> +       }
> +
> +       /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> +       rdtp->dynticks_nmi_nesting = 0;
>         /* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
>         smp_mb__before_atomic();  /* See above. */
>         atomic_inc(&rdtp->dynticks);
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 23:50                                         ` Paul E. McKenney
  2014-11-24 23:52                                           ` Andy Lutomirski
@ 2014-11-25 17:13                                           ` Paul E. McKenney
  2014-11-27  7:03                                           ` Lai Jiangshan
  2 siblings, 0 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-25 17:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 03:50:58PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 24, 2014 at 03:35:55PM -0800, Andy Lutomirski wrote:
> > On Mon, Nov 24, 2014 at 3:31 PM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > > On Mon, Nov 24, 2014 at 02:57:54PM -0800, Paul E. McKenney wrote:
> > >> On Mon, Nov 24, 2014 at 02:36:18PM -0800, Andy Lutomirski wrote:
> > >> > On Mon, Nov 24, 2014 at 2:34 PM, Paul E. McKenney
> > >> > <paulmck@linux.vnet.ibm.com> wrote:
> > >> > > On Mon, Nov 24, 2014 at 01:35:01PM -0800, Paul E. McKenney wrote:
> > >>
> > >> [ . . . ]
> > >>
> > >> > > And the following Promela model claims that your approach works.
> > >> > > Should I trust it?  ;-)
> > >> > >
> > >> >
> > >> > I think so.

OK, I added some coverage checks and also injected bugs, all of which
it detected, so I am feeling at least a bit more confident in the model,
the updated version of which is included below, along with the script
that runs it.

							Thanx, Paul

------------------------------------------------------------------------
rcu_nmi.spin
------------------------------------------------------------------------

/*
 * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
 * that allows nesting.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, you can access it online at
 * http://www.gnu.org/licenses/gpl-2.0.html.
 *
 * Copyright IBM Corporation, 2014
 *
 * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 */

byte dynticks_nmi_nesting = 0;
byte dynticks = 0;
#define BUSY_INCBY 2 /* set to 1 to force failure. */

/*
 * Promela version of rcu_nmi_enter().
 */
inline rcu_nmi_enter()
{
	byte incby;
	byte tmp;

	incby = BUSY_INCBY;
	assert(dynticks_nmi_nesting >= 0);
	if
	:: (dynticks & 1) == 0 ->
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 1);
		incby = 1;
	:: else ->
		skip;
	fi;
	tmp = dynticks_nmi_nesting;
	tmp = tmp + incby;
	dynticks_nmi_nesting = tmp;
	assert(dynticks_nmi_nesting >= 1);
}

/*
 * Promela version of rcu_nmi_exit().
 */
inline rcu_nmi_exit()
{
	byte tmp;

	assert(dynticks_nmi_nesting > 0);
	assert((dynticks & 1) != 0);
	if
	:: dynticks_nmi_nesting != 1 ->
		tmp = dynticks_nmi_nesting;
		tmp = tmp - BUSY_INCBY;
		dynticks_nmi_nesting = tmp;
	:: else ->
		dynticks_nmi_nesting = 0;
		atomic {
			dynticks = dynticks + 1;
		}
		assert((dynticks & 1) == 0);
	fi;
}

/*
 * Base-level NMI runs non-atomically.  Crudely emulates process-level
 * dynticks-idle entry/exit.
 */
proctype base_NMI()
{
	byte busy;

	busy = 0;
	do
	::	/* Emulate base-level dynticks and not. */
		if
		:: 1 ->	atomic {
				dynticks = dynticks + 1;
			}
			busy = 1;
		:: 1 ->	skip;
		fi;

		/* Verify that we only sometimes have base-level dynticks. */
		if
		:: busy == 0 -> skip;
		:: busy == 1 -> skip;
		fi;

		/* Model RCU's NMI entry and exit actions. */
		rcu_nmi_enter();
		assert((dynticks & 1) == 1);
		rcu_nmi_exit();

		/* Emulate re-entering base-level dynticks and not. */
		if
		:: !busy -> skip;
		:: busy ->
			atomic {
				dynticks = dynticks + 1;
			}
			busy = 0;
		fi;

		/* We had better now be in dyntick-idle mode. */
		assert((dynticks & 1) == 0);
	od;
}

/*
 * Nested NMI runs atomically to emulate interrupting base_NMI().
 */
proctype nested_NMI()
{
	do
	::	/*
		 * Use an atomic section to model a nested NMI.  This is
		 * guaranteed to interleave into base_NMI() between a pair
		 * of base_NMI() statements, just as a nested NMI would.
		 */
		atomic {
			/* Verify that we only sometimes are in dynticks. */
			if
			:: (dynticks & 1) == 0 -> skip;
			:: (dynticks & 1) == 1 -> skip;
			fi;

			/* Model RCU's NMI entry and exit actions. */
			rcu_nmi_enter();
			assert((dynticks & 1) == 1);
			rcu_nmi_exit();
		}
	od;
}

init {
	run base_NMI();
	run nested_NMI();
}

------------------------------------------------------------------------
rcu_nmi.sh
------------------------------------------------------------------------

if ! spin -a rcu_nmi.spin
then
	echo Spin errors!!!
	exit 1
fi
if ! cc -DSAFETY -o pan pan.c
then
	echo Compilation errors!!!
	exit 1
fi
./pan -m100000
# For errors: spin -p -l -g -t rcu_nmi.spin < rcu_nmi.spin.trail
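
For a quick sanity check without Spin installed, the same sequential
invariants can be exercised with a plain userspace C harness.  This is a
hypothetical sketch, not kernel code: it is single-threaded, so it checks
only the counter bookkeeping, not the nested-NMI interleavings or memory
ordering that the Promela model and the kernel barriers cover.

```c
/* Hypothetical userspace model of the ->dynticks / ->dynticks_nmi_nesting
 * bookkeeping from the patch above.  Even dynticks = RCU-idle. */
#include <assert.h>

static int dynticks;             /* even = RCU-idle, odd = non-idle */
static int dynticks_nmi_nesting;

void model_nmi_enter(void)
{
	int incby = 2;

	assert(dynticks_nmi_nesting >= 0);
	if (!(dynticks & 1)) {		/* interrupted an RCU-idle period */
		dynticks++;
		incby = 1;		/* nesting == 1 marks "outermost from idle" */
	}
	dynticks_nmi_nesting += incby;
	assert(dynticks_nmi_nesting >= 1);
}

void model_nmi_exit(void)
{
	assert(dynticks_nmi_nesting > 0);
	assert(dynticks & 1);
	if (dynticks_nmi_nesting != 1) { /* not outermost-from-idle: stay non-idle */
		dynticks_nmi_nesting -= 2;
		return;
	}
	dynticks_nmi_nesting = 0;	/* restore RCU-idleness */
	dynticks++;
	assert(!(dynticks & 1));
}
```

Running an enter/enter/exit/exit sequence from the idle state returns the
model to even dynticks with zero nesting, matching Andy's observation that
nesting level 1 uniquely identifies the outermost NMI that interrupted an
RCU-idle period.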


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 23:52                                           ` Andy Lutomirski
@ 2014-11-25 18:58                                             ` Borislav Petkov
  2014-11-25 19:16                                               ` Paul E. McKenney
  2014-12-11  0:22                                               ` Tony Luck
  0 siblings, 2 replies; 72+ messages in thread
From: Borislav Petkov @ 2014-11-25 18:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paul McKenney, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
> Dunno. Tony and Borislav -- when do you want the IST stack switching
> stuff?

I'd leave that up to Tony and his testbench. I mean, we can hammer on
it as much as we can and it can pass all testing locally but the real
fun starts once it hits upstream and for that it doesn't matter which
release... IMHO.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-25 18:58                                             ` Borislav Petkov
@ 2014-11-25 19:16                                               ` Paul E. McKenney
  2014-12-11  0:22                                               ` Tony Luck
  1 sibling, 0 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-25 19:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Tue, Nov 25, 2014 at 07:58:03PM +0100, Borislav Petkov wrote:
> On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
> > Dunno. Tony and Borislav -- when do you want the IST stack switching
> > stuff?
> 
> I'd leave that up to Tony and his testbench. I mean, we can hammer on
> it as much as we can and it can pass all testing locally but the real
> fun starts once it hits upstream and for that it doesn't matter which
> release... IMHO.

Nothing quite like oddball hardware to make test planning a challenge,
no two ways about it!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-24 23:50                                         ` Paul E. McKenney
  2014-11-24 23:52                                           ` Andy Lutomirski
  2014-11-25 17:13                                           ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Paul E. McKenney
@ 2014-11-27  7:03                                           ` Lai Jiangshan
  2014-11-27 16:46                                             ` Paul E. McKenney
  2 siblings, 1 reply; 72+ messages in thread
From: Lai Jiangshan @ 2014-11-27  7:03 UTC (permalink / raw)
  To: paulmck
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker


> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 



Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>


> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8749f43f3f05..fc0236992655 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
>  /**
>   * rcu_nmi_enter - inform RCU of entry to NMI context
>   *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is active.
> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> + * that the CPU is active.  This implementation permits nested NMIs, as
> + * long as the nesting level does not overflow an int.  (You will probably
> + * run out of stack space first.)
>   */
>  void rcu_nmi_enter(void)
>  {
>  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> +	int incby = 2;
>  
> -	if (rdtp->dynticks_nmi_nesting == 0 &&
> -	    (atomic_read(&rdtp->dynticks) & 0x1))
> -		return;
> -	rdtp->dynticks_nmi_nesting++;
> -	smp_mb__before_atomic();  /* Force delay from prior write. */
> -	atomic_inc(&rdtp->dynticks);
> -	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> -	smp_mb__after_atomic();  /* See above. */
> -	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +	/* Complain about underflow. */
> +	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> +
> +	/*
> +	 * If idle from RCU viewpoint, atomically increment ->dynticks
> +	 * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> +	 * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
> +	 * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> +	 * to be in the outermost NMI handler that interrupted an RCU-idle
> +	 * period (observation due to Andy Lutomirski).
> +	 */
> +	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> +		smp_mb__before_atomic();  /* Force delay from prior write. */
> +		atomic_inc(&rdtp->dynticks);
> +		/* atomic_inc() before later RCU read-side crit sects */
> +		smp_mb__after_atomic();  /* See above. */
> +		WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +		incby = 1;
> +	}
> +	rdtp->dynticks_nmi_nesting += incby;

I prefer an "else" branch here.

	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
		...
		WARN_ON_ONCE(rdtp->dynticks_nmi_nesting); /* paired with "rdtp->dynticks_nmi_nesting = 0" */
		rdtp->dynticks_nmi_nesting = 1;  /* paired with "if (rdtp->dynticks_nmi_nesting != 1) {" */
	} else {
		rdtp->dynticks_nmi_nesting += 2;
	}



> +	barrier();
>  }
>  
>  /**
>   * rcu_nmi_exit - inform RCU of exit from NMI context
>   *
> - * If the CPU was idle with dynamic ticks active, and there is no
> - * irq handler running, this updates rdtp->dynticks_nmi to let the
> - * RCU grace-period handling know that the CPU is no longer active.
> + * If we are returning from the outermost NMI handler that interrupted an
> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> + * to let the RCU grace-period handling know that the CPU is back to
> + * being RCU-idle.
>   */
>  void rcu_nmi_exit(void)
>  {
>  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
>  
> -	if (rdtp->dynticks_nmi_nesting == 0 ||
> -	    --rdtp->dynticks_nmi_nesting != 0)
> +	/*
> +	 * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> +	 * (We are exiting an NMI handler, so RCU better be paying attention
> +	 * to us!)
> +	 */
> +	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> +	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> +
> +	/*
> +	 * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> +	 * leave it in non-RCU-idle state.
> +	 */
> +	if (rdtp->dynticks_nmi_nesting != 1) {
> +		rdtp->dynticks_nmi_nesting -= 2;
>  		return;
> +	}
> +
> +	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> +	rdtp->dynticks_nmi_nesting = 0;
>  	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
>  	smp_mb__before_atomic();  /* See above. */
>  	atomic_inc(&rdtp->dynticks);
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-27  7:03                                           ` Lai Jiangshan
@ 2014-11-27 16:46                                             ` Paul E. McKenney
  0 siblings, 0 replies; 72+ messages in thread
From: Paul E. McKenney @ 2014-11-27 16:46 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Andy Lutomirski, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker

On Thu, Nov 27, 2014 at 03:03:22PM +0800, Lai Jiangshan wrote:
> 
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Thank you, recorded!

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8749f43f3f05..fc0236992655 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -759,39 +759,71 @@ void rcu_irq_enter(void)
> >  /**
> >   * rcu_nmi_enter - inform RCU of entry to NMI context
> >   *
> > - * If the CPU was idle with dynamic ticks active, and there is no
> > - * irq handler running, this updates rdtp->dynticks_nmi to let the
> > - * RCU grace-period handling know that the CPU is active.
> > + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> > + * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> > + * that the CPU is active.  This implementation permits nested NMIs, as
> > + * long as the nesting level does not overflow an int.  (You will probably
> > + * run out of stack space first.)
> >   */
> >  void rcu_nmi_enter(void)
> >  {
> >  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> > +	int incby = 2;
> >  
> > -	if (rdtp->dynticks_nmi_nesting == 0 &&
> > -	    (atomic_read(&rdtp->dynticks) & 0x1))
> > -		return;
> > -	rdtp->dynticks_nmi_nesting++;
> > -	smp_mb__before_atomic();  /* Force delay from prior write. */
> > -	atomic_inc(&rdtp->dynticks);
> > -	/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
> > -	smp_mb__after_atomic();  /* See above. */
> > -	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > +	/* Complain about underflow. */
> > +	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> > +
> > +	/*
> > +	 * If idle from RCU viewpoint, atomically increment ->dynticks
> > +	 * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> > +	 * Otherwise, increment ->dynticks_nmi_nesting by two.  This means
> > +	 * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> > +	 * to be in the outermost NMI handler that interrupted an RCU-idle
> > +	 * period (observation due to Andy Lutomirski).
> > +	 */
> > +	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> > +		smp_mb__before_atomic();  /* Force delay from prior write. */
> > +		atomic_inc(&rdtp->dynticks);
> > +		/* atomic_inc() before later RCU read-side crit sects */
> > +		smp_mb__after_atomic();  /* See above. */
> > +		WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > +		incby = 1;
> > +	}
> > +	rdtp->dynticks_nmi_nesting += incby;
> 
> I prefer an "else" branch here.
> 
> 	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
> 		...
> 		WARN_ON_ONCE(rdtp->dynticks_nmi_nesting); /* paired with "rdtp->dynticks_nmi_nesting = 0" */
> 		rdtp->dynticks_nmi_nesting = 1;  /* paired with "if (rdtp->dynticks_nmi_nesting != 1) {" */
> 	} else {
> 		rdtp->dynticks_nmi_nesting += 2;
> 	}

I avoided this approach due to the extra line.

							Thanx, Paul

> > +	barrier();
> >  }
> >  
> >  /**
> >   * rcu_nmi_exit - inform RCU of exit from NMI context
> >   *
> > - * If the CPU was idle with dynamic ticks active, and there is no
> > - * irq handler running, this updates rdtp->dynticks_nmi to let the
> > - * RCU grace-period handling know that the CPU is no longer active.
> > + * If we are returning from the outermost NMI handler that interrupted an
> > + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> > + * to let the RCU grace-period handling know that the CPU is back to
> > + * being RCU-idle.
> >   */
> >  void rcu_nmi_exit(void)
> >  {
> >  	struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >  
> > -	if (rdtp->dynticks_nmi_nesting == 0 ||
> > -	    --rdtp->dynticks_nmi_nesting != 0)
> > +	/*
> > +	 * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> > +	 * (We are exiting an NMI handler, so RCU better be paying attention
> > +	 * to us!)
> > +	 */
> > +	WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> > +	WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
> > +
> > +	/*
> > +	 * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> > +	 * leave it in non-RCU-idle state.
> > +	 */
> > +	if (rdtp->dynticks_nmi_nesting != 1) {
> > +		rdtp->dynticks_nmi_nesting -= 2;
> >  		return;
> > +	}
> > +
> > +	/* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> > +	rdtp->dynticks_nmi_nesting = 0;
> >  	/* CPUs seeing atomic_inc() must see prior RCU read-side crit sects */
> >  	smp_mb__before_atomic();  /* See above. */
> >  	atomic_inc(&rdtp->dynticks);
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > .
> > 
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-25 18:58                                             ` Borislav Petkov
  2014-11-25 19:16                                               ` Paul E. McKenney
@ 2014-12-11  0:22                                               ` Tony Luck
  2014-12-11  0:24                                                 ` Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: Tony Luck @ 2014-12-11  0:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Tue, Nov 25, 2014 at 10:58 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
>> Dunno. Tony and Borislav -- when do you want the IST stack switching
>> stuff?
>
> I'd leave that up to Tony and his testbench. I mean, we can hammer on
> it as much as we can and it can pass all testing locally but the real
> fun starts once it hits upstream and for that it doesn't matter which
> release... IMHO.

So what was the net result of all the mode/RCU discussions?

Do I need some extra magic incantations in the final version
of do_machine_check() beyond what was in this patch:

  https://lkml.kernel.org/r/546d169211835aded@agluck-desk.sc.intel.com

to make everything happy?

-Tony

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-12-11  0:22                                               ` Tony Luck
@ 2014-12-11  0:24                                                 ` Andy Lutomirski
  2015-01-05 21:46                                                   ` Tony Luck
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2014-12-11  0:24 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Wed, Dec 10, 2014 at 4:22 PM, Tony Luck <tony.luck@gmail.com> wrote:
> On Tue, Nov 25, 2014 at 10:58 AM, Borislav Petkov <bp@alien8.de> wrote:
>> On Mon, Nov 24, 2014 at 03:52:10PM -0800, Andy Lutomirski wrote:
>>> Dunno. Tony and Borislav -- when do you want the IST stack switching
>>> stuff?
>>
>> I'd leave that up to Tony and his testbench. I mean, we can hammer on
>> it as much as we can and it can pass all testing locally but the real
>> fun starts once it hits upstream and for that it doesn't matter which
>> release... IMHO.
>
> So what was the net result of all the mode/RCU discussions?
>
> Do I need some extra magic incantations in the final version
> of do_machine_check() beyond what was in this patch:
>
>   https://lkml.kernel.org/r/546d169211835aded@agluck-desk.sc.intel.com
>
> to make everything happy?
>

I think you need ist_begin_non_atomic() before local_irq_enable() and
ist_end_non_atomic() after local_irq_disable().  Otherwise it should
be good.

--Andy

> -Tony



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-12-11  0:24                                                 ` Andy Lutomirski
@ 2015-01-05 21:46                                                   ` Tony Luck
  2015-01-05 21:54                                                     ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Tony Luck @ 2015-01-05 21:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Wed, Dec 10, 2014 at 4:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> I think you need ist_begin_non_atomic() before local_irq_enable() and
> ist_end_non_atomic() after local_irq_disable().  Otherwise it should
> be good.

In your x86/paranoid branch you added:

     prev_state = ist_enter(regs);
          .... body of do_machine_check() here ...
     ist_exit(regs, prev_state);

Does that override the previous advice? Or do I still need something before
I call local_irq_enable() and after local_irq_disable()?

-Tony

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-05 21:46                                                   ` Tony Luck
@ 2015-01-05 21:54                                                     ` Andy Lutomirski
  2015-01-06  0:44                                                       ` [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks Luck, Tony
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-05 21:54 UTC (permalink / raw)
  To: Tony Luck
  Cc: Borislav Petkov, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Jan 5, 2015 at 1:46 PM, Tony Luck <tony.luck@gmail.com> wrote:
> On Wed, Dec 10, 2014 at 4:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> I think you need ist_begin_non_atomic() before local_irq_enable() and
>> ist_end_non_atomic() after local_irq_disable().  Otherwise it should
>> be good.
>
> In your x86/paranoid branch you added:
>
>      prev_state = ist_enter(regs);
>           .... body of do_machine_check() here ...
>      ist_exit(regs, prev_state);
>
> Does that override the previous advice? Or do I still need something before
> I call local_irq_enable() and after local_irq_disable()?

I think I was just being vague.  It would be:

     prev_state = ist_enter(regs);
          .... beginning of do_machine_check() here ...

     if (whatever condition) {
        ist_begin_non_atomic();
        local_irq_enable();

        ...

        local_irq_disable();
        ist_end_non_atomic();
     }
     ist_exit(regs, prev_state);

--Andy

>
> -Tony



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-05 21:54                                                     ` Andy Lutomirski
@ 2015-01-06  0:44                                                       ` Luck, Tony
  2015-01-06  1:01                                                         ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Luck, Tony @ 2015-01-06  0:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, fweisbec

We now switch to the kernel stack when a machine check interrupts
during user mode.  This means that we can perform recovery actions
in the tail of do_machine_check().

Signed-off-by: Tony Luck <tony.luck@intel.com>

---
On top of Andy's x86/paranoid branch
Andy: Should I really move that:
	pr_err("Uncorrected hardware memory error ...
inside the ist_begin_non_atomic() section?

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..9b3de99dc004 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -190,7 +190,6 @@ enum mcp_flags {
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
-void mce_notify_process(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8b13b0fbda8e..e82e95abc92b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -75,7 +75,6 @@ struct thread_info {
 #define TIF_SYSCALL_EMU		6	/* syscall emulation active */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
-#define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
@@ -100,7 +99,6 @@ struct thread_info {
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
-#define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
@@ -140,7 +138,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |				\
 	 _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)
 
 /* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 800d423f1e92..f17a4f1aaa5d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1004,51 +1004,6 @@ static void mce_clear_state(unsigned long *toclear)
 }
 
 /*
- * Need to save faulting physical address associated with a process
- * in the machine check handler some place where we can grab it back
- * later in mce_notify_process()
- */
-#define	MCE_INFO_MAX	16
-
-struct mce_info {
-	atomic_t		inuse;
-	struct task_struct	*t;
-	__u64			paddr;
-	int			restartable;
-} mce_info[MCE_INFO_MAX];
-
-static void mce_save_info(__u64 addr, int c)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
-		if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
-			mi->t = current;
-			mi->paddr = addr;
-			mi->restartable = c;
-			return;
-		}
-	}
-
-	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
-}
-
-static struct mce_info *mce_find_info(void)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
-		if (atomic_read(&mi->inuse) && mi->t == current)
-			return mi;
-	return NULL;
-}
-
-static void mce_clear_info(struct mce_info *mi)
-{
-	atomic_set(&mi->inuse, 0);
-}
-
-/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
@@ -1086,6 +1041,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
 	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
 	char *msg = "Unknown";
+	u64 recover_paddr = ~0ull;
+	int flags = MF_ACTION_REQUIRED;
 
 	prev_state = ist_enter(regs);
 
@@ -1207,9 +1164,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (no_way_out)
 			mce_panic("Fatal machine check on current CPU", &m, msg);
 		if (worst == MCE_AR_SEVERITY) {
-			/* schedule action before return to userland */
-			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
-			set_thread_flag(TIF_MCE_NOTIFY);
+			recover_paddr = m.addr;
+			if (!(m.mcgstatus & MCG_STATUS_RIPV))
+				flags |= MF_MUST_KILL;
 		} else if (kill_it) {
 			force_sig(SIGBUS, current);
 		}
@@ -1220,6 +1177,26 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
 	sync_core();
+
+	if (recover_paddr == ~0ull)
+		goto done;
+
+	pr_err("Uncorrected hardware memory error in user-access at %llx",
+		 recover_paddr);
+	/*
+	 * We must call memory_failure() here even if the current process is
+	 * doomed. We still need to mark the page as poisoned and alert any
+	 * other users of the page.
+	 */
+	ist_begin_non_atomic(regs);
+	local_irq_enable();
+	if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
+		pr_err("Memory error not recovered");
+		force_sig(SIGBUS, current);
+	}
+	local_irq_disable();
+	ist_end_non_atomic();
+done:
 	ist_exit(regs, prev_state);
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
@@ -1238,42 +1215,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Called in process context that interrupted by MCE and marked with
- * TIF_MCE_NOTIFY, just before returning to erroneous userland.
- * This code is allowed to sleep.
- * Attempt possible recovery such as calling the high level VM handler to
- * process any corrupted pages, and kill/signal current process if required.
- * Action required errors are handled here.
- */
-void mce_notify_process(void)
-{
-	unsigned long pfn;
-	struct mce_info *mi = mce_find_info();
-	int flags = MF_ACTION_REQUIRED;
-
-	if (!mi)
-		mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
-	pfn = mi->paddr >> PAGE_SHIFT;
-
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
-	pr_err("Uncorrected hardware memory error in user-access at %llx",
-		 mi->paddr);
-	/*
-	 * We must call memory_failure() here even if the current process is
-	 * doomed. We still need to mark the page as poisoned and alert any
-	 * other users of the page.
-	 */
-	if (!mi->restartable)
-		flags |= MF_MUST_KILL;
-	if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
-		pr_err("Memory error not recovered");
-		force_sig(SIGBUS, current);
-	}
-	mce_clear_info(mi);
-}
-
-/*
  * Action optional processing happens here (picking up
  * from the list of faulting pages that do_machine_check()
  * placed into the "ring").
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index ed37a768d0fc..2a33c8f68319 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -740,12 +740,6 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
 	user_exit();
 
-#ifdef CONFIG_X86_MCE
-	/* notify userspace of pending MCEs */
-	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
 	if (thread_info_flags & _TIF_UPROBE)
 		uprobe_notify_resume(regs);
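
The ordering constraint in the recovery tail above can be summarized
outside of diff form.  The sketch below uses hypothetical userspace stubs
(they only record call order; they are not the kernel primitives) to show
the required bracketing: ist_begin_non_atomic()/ist_end_non_atomic()
enclose the IRQ-enabled window, which in turn encloses the sleepable
memory_failure() call.

```c
/* Hypothetical userspace sketch of the do_machine_check() recovery tail.
 * Stubs record call order so the nesting can be checked mechanically. */
#include <assert.h>
#include <string.h>

static const char *calls[8];
static int ncalls;

static void record(const char *name) { calls[ncalls++] = name; }

/* Stubs standing in for the real kernel primitives. */
static void ist_begin_non_atomic(void) { record("ist_begin"); }
static void ist_end_non_atomic(void)   { record("ist_end"); }
static void local_irq_enable(void)     { record("irq_on"); }
static void local_irq_disable(void)    { record("irq_off"); }
static int  memory_failure(void)       { record("memory_failure"); return 0; }

void recover_tail(void)
{
	ist_begin_non_atomic();		/* scheduling permitted from here on */
	local_irq_enable();
	memory_failure();		/* may sleep: IRQs on, non-atomic */
	local_irq_disable();
	ist_end_non_atomic();
}
```

The outer pair marks the IST context as non-atomic before interrupts are
enabled, so memory_failure() can safely sleep; the teardown mirrors the
setup in reverse order.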
 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-06  0:44                                                       ` [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks Luck, Tony
@ 2015-01-06  1:01                                                         ` Andy Lutomirski
  2015-01-06 18:00                                                           ` Luck, Tony
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-06  1:01 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Mon, Jan 5, 2015 at 4:44 PM, Luck, Tony <tony.luck@intel.com> wrote:
> We now switch to the kernel stack when a machine check interrupts
> during user mode.  This means that we can perform recovery actions
> in the tail of do_machine_check()
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
>
> ---
> On top of Andy's x86/paranoid branch
> Andy: Should I really move that:
>         pr_err("Uncorrected hardware memory error ...
> inside the ist_begin_non_atomic() section?
>

I think I like it as is.

[...]

> @@ -1220,6 +1177,26 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>         mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>  out:
>         sync_core();
> +
> +       if (recover_paddr == ~0ull)
> +               goto done;
> +
> +       pr_err("Uncorrected hardware memory error in user-access at %llx",
> +                recover_paddr);

printk is safe from IRQ context, so this should be okay unless we've
totally screwed up.  And, if we totally screwed up, seeing this before
the BUGs in ist_begin_non_atomic would be nice.

> +       /*
> +        * We must call memory_failure() here even if the current process is
> +        * doomed. We still need to mark the page as poisoned and alert any
> +        * other users of the page.
> +        */
> +       ist_begin_non_atomic(regs);
> +       local_irq_enable();
> +       if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
> +               pr_err("Memory error not recovered");
> +               force_sig(SIGBUS, current);
> +       }
> +       local_irq_disable();
> +       ist_end_non_atomic();
> +done:
>         ist_exit(regs, prev_state);
>  }

For the context-related bits:

Reviewed-by: Andy Lutomirski <luto@amacapital.net>

Should I stick this in my -next branch so it can stew?

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC


* RE: [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-06  1:01                                                         ` Andy Lutomirski
@ 2015-01-06 18:00                                                           ` Luck, Tony
  2015-01-07 12:13                                                             ` Borislav Petkov
  0 siblings, 1 reply; 72+ messages in thread
From: Luck, Tony @ 2015-01-06 18:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker


> For the context-related bits:
>
> Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Thanks.

> Should I stick this in my -next branch so it can stew?

That's fine with me.  Boris ... any comments?

-Tony


* Re: [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-06 18:00                                                           ` Luck, Tony
@ 2015-01-07 12:13                                                             ` Borislav Petkov
  2015-01-07 15:51                                                               ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Borislav Petkov @ 2015-01-07 12:13 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Paul McKenney, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Tue, Jan 06, 2015 at 06:00:08PM +0000, Luck, Tony wrote:
> That's fine with me.  Boris ... any comments?

Same here:

Acked-by: Borislav Petkov <bp@suse.de>

I like negative diffstats:

 include/asm/mce.h         |    1
 include/asm/thread_info.h |    4 -
 kernel/cpu/mcheck/mce.c   |  109 ++++++++++------------------------------------
 kernel/signal.c           |    6 --
 4 files changed, 26 insertions(+), 94 deletions(-)

Procedural question: so this is going in, in the next merge window,
aimed at 3.20, right?

IMO that is a good plan as we have max time to deal with fallout, should
there be any.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-07 12:13                                                             ` Borislav Petkov
@ 2015-01-07 15:51                                                               ` Andy Lutomirski
  2015-01-07 15:58                                                                 ` Borislav Petkov
  2015-01-07 16:12                                                                 ` Paul E. McKenney
  0 siblings, 2 replies; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-07 15:51 UTC (permalink / raw)
  To: Borislav Petkov, Thomas Gleixner, Ingo Molnar
  Cc: Luck, Tony, Paul McKenney, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Andi Kleen, Josh Triplett,
	Frédéric Weisbecker

On Wed, Jan 7, 2015 at 4:13 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Jan 06, 2015 at 06:00:08PM +0000, Luck, Tony wrote:
>> That's fine with me.  Boris ... any comments?
>
> Same here:
>
> Acked-by: Borislav Petkov <bp@suse.de>
>
> I like negative diffstats:
>
>  include/asm/mce.h         |    1
>  include/asm/thread_info.h |    4 -
>  kernel/cpu/mcheck/mce.c   |  109 ++++++++++------------------------------------
>  kernel/signal.c           |    6 --
>  4 files changed, 26 insertions(+), 94 deletions(-)
>
> Procedural question: so this is going in, in the next merge window,
> aimed at 3.20, right?

Yes, assuming that Ingo and Thomas agree.

When does Paul usually send the rcu pull request?  I can send these as
a pull request or an emailed series after that happens.

--Andy

>
> IMO that is a good plan as we have max time to deal with fallout, should
> there be any.
>
> Thanks.
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-07 15:51                                                               ` Andy Lutomirski
@ 2015-01-07 15:58                                                                 ` Borislav Petkov
  2015-01-07 16:12                                                                 ` Paul E. McKenney
  1 sibling, 0 replies; 72+ messages in thread
From: Borislav Petkov @ 2015-01-07 15:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, Luck, Tony, Paul McKenney, X86 ML,
	Linus Torvalds, linux-kernel, Peter Zijlstra, Oleg Nesterov,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker

On Wed, Jan 07, 2015 at 07:51:26AM -0800, Andy Lutomirski wrote:
> When does Paul usually send the rcu pull request?  I can send these as
> a pull request or an emailed series after that happens.

I'd put a note in the pull request for tip guys to pull after Paul's
pull req., for example.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks
  2015-01-07 15:51                                                               ` Andy Lutomirski
  2015-01-07 15:58                                                                 ` Borislav Petkov
@ 2015-01-07 16:12                                                                 ` Paul E. McKenney
  1 sibling, 0 replies; 72+ messages in thread
From: Paul E. McKenney @ 2015-01-07 16:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, Luck, Tony,
	X86 ML, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Oleg Nesterov, Andi Kleen, Josh Triplett,
	Frédéric Weisbecker

On Wed, Jan 07, 2015 at 07:51:26AM -0800, Andy Lutomirski wrote:
> On Wed, Jan 7, 2015 at 4:13 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Tue, Jan 06, 2015 at 06:00:08PM +0000, Luck, Tony wrote:
> >> That's fine with me.  Boris ... any comments?
> >
> > Same here:
> >
> > Acked-by: Borislav Petkov <bp@suse.de>
> >
> > I like negative diffstats:
> >
> >  include/asm/mce.h         |    1
> >  include/asm/thread_info.h |    4 -
> >  kernel/cpu/mcheck/mce.c   |  109 ++++++++++------------------------------------
> >  kernel/signal.c           |    6 --
> >  4 files changed, 26 insertions(+), 94 deletions(-)
> >
> > Procedural question: so this is going in, in the next merge window,
> > aimed at 3.20, right?
> 
> Yes, assuming that Ingo and Thomas agree.
> 
> When does Paul usually send the rcu pull request?  I can send these as
> a pull request or an emailed series after that happens.

I normally send it to -tip at about -rc5.  Which is coming right up,
isn't it?  Guess I had better get my tree organized.  ;-)

							Thanx, Paul

> --Andy
> 
> >
> > IMO that is a good plan as we have max time to deal with fallout, should
> > there be any.
> >
> > Thanks.
> >
> > --
> > Regards/Gruss,
> >     Boris.
> >
> > Sent from a fat crate under my desk. Formatting is fine.
> > --
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
> 



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
                     ` (2 preceding siblings ...)
  2014-11-22 17:20   ` Borislav Petkov
@ 2015-01-22 21:52   ` Sasha Levin
  2015-01-23 17:58     ` Andy Lutomirski
  3 siblings, 1 reply; 72+ messages in thread
From: Sasha Levin @ 2015-01-22 21:52 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov, x86, Linus Torvalds
  Cc: linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Paul E. McKenney, Josh Triplett,
	Frédéric Weisbecker

On 11/21/2014 04:26 PM, Andy Lutomirski wrote:
> We currently pretend that IST context is like standard exception
> context, but this is incorrect.  IST entries from userspace are like
> standard exceptions except that they use per-cpu stacks, so they are
> atomic.  IST entries from kernel space are like NMIs from RCU's
> perspective -- they are not quiescent states even if they
> interrupted the kernel during a quiescent state.
> 
> Add and use ist_enter and ist_exit to track IST context.  Even
> though x86_32 has no IST stacks, we track these interrupts the same
> way.
> 
> This fixes two issues:
> 
>  - Scheduling from an IST interrupt handler will now warn.  It would
>    previously appear to work as long as we got lucky and nothing
>    overwrote the stack frame.  (I don't know of any bugs in this
>    that would trigger the warning, but it's good to be on the safe
>    side.)
> 
>  - RCU handling in IST context was dangerous.  As far as I know,
>    only machine checks were likely to trigger this, but it's good to
>    be on the safe side.
> 
> Note that the machine check handlers appear to have been missing
> any context tracking at all before this patch.

Hi Andy, Paul,

I *suspect* that the following is a result of this commit:

[  543.999079] ===============================
[  543.999079] [ INFO: suspicious RCU usage. ]
[  543.999079] 3.19.0-rc5-next-20150121-sasha-00064-g3c37e35-dirty #1809 Not tainted
[  543.999079] -------------------------------
[  543.999079] include/linux/rcupdate.h:892 rcu_read_lock() used illegally while idle!
[  543.999079]
[  543.999079] other info that might help us debug this:
[  543.999079]
[  543.999079]
[  543.999079] RCU used illegally from idle CPU!
[  543.999079] rcu_scheduler_active = 1, debug_locks = 1
[  543.999079] RCU used illegally from extended quiescent state!
[  543.999079] 1 lock held by trinity-main/15058:
[  543.999079] #0: (rcu_read_lock){......}, at: atomic_notifier_call_chain (kernel/notifier.c:192)
[  543.999079]
[  543.999079] stack backtrace:
[  543.999079] CPU: 16 PID: 15058 Comm: trinity-main Not tainted 3.19.0-rc5-next-20150121-sasha-00064-g3c37e35-dirty #1809
[  543.999079]  0000000000000000 0000000000000000 0000000000000001 ffff8801af907d88
[  543.999079]  ffffffff92e9e917 0000000000000011 ffff8801afcf8000 ffff8801af907db8
[  543.999079]  ffffffff815f5613 ffffffff9654d4a0 0000000000000003 ffff8801af907e28
[  543.999079] Call Trace:
[  543.999079] dump_stack (lib/dump_stack.c:52)
[  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
[  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
[  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
[  543.999079] notify_die (kernel/notifier.c:538)
[  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
[  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  543.999079] do_debug (arch/x86/kernel/traps.c:652)
[  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
[  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
[  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[  543.999079] debug (arch/x86/kernel/entry_64.S:1310)


Thanks,
Sasha


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-22 21:52   ` Sasha Levin
@ 2015-01-23 17:58     ` Andy Lutomirski
  2015-01-23 18:04       ` Borislav Petkov
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-23 17:58 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Paul E. McKenney, Josh Triplett, Frédéric Weisbecker

On Thu, Jan 22, 2015 at 1:52 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 11/21/2014 04:26 PM, Andy Lutomirski wrote:
>> We currently pretend that IST context is like standard exception
>> context, but this is incorrect.  IST entries from userspace are like
>> standard exceptions except that they use per-cpu stacks, so they are
>> atomic.  IST entries from kernel space are like NMIs from RCU's
>> perspective -- they are not quiescent states even if they
>> interrupted the kernel during a quiescent state.
>>
>> Add and use ist_enter and ist_exit to track IST context.  Even
>> though x86_32 has no IST stacks, we track these interrupts the same
>> way.
>>
>> This fixes two issues:
>>
>>  - Scheduling from an IST interrupt handler will now warn.  It would
>>    previously appear to work as long as we got lucky and nothing
>>    overwrote the stack frame.  (I don't know of any bugs in this
>>    that would trigger the warning, but it's good to be on the safe
>>    side.)
>>
>>  - RCU handling in IST context was dangerous.  As far as I know,
>>    only machine checks were likely to trigger this, but it's good to
>>    be on the safe side.
>>
>> Note that the machine check handlers appear to have been missing
>> any context tracking at all before this patch.
>
> Hi Andy, Paul,
>
> I *suspect* that the following is a result of this commit:
>
> [  543.999079] ===============================
> [  543.999079] [ INFO: suspicious RCU usage. ]
> [  543.999079] 3.19.0-rc5-next-20150121-sasha-00064-g3c37e35-dirty #1809 Not tainted
> [  543.999079] -------------------------------
> [  543.999079] include/linux/rcupdate.h:892 rcu_read_lock() used illegally while idle!
> [  543.999079]
> [  543.999079] other info that might help us debug this:
> [  543.999079]
> [  543.999079]
> [  543.999079] RCU used illegally from idle CPU!
> [  543.999079] rcu_scheduler_active = 1, debug_locks = 1
> [  543.999079] RCU used illegally from extended quiescent state!
> [  543.999079] 1 lock held by trinity-main/15058:
> [  543.999079] #0: (rcu_read_lock){......}, at: atomic_notifier_call_chain (kernel/notifier.c:192)
> [  543.999079]
> [  543.999079] stack backtrace:
> [  543.999079] CPU: 16 PID: 15058 Comm: trinity-main Not tainted 3.19.0-rc5-next-20150121-sasha-00064-g3c37e35-dirty #1809
> [  543.999079]  0000000000000000 0000000000000000 0000000000000001 ffff8801af907d88
> [  543.999079]  ffffffff92e9e917 0000000000000011 ffff8801afcf8000 ffff8801af907db8
> [  543.999079]  ffffffff815f5613 ffffffff9654d4a0 0000000000000003 ffff8801af907e28
> [  543.999079] Call Trace:
> [  543.999079] dump_stack (lib/dump_stack.c:52)
> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
> [  543.999079] notify_die (kernel/notifier.c:538)
> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)

I don't know how to read this stack trace.  Are we in do_int3,
do_debug, or both?  I didn't change do_debug at all.

I think that nesting exception_enter inside rcu_nmi_enter should be
okay (and it had better be, even in old kernels, because I think perf
does that).

So you have any idea what you (or trinity) did to trigger this?

--Andy


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-23 17:58     ` Andy Lutomirski
@ 2015-01-23 18:04       ` Borislav Petkov
  2015-01-23 18:34         ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Borislav Petkov @ 2015-01-23 18:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sasha Levin, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Paul E. McKenney, Josh Triplett, Frédéric Weisbecker

On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
> > [  543.999079] Call Trace:
> > [  543.999079] dump_stack (lib/dump_stack.c:52)
> > [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
> > [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
> > [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
> > [  543.999079] notify_die (kernel/notifier.c:538)
> > [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
> > [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> > [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
> > [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
> > [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
> > [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
> > [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
> 
> I don't know how to read this stack trace.  Are we in do_int3,
> do_debug, or both?  I didn't change do_debug at all.

It looks like we're in do_debug. do_int3 is only on the stack but not
part of the current frame if I can trust the '?' ...

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-23 18:04       ` Borislav Petkov
@ 2015-01-23 18:34         ` Andy Lutomirski
  2015-01-23 20:48           ` Sasha Levin
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-23 18:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sasha Levin, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Paul E. McKenney, Josh Triplett, Frédéric Weisbecker

On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>> > [  543.999079] Call Trace:
>> > [  543.999079] dump_stack (lib/dump_stack.c:52)
>> > [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>> > [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>> > [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>> > [  543.999079] notify_die (kernel/notifier.c:538)
>> > [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>> > [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> > [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>> > [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>> > [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>> > [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>> > [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>
>> I don't know how to read this stack trace.  Are we in do_int3,
>> do_debug, or both?  I didn't change do_debug at all.
>
> It looks like we're in do_debug. do_int3 is only on the stack but not
> part of the current frame if I can trust the '?' ...
>

It's possible that an int3 happened and I did something wrong on
return that caused a subsequent do_debug to screw up, but I don't see
how my patch would have caused that.

Were there any earlier log messages?

--Andy


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-23 18:34         ` Andy Lutomirski
@ 2015-01-23 20:48           ` Sasha Levin
  2015-01-24  1:25             ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Sasha Levin @ 2015-01-23 20:48 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov
  Cc: X86 ML, Linus Torvalds, linux-kernel, Peter Zijlstra,
	Oleg Nesterov, Tony Luck, Andi Kleen, Paul E. McKenney,
	Josh Triplett, Frédéric Weisbecker

On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>> [  543.999079] Call Trace:
>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>
>>> I don't know how to read this stack trace.  Are we in do_int3,
>>> do_debug, or both?  I didn't change do_debug at all.
>>
>> It looks like we're in do_debug. do_int3 is only on the stack but not
>> part of the current frame if I can trust the '?' ...
>>
> 
> It's possible that an int3 happened and I did something wrong on
> return that caused a subsequent do_debug to screw up, but I don't see
> how my patch would have caused that.
> 
> Were there any earlier log messages?

Nope, nothing odd before or after.


Thanks,
Sasha



* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-23 20:48           ` Sasha Levin
@ 2015-01-24  1:25             ` Andy Lutomirski
  2015-01-28 16:33               ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-24  1:25 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Paul E. McKenney, Josh Triplett, Frédéric Weisbecker

On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>>> [  543.999079] Call Trace:
>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>>
>>>> I don't know how to read this stack trace.  Are we in do_int3,
>>>> do_debug, or both?  I didn't change do_debug at all.
>>>
>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>>> part of the current frame if I can trust the '?' ...
>>>
>>
>> It's possible that an int3 happened and I did something wrong on
>> return that caused a subsequent do_debug to screw up, but I don't see
>> how my patch would have caused that.
>>
>> Were there any earlier log messages?
>
> Nope, nothing odd before or after.

Trinity just survived for a decent amount of time for me with my
patches, other than a bunch of apparently expected OOM kills.  I have
no idea how to tell trinity how much memory to use.

--Andy


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-24  1:25             ` Andy Lutomirski
@ 2015-01-28 16:33               ` Andy Lutomirski
  2015-01-28 17:48                 ` Paul E. McKenney
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-28 16:33 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Paul E. McKenney, Josh Triplett, Frédéric Weisbecker

On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>>>> [  543.999079] Call Trace:
>>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>>>
>>>>> I don't know how to read this stack trace.  Are we in do_int3,
>>>>> do_debug, or both?  I didn't change do_debug at all.
>>>>
>>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>>>> part of the current frame if I can trust the '?' ...
>>>>
>>>
>>> It's possible that an int3 happened and I did something wrong on
>>> return that caused a subsequent do_debug to screw up, but I don't see
>>> how my patch would have caused that.
>>>
>>> Were there any earlier log messages?
>>
>> Nope, nothing odd before or after.
>
> Trinity just survived for a decent amount of time for me with my
> patches, other than a bunch of apparently expected OOM kills.  I have
> no idea how to tell trinity how much memory to use.
>

A longer trinity run on a larger VM survived (still with some OOM
kills, but no taint) with these patches.  I suspect that it's a
regression somewhere else in the RCU changes.  I have
CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
I think.

--Andy


* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-28 16:33               ` Andy Lutomirski
@ 2015-01-28 17:48                 ` Paul E. McKenney
  2015-01-28 21:02                   ` Andy Lutomirski
  0 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2015-01-28 17:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sasha Levin, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker

On Wed, Jan 28, 2015 at 08:33:06AM -0800, Andy Lutomirski wrote:
> On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> >> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
> >>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
> >>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
> >>>>>> [  543.999079] Call Trace:
> >>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
> >>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
> >>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
> >>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
> >>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
> >>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
> >>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> >>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
> >>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
> >>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
> >>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
> >>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
> >>>>>
> >>>>> I don't know how to read this stack trace.  Are we in do_int3,
> >>>>> do_debug, or both?  I didn't change do_debug at all.
> >>>>
> >>>> It looks like we're in do_debug. do_int3 is only on the stack but not
> >>>> part of the current frame if I can trust the '?' ...
> >>>>
> >>>
> >>> It's possible that an int3 happened and I did something wrong on
> >>> return that caused a subsequent do_debug to screw up, but I don't see
> >>> how my patch would have caused that.
> >>>
> >>> Were there any earlier log messages?
> >>
> >> Nope, nothing odd before or after.
> >
> > Trinity just survived for a decent amount of time for me with my
> > patches, other than a bunch of apparently expected OOM kills.  I have
> > no idea how to tell trinity how much memory to use.
> 
> A longer trinity run on a larger VM survived (still with some OOM
> kills, but no taint) with these patches.  I suspect that it's a
> regression somewhere else in the RCU changes.  I have
> CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
> I think.

If by "RCU changes" you mean my changes to the RCU infrastructure, I am
going to need more of a hint than I see in this thread thus far.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-28 17:48                 ` Paul E. McKenney
@ 2015-01-28 21:02                   ` Andy Lutomirski
  2015-01-30 19:57                     ` Sasha Levin
  0 siblings, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-28 21:02 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Sasha Levin, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker

On Wed, Jan 28, 2015 at 9:48 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Wed, Jan 28, 2015 at 08:33:06AM -0800, Andy Lutomirski wrote:
>> On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> >> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>> >>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>> >>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>> >>>>>> [  543.999079] Call Trace:
>> >>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>> >>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>> >>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>> >>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>> >>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>> >>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>> >>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> >>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>> >>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>> >>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>> >>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>> >>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>> >>>>>
>> >>>>> I don't know how to read this stack trace.  Are we in do_int3,
>> >>>>> do_debug, or both?  I didn't change do_debug at all.
>> >>>>
>> >>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>> >>>> part of the current frame if I can trust the '?' ...
>> >>>>
>> >>>
>> >>> It's possible that an int3 happened and I did something wrong on
>> >>> return that caused a subsequent do_debug to screw up, but I don't see
>> >>> how my patch would have caused that.
>> >>>
>> >>> Were there any earlier log messages?
>> >>
>> >> Nope, nothing odd before or after.
>> >
>> > Trinity just survived for a decent amount of time for me with my
>> > patches, other than a bunch of apparently expected OOM kills.  I have
>> > no idea how to tell trinity how much memory to use.
>>
>> A longer trinity run on a larger VM survived (still with some OOM
>> kills, but no taint) with these patches.  I suspect that it's a
>> regression somewhere else in the RCU changes.  I have
>> CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
>> I think.
>
> If by "RCU changes" you mean my changes to the RCU infrastructure, I am
> going to need more of a hint than I see in this thread thus far.  ;-)
>

I can't help much, since I can't reproduce the problem.  Presumably if
it's a bug in -tip, someone else will trigger it, too.

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-28 21:02                   ` Andy Lutomirski
@ 2015-01-30 19:57                     ` Sasha Levin
  2015-01-31  1:28                       ` Sasha Levin
  0 siblings, 1 reply; 72+ messages in thread
From: Sasha Levin @ 2015-01-30 19:57 UTC (permalink / raw)
  To: Andy Lutomirski, Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On 01/28/2015 04:02 PM, Andy Lutomirski wrote:
> On Wed, Jan 28, 2015 at 9:48 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
>> On Wed, Jan 28, 2015 at 08:33:06AM -0800, Andy Lutomirski wrote:
>>> On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>>>> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>>>>>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>>>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>>>>>>> [  543.999079] Call Trace:
>>>>>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>>>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>>>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>>>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>>>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>>>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>>>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>>>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>>>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>>>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>>>>>>
>>>>>>>> I don't know how to read this stack trace.  Are we in do_int3,
>>>>>>>> do_debug, or both?  I didn't change do_debug at all.
>>>>>>>
>>>>>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>>>>>>> part of the current frame if I can trust the '?' ...
>>>>>>>
>>>>>>
>>>>>> It's possible that an int3 happened and I did something wrong on
>>>>>> return that caused a subsequent do_debug to screw up, but I don't see
>>>>>> how my patch would have caused that.
>>>>>>
>>>>>> Were there any earlier log messages?
>>>>>
>>>>> Nope, nothing odd before or after.
>>>>
>>>> Trinity just survived for a decent amount of time for me with my
>>>> patches, other than a bunch of apparently expected OOM kills.  I have
>>>> no idea how to tell trinity how much memory to use.
>>>
>>> A longer trinity run on a larger VM survived (still with some OOM
>>> kills, but no taint) with these patches.  I suspect that it's a
>>> regression somewhere else in the RCU changes.  I have
>>> CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
>>> I think.
>>
>> If by "RCU changes" you mean my changes to the RCU infrastructure, I am
>> going to need more of a hint than I see in this thread thus far.  ;-)
>>
> 
> I can't help much, since I can't reproduce the problem.  Presumably if
> it's a bug in -tip, someone else will trigger it, too.

I'm not sure what to tell you here, I'm not using any weird options for trinity
to reproduce it.

It doesn't happen too frequently, but I still see it happening.

Would you like me to try a debug patch or something similar?


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-30 19:57                     ` Sasha Levin
@ 2015-01-31  1:28                       ` Sasha Levin
  2015-01-31  3:12                         ` Andy Lutomirski
  2015-01-31 13:01                         ` [PATCH] x86, traps: Fix ist_enter from userspace Andy Lutomirski
  0 siblings, 2 replies; 72+ messages in thread
From: Sasha Levin @ 2015-01-31  1:28 UTC (permalink / raw)
  To: Andy Lutomirski, Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Oleg Nesterov, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On 01/30/2015 02:57 PM, Sasha Levin wrote:
> On 01/28/2015 04:02 PM, Andy Lutomirski wrote:
>> On Wed, Jan 28, 2015 at 9:48 AM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>>> On Wed, Jan 28, 2015 at 08:33:06AM -0800, Andy Lutomirski wrote:
>>>> On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>>>>> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>>>>>>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>>>>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>>>>>>>> [  543.999079] Call Trace:
>>>>>>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>>>>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>>>>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>>>>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>>>>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>>>>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>>>>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>>>>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>>>>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>>>>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>>>>>>>
>>>>>>>>> I don't know how to read this stack trace.  Are we in do_int3,
>>>>>>>>> do_debug, or both?  I didn't change do_debug at all.
>>>>>>>>
>>>>>>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>>>>>>>> part of the current frame if I can trust the '?' ...
>>>>>>>>
>>>>>>>
>>>>>>> It's possible that an int3 happened and I did something wrong on
>>>>>>> return that caused a subsequent do_debug to screw up, but I don't see
>>>>>>> how my patch would have caused that.
>>>>>>>
>>>>>>> Were there any earlier log messages?
>>>>>>
>>>>>> Nope, nothing odd before or after.
>>>>>
>>>>> Trinity just survived for a decent amount of time for me with my
>>>>> patches, other than a bunch of apparently expected OOM kills.  I have
>>>>> no idea how to tell trinity how much memory to use.
>>>>
>>>> A longer trinity run on a larger VM survived (still with some OOM
>>>> kills, but no taint) with these patches.  I suspect that it's a
>>>> regression somewhere else in the RCU changes.  I have
>>>> CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
>>>> I think.
>>>
>>> If by "RCU changes" you mean my changes to the RCU infrastructure, I am
>>> going to need more of a hint than I see in this thread thus far.  ;-)
>>>
>>
>> I can't help much, since I can't reproduce the problem.  Presumably if
>> it's a bug in -tip, someone else will trigger it, too.
> 
> I'm not sure what to tell you here, I'm not using any weird options for trinity
> to reproduce it.
> 
> It doesn't happen too frequently, but I still see it happening.
> 
> Would you like me to try a debug patch or something similar?

After talking with Paul we know what's going on here:

do_debug() calls ist_enter() to indicate we're running on the interrupt
stack. The first thing ist_enter() does is:

        preempt_count_add(HARDIRQ_OFFSET);

After this, as far as the kernel is concerned, we're in interrupt mode
so in_interrupt() will return true.

Next, we'll call exception_enter() which won't do anything since:

	void context_tracking_user_exit(void)
	{
	        unsigned long flags;
	
	        if (!context_tracking_is_enabled())
	                return;
	
	        if (in_interrupt())  <=== This returns true, so nothing else gets done
	                return;

At this stage we never tell RCU that we exited user mode, but then we
try to use RCU when calling the notifiers, which explains the warnings I'm seeing.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-31  1:28                       ` Sasha Levin
@ 2015-01-31  3:12                         ` Andy Lutomirski
  2015-01-31 12:50                           ` Andy Lutomirski
  2015-01-31 13:01                         ` [PATCH] x86, traps: Fix ist_enter from userspace Andy Lutomirski
  1 sibling, 1 reply; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-31  3:12 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Paul McKenney, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker

On Fri, Jan 30, 2015 at 5:28 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
> On 01/30/2015 02:57 PM, Sasha Levin wrote:
>> On 01/28/2015 04:02 PM, Andy Lutomirski wrote:
>>> On Wed, Jan 28, 2015 at 9:48 AM, Paul E. McKenney
>>> <paulmck@linux.vnet.ibm.com> wrote:
>>>> On Wed, Jan 28, 2015 at 08:33:06AM -0800, Andy Lutomirski wrote:
>>>>> On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>> On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>>>>>> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>>>>>>>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>>>>>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>>>>>>>>> [  543.999079] Call Trace:
>>>>>>>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>>>>>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>>>>>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>>>>>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>>>>>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>>>>>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>>>>>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>>>>>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>>>>>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>>>>>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>>>>>>>>
>>>>>>>>>> I don't know how to read this stack trace.  Are we in do_int3,
>>>>>>>>>> do_debug, or both?  I didn't change do_debug at all.
>>>>>>>>>
>>>>>>>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>>>>>>>>> part of the current frame if I can trust the '?' ...
>>>>>>>>>
>>>>>>>>
>>>>>>>> It's possible that an int3 happened and I did something wrong on
>>>>>>>> return that caused a subsequent do_debug to screw up, but I don't see
>>>>>>>> how my patch would have caused that.
>>>>>>>>
>>>>>>>> Were there any earlier log messages?
>>>>>>>
>>>>>>> Nope, nothing odd before or after.
>>>>>>
>>>>>> Trinity just survived for a decent amount of time for me with my
>>>>>> patches, other than a bunch of apparently expected OOM kills.  I have
>>>>>> no idea how to tell trinity how much memory to use.
>>>>>
>>>>> A longer trinity run on a larger VM survived (still with some OOM
>>>>> kills, but no taint) with these patches.  I suspect that it's a
>>>>> regression somewhere else in the RCU changes.  I have
>>>>> CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
>>>>> I think.
>>>>
>>>> If by "RCU changes" you mean my changes to the RCU infrastructure, I am
>>>> going to need more of a hint than I see in this thread thus far.  ;-)
>>>>
>>>
>>> I can't help much, since I can't reproduce the problem.  Presumably if
>>> it's a bug in -tip, someone else will trigger it, too.
>>
>> I'm not sure what to tell you here, I'm not using any weird options for trinity
>> to reproduce it.
>>
>> It doesn't happen too frequently, but I still see it happening.
>>
>> Would you like me to try a debug patch or something similar?
>
> After talking with Paul we know what's going on here:
>
> do_debug() calls ist_enter() to indicate we're running on the interrupt
> stack. The first thing ist_enter() does is:

I wonder whether there's an easy way to trigger this.  Probably a
watchpoint on the user stack would do the trick.

>
>         preempt_count_add(HARDIRQ_OFFSET);
>
> After this, as far as the kernel is concerned, we're in interrupt mode
> so in_interrupt() will return true.
>
> Next, we'll call exception_enter() which won't do anything since:
>
>         void context_tracking_user_exit(void)
>         {
>                 unsigned long flags;
>
>                 if (!context_tracking_is_enabled())
>                         return;
>
>                 if (in_interrupt())  <=== This returns true, so nothing else gets done
>                         return;
>
> At this stage we never tell RCU that we exited user mode, but then we
> try to use it calling the notifiers, which explains the warnings I'm seeing.
>

Is fixing this as simple as calling exception_enter before
incrementing the preempt count?  I'll try to have a tested patch
tomorrow.

Thanks for tracking this down!  I've been out of town since you
reported this, so I haven't had enough time to track it down myself.

--Andy

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context
  2015-01-31  3:12                         ` Andy Lutomirski
@ 2015-01-31 12:50                           ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-31 12:50 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Paul McKenney, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	Andi Kleen, Josh Triplett, Frédéric Weisbecker

On Fri, Jan 30, 2015 at 7:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, Jan 30, 2015 at 5:28 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>> On 01/30/2015 02:57 PM, Sasha Levin wrote:
>>> On 01/28/2015 04:02 PM, Andy Lutomirski wrote:
>>>> On Wed, Jan 28, 2015 at 9:48 AM, Paul E. McKenney
>>>> <paulmck@linux.vnet.ibm.com> wrote:
>>>>> On Wed, Jan 28, 2015 at 08:33:06AM -0800, Andy Lutomirski wrote:
>>>>>> On Fri, Jan 23, 2015 at 5:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>>>> On Fri, Jan 23, 2015 at 12:48 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>>>>>>> On 01/23/2015 01:34 PM, Andy Lutomirski wrote:
>>>>>>>>> On Fri, Jan 23, 2015 at 10:04 AM, Borislav Petkov <bp@alien8.de> wrote:
>>>>>>>>>> On Fri, Jan 23, 2015 at 09:58:01AM -0800, Andy Lutomirski wrote:
>>>>>>>>>>>> [  543.999079] Call Trace:
>>>>>>>>>>>> [  543.999079] dump_stack (lib/dump_stack.c:52)
>>>>>>>>>>>> [  543.999079] lockdep_rcu_suspicious (kernel/locking/lockdep.c:4259)
>>>>>>>>>>>> [  543.999079] atomic_notifier_call_chain (include/linux/rcupdate.h:892 kernel/notifier.c:182 kernel/notifier.c:193)
>>>>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:192)
>>>>>>>>>>>> [  543.999079] notify_die (kernel/notifier.c:538)
>>>>>>>>>>>> [  543.999079] ? atomic_notifier_call_chain (kernel/notifier.c:538)
>>>>>>>>>>>> [  543.999079] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>>>>>>>>>>>> [  543.999079] do_debug (arch/x86/kernel/traps.c:652)
>>>>>>>>>>>> [  543.999079] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
>>>>>>>>>>>> [  543.999079] ? do_int3 (arch/x86/kernel/traps.c:610)
>>>>>>>>>>>> [  543.999079] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
>>>>>>>>>>>> [  543.999079] debug (arch/x86/kernel/entry_64.S:1310)
>>>>>>>>>>>
>>>>>>>>>>> I don't know how to read this stack trace.  Are we in do_int3,
>>>>>>>>>>> do_debug, or both?  I didn't change do_debug at all.
>>>>>>>>>>
>>>>>>>>>> It looks like we're in do_debug. do_int3 is only on the stack but not
>>>>>>>>>> part of the current frame if I can trust the '?' ...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It's possible that an int3 happened and I did something wrong on
>>>>>>>>> return that caused a subsequent do_debug to screw up, but I don't see
>>>>>>>>> how my patch would have caused that.
>>>>>>>>>
>>>>>>>>> Were there any earlier log messages?
>>>>>>>>
>>>>>>>> Nope, nothing odd before or after.
>>>>>>>
>>>>>>> Trinity just survived for a decent amount of time for me with my
>>>>>>> patches, other than a bunch of apparently expected OOM kills.  I have
>>>>>>> no idea how to tell trinity how much memory to use.
>>>>>>
>>>>>> A longer trinity run on a larger VM survived (still with some OOM
>>>>>> kills, but no taint) with these patches.  I suspect that it's a
>>>>>> regression somewhere else in the RCU changes.  I have
>>>>>> CONFIG_PROVE_RCU=y, so I should have seen the failure if it was there,
>>>>>> I think.
>>>>>
>>>>> If by "RCU changes" you mean my changes to the RCU infrastructure, I am
>>>>> going to need more of a hint than I see in this thread thus far.  ;-)
>>>>>
>>>>
>>>> I can't help much, since I can't reproduce the problem.  Presumably if
>>>> it's a bug in -tip, someone else will trigger it, too.
>>>
>>> I'm not sure what to tell you here, I'm not using any weird options for trinity
>>> to reproduce it.
>>>
>>> It doesn't happen too frequently, but I still see it happening.
>>>
>>> Would you like me to try a debug patch or something similar?
>>
>> After talking with Paul we know what's going on here:
>>
>> do_debug() calls ist_enter() to indicate we're running on the interrupt
>> stack. The first thing ist_enter() does is:
>
> I wonder whether there's an easy way to trigger this.  Probably a
> watchpoint on the user stack would do the trick.

This is embarrassing.  I just stuck an assertion in do_int3 and I can
reproduce it with int3 from user space.  Patch coming.

>
>>
>>         preempt_count_add(HARDIRQ_OFFSET);
>>
>> After this, as far as the kernel is concerned, we're in interrupt mode
>> so in_interrupt() will return true.
>>
>> Next, we'll call exception_enter() which won't do anything since:
>>
>>         void context_tracking_user_exit(void)
>>         {
>>                 unsigned long flags;
>>
>>                 if (!context_tracking_is_enabled())
>>                         return;
>>
>>                 if (in_interrupt())  <=== This returns true, so nothing else gets done
>>                         return;
>>
>> At this stage we never tell RCU that we exited user mode, but then we
>> try to use it calling the notifiers, which explains the warnings I'm seeing.
>>
>
> Is fixing this as simple as calling exception_enter before
> incrementing the preempt count?  I'll try to have a tested patch
> tomorrow.
>
> Thanks for tracking this down!  I've been out of town since you
> reported this, so I haven't had enough time to track it down myself.
>
> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH] x86, traps: Fix ist_enter from userspace
  2015-01-31  1:28                       ` Sasha Levin
  2015-01-31  3:12                         ` Andy Lutomirski
@ 2015-01-31 13:01                         ` Andy Lutomirski
  2015-01-31 15:09                           ` Sasha Levin
                                             ` (2 more replies)
  1 sibling, 3 replies; 72+ messages in thread
From: Andy Lutomirski @ 2015-01-31 13:01 UTC (permalink / raw)
  To: Sasha Levin, Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Tony Luck, Andi Kleen, Josh Triplett,
	Frédéric Weisbecker, Andy Lutomirski

context_tracking_user_exit() has no effect if in_interrupt() returns true,
so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
context_tracking_user_exit(), before incrementing the preempt count.

This also adds an assertion that will catch the problem reliably if
CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.

Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/traps.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7176f84f95a4..c74f2f5652da 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -110,15 +110,11 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
 
 enum ctx_state ist_enter(struct pt_regs *regs)
 {
-	/*
-	 * We are atomic because we're on the IST stack (or we're on x86_32,
-	 * in which case we still shouldn't schedule.
-	 */
-	preempt_count_add(HARDIRQ_OFFSET);
+	enum ctx_state prev_state;
 
 	if (user_mode_vm(regs)) {
 		/* Other than that, we're just an exception. */
-		return exception_enter();
+		prev_state = exception_enter();
 	} else {
 		/*
 		 * We might have interrupted pretty much anything.  In
@@ -127,12 +123,27 @@ enum ctx_state ist_enter(struct pt_regs *regs)
 		 * but we need to notify RCU.
 		 */
 		rcu_nmi_enter();
-		return IN_KERNEL;  /* the value is irrelevant. */
+		prev_state = IN_KERNEL;  /* the value is irrelevant. */
 	}
+
+	/*
+	 * We are atomic because we're on the IST stack (or we're on x86_32,
+	 * in which case we still shouldn't schedule).
+	 *
+	 * This must be after exception_enter(), because exception_enter()
+	 * won't do anything if in_interrupt() returns true.
+	 */
+	preempt_count_add(HARDIRQ_OFFSET);
+
+	/* This code is a bit fragile.  Test it. */
+	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
+
+	return prev_state;
 }
 
 void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
 {
+	/* Must be before exception_exit. */
 	preempt_count_sub(HARDIRQ_OFFSET);
 
 	if (user_mode_vm(regs))
-- 
2.1.0


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] x86, traps: Fix ist_enter from userspace
  2015-01-31 13:01                         ` [PATCH] x86, traps: Fix ist_enter from userspace Andy Lutomirski
@ 2015-01-31 15:09                           ` Sasha Levin
  2015-01-31 16:18                           ` Paul E. McKenney
  2015-02-04  6:01                           ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2 siblings, 0 replies; 72+ messages in thread
From: Sasha Levin @ 2015-01-31 15:09 UTC (permalink / raw)
  To: Andy Lutomirski, Paul McKenney
  Cc: Borislav Petkov, X86 ML, Linus Torvalds, linux-kernel,
	Peter Zijlstra, Tony Luck, Andi Kleen, Josh Triplett,
	Frédéric Weisbecker

On 01/31/2015 08:01 AM, Andy Lutomirski wrote:
> context_tracking_user_exit() has no effect if in_interrupt() returns true,
> so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
> context_tracking_user_exit(), before incrementing the preempt count.
> 
> This also adds an assertion that will catch the problem reliably if
> CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.
> 
> Fixes: 959274753857 x86, traps: Track entry into and exit from IST context
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>

I was running something like that overnight, looks like it fixes the problem.

	Tested-by: Sasha Levin <sasha.levin@oracle.com>


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] x86, traps: Fix ist_enter from userspace
  2015-01-31 13:01                         ` [PATCH] x86, traps: Fix ist_enter from userspace Andy Lutomirski
  2015-01-31 15:09                           ` Sasha Levin
@ 2015-01-31 16:18                           ` Paul E. McKenney
  2015-02-01  2:17                             ` Andy Lutomirski
  2015-02-04  6:01                           ` [tip:x86/asm] " tip-bot for Andy Lutomirski
  2 siblings, 1 reply; 72+ messages in thread
From: Paul E. McKenney @ 2015-01-31 16:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sasha Levin, Borislav Petkov, X86 ML, Linus Torvalds,
	linux-kernel, Peter Zijlstra, Tony Luck, Andi Kleen,
	Josh Triplett, Frédéric Weisbecker

On Sat, Jan 31, 2015 at 05:01:34AM -0800, Andy Lutomirski wrote:
> context_tracking_user_exit() has no effect if in_interrupt() returns true,
> so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
> context_tracking_user_exit(), before incrementing the preempt count.
> 
> This also adds an assertion that will catch the problem reliably if
> CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.
> 
> Fixes: 959274753857 x86, traps: Track entry into and exit from IST context
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/kernel/traps.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 7176f84f95a4..c74f2f5652da 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -110,15 +110,11 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
> 
>  enum ctx_state ist_enter(struct pt_regs *regs)
>  {
> -	/*
> -	 * We are atomic because we're on the IST stack (or we're on x86_32,
> -	 * in which case we still shouldn't schedule.
> -	 */
> -	preempt_count_add(HARDIRQ_OFFSET);
> +	enum ctx_state prev_state;
> 
>  	if (user_mode_vm(regs)) {
>  		/* Other than that, we're just an exception. */
> -		return exception_enter();
> +		prev_state = exception_enter();
>  	} else {
>  		/*
>  		 * We might have interrupted pretty much anything.  In
> @@ -127,12 +123,27 @@ enum ctx_state ist_enter(struct pt_regs *regs)
>  		 * but we need to notify RCU.
>  		 */
>  		rcu_nmi_enter();
> -		return IN_KERNEL;  /* the value is irrelevant. */
> +		prev_state = IN_KERNEL;  /* the value is irrelevant. */
>  	}
> +
> +	/*
> +	 * We are atomic because we're on the IST stack (or we're on x86_32,
> +	 * in which case we still shouldn't schedule).
> +	 *
> +	 * This must be after exception_enter(), because exception_enter()
> +	 * won't do anything if in_interrupt() returns true.
> +	 */
> +	preempt_count_add(HARDIRQ_OFFSET);

So the reason we need this here is that ist_exit() does not have
access to the bits tested by user_mode_vm()?  Or is the idea to
avoid an additional test in ist_exit()?  (Wondering why this isn't
in the above "else" clause.)

							Thanx, Paul

> +
> +	/* This code is a bit fragile.  Test it. */
> +	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
> +
> +	return prev_state;
>  }
> 
>  void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
>  {
> +	/* Must be before exception_exit. */
>  	preempt_count_sub(HARDIRQ_OFFSET);
> 
>  	if (user_mode_vm(regs))
> -- 
> 2.1.0
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH] x86, traps: Fix ist_enter from userspace
  2015-01-31 16:18                           ` Paul E. McKenney
@ 2015-02-01  2:17                             ` Andy Lutomirski
  0 siblings, 0 replies; 72+ messages in thread
From: Andy Lutomirski @ 2015-02-01  2:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Borislav Petkov, X86 ML, Sasha Levin,
	Peter Zijlstra, Linus Torvalds, Josh Triplett, Andi Kleen,
	linux-kernel, Tony Luck

On Jan 31, 2015 12:18 PM, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>
> On Sat, Jan 31, 2015 at 05:01:34AM -0800, Andy Lutomirski wrote:
> > context_tracking_user_exit() has no effect if in_interrupt() returns true,
> > so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
> > context_tracking_user_exit(), before incrementing the preempt count.
> >
> > This also adds an assertion that will catch the problem reliably if
> > CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.
> >
> > Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
> > Reported-by: Sasha Levin <sasha.levin@oracle.com>
> > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > ---
> >  arch/x86/kernel/traps.c | 25 ++++++++++++++++++-------
> >  1 file changed, 18 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index 7176f84f95a4..c74f2f5652da 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -110,15 +110,11 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
> >
> >  enum ctx_state ist_enter(struct pt_regs *regs)
> >  {
> > -     /*
> > -      * We are atomic because we're on the IST stack (or we're on x86_32,
> > -      * in which case we still shouldn't schedule.
> > -      */
> > -     preempt_count_add(HARDIRQ_OFFSET);
> > +     enum ctx_state prev_state;
> >
> >       if (user_mode_vm(regs)) {
> >               /* Other than that, we're just an exception. */
> > -             return exception_enter();
> > +             prev_state = exception_enter();
> >       } else {
> >               /*
> >                * We might have interrupted pretty much anything.  In
> > @@ -127,12 +123,27 @@ enum ctx_state ist_enter(struct pt_regs *regs)
> >                * but we need to notify RCU.
> >                */
> >               rcu_nmi_enter();
> > -             return IN_KERNEL;  /* the value is irrelevant. */
> > +             prev_state = IN_KERNEL;  /* the value is irrelevant. */
> >       }
> > +
> > +     /*
> > +      * We are atomic because we're on the IST stack (or we're on x86_32,
> > +      * in which case we still shouldn't schedule).
> > +      *
> > +      * This must be after exception_enter(), because exception_enter()
> > +      * won't do anything if in_interrupt() returns true.
> > +      */
> > +     preempt_count_add(HARDIRQ_OFFSET);
>
> So the reason we need this here is that ist_exit() does not have
> access to the bits tested by user_mode_vm()?  Or is the idea to
> avoid an additional test in ist_exit()?  (Wondering why this isn't
> in the above "else" clause.)

It's a result of Linus's suggestion here:

https://lkml.org/lkml/2014/11/19/1131

The idea is that scheduling in IST context is generally bad, and,
before this change, we wouldn't have warned.  By bumping the preempt
count unconditionally, we make it clear that we can't schedule.

For the one exceptional special case, we have ist_begin_non_atomic.

--Andy


* [tip:x86/asm] x86, traps: Fix ist_enter from userspace
  2015-01-31 13:01                         ` [PATCH] x86, traps: Fix ist_enter from userspace Andy Lutomirski
  2015-01-31 15:09                           ` Sasha Levin
  2015-01-31 16:18                           ` Paul E. McKenney
@ 2015-02-04  6:01                           ` " tip-bot for Andy Lutomirski
  2 siblings, 0 replies; 72+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-02-04  6:01 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: luto, mingo, linux-kernel, hpa, tglx, sasha.levin

Commit-ID:  b926e6f61a26036ee9eabe6761483954d481ad25
Gitweb:     http://git.kernel.org/tip/b926e6f61a26036ee9eabe6761483954d481ad25
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Sat, 31 Jan 2015 04:53:53 -0800
Committer:  Andy Lutomirski <luto@amacapital.net>
CommitDate: Sun, 1 Feb 2015 04:02:53 -0800

x86, traps: Fix ist_enter from userspace

context_tracking_user_exit() has no effect if in_interrupt() returns true,
so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
context_tracking_user_exit(), before incrementing the preempt count.

This also adds an assertion that will catch the problem reliably if
CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.

Link: http://lkml.kernel.org/r/261ebee6aee55a4724746d0d7024697013c40a08.1422709102.git.luto@amacapital.net
Fixes: 959274753857 ("x86, traps: Track entry into and exit from IST context")
Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/traps.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7176f84..c74f2f5 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -110,15 +110,11 @@ static inline void preempt_conditional_cli(struct pt_regs *regs)
 
 enum ctx_state ist_enter(struct pt_regs *regs)
 {
-	/*
-	 * We are atomic because we're on the IST stack (or we're on x86_32,
-	 * in which case we still shouldn't schedule.
-	 */
-	preempt_count_add(HARDIRQ_OFFSET);
+	enum ctx_state prev_state;
 
 	if (user_mode_vm(regs)) {
 		/* Other than that, we're just an exception. */
-		return exception_enter();
+		prev_state = exception_enter();
 	} else {
 		/*
 		 * We might have interrupted pretty much anything.  In
@@ -127,12 +123,27 @@ enum ctx_state ist_enter(struct pt_regs *regs)
 		 * but we need to notify RCU.
 		 */
 		rcu_nmi_enter();
-		return IN_KERNEL;  /* the value is irrelevant. */
+		prev_state = IN_KERNEL;  /* the value is irrelevant. */
 	}
+
+	/*
+	 * We are atomic because we're on the IST stack (or we're on x86_32,
+	 * in which case we still shouldn't schedule).
+	 *
+	 * This must be after exception_enter(), because exception_enter()
+	 * won't do anything if in_interrupt() returns true.
+	 */
+	preempt_count_add(HARDIRQ_OFFSET);
+
+	/* This code is a bit fragile.  Test it. */
+	rcu_lockdep_assert(rcu_is_watching(), "ist_enter didn't work");
+
+	return prev_state;
 }
 
 void ist_exit(struct pt_regs *regs, enum ctx_state prev_state)
 {
+	/* Must be before exception_exit. */
 	preempt_count_sub(HARDIRQ_OFFSET);
 
 	if (user_mode_vm(regs))


Thread overview: 72+ messages
2014-11-21 21:26 [PATCH v4 0/5] x86: Rework IST interrupts Andy Lutomirski
2014-11-21 21:26 ` [PATCH v4 1/5] uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME Andy Lutomirski
2014-11-22 16:55   ` Borislav Petkov
2014-11-24 17:58     ` Andy Lutomirski
2014-11-21 21:26 ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Andy Lutomirski
2014-11-21 21:32   ` Andy Lutomirski
2014-11-21 22:07     ` Paul E. McKenney
2014-11-21 22:19       ` Andy Lutomirski
2014-11-21 22:55         ` Paul E. McKenney
2014-11-21 23:06           ` Andy Lutomirski
2014-11-21 23:38             ` Paul E. McKenney
2014-11-22  2:00               ` Andy Lutomirski
2014-11-22  4:20                 ` Paul E. McKenney
2014-11-22  5:53                   ` Andy Lutomirski
2014-11-22 23:41                     ` Paul E. McKenney
2014-11-24 20:22                       ` Andy Lutomirski
2014-11-24 20:54                         ` Paul E. McKenney
2014-11-24 21:02                           ` Andy Lutomirski
2014-11-24 21:35                             ` Paul E. McKenney
2014-11-24 22:34                               ` Paul E. McKenney
2014-11-24 22:36                                 ` Andy Lutomirski
2014-11-24 22:57                                   ` Paul E. McKenney
2014-11-24 23:31                                     ` Paul E. McKenney
2014-11-24 23:35                                       ` Andy Lutomirski
2014-11-24 23:50                                         ` Paul E. McKenney
2014-11-24 23:52                                           ` Andy Lutomirski
2014-11-25 18:58                                             ` Borislav Petkov
2014-11-25 19:16                                               ` Paul E. McKenney
2014-12-11  0:22                                               ` Tony Luck
2014-12-11  0:24                                                 ` Andy Lutomirski
2015-01-05 21:46                                                   ` Tony Luck
2015-01-05 21:54                                                     ` Andy Lutomirski
2015-01-06  0:44                                                       ` [PATCH] x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks Luck, Tony
2015-01-06  1:01                                                         ` Andy Lutomirski
2015-01-06 18:00                                                           ` Luck, Tony
2015-01-07 12:13                                                             ` Borislav Petkov
2015-01-07 15:51                                                               ` Andy Lutomirski
2015-01-07 15:58                                                                 ` Borislav Petkov
2015-01-07 16:12                                                                 ` Paul E. McKenney
2014-11-25 17:13                                           ` [PATCH v4 2/5] x86, traps: Track entry into and exit from IST context Paul E. McKenney
2014-11-27  7:03                                           ` Lai Jiangshan
2014-11-27 16:46                                             ` Paul E. McKenney
2014-11-24 21:27                           ` Paul E. McKenney
2014-11-21 22:20       ` Frederic Weisbecker
2014-11-21 22:00   ` Paul E. McKenney
2014-11-22 17:20   ` Borislav Petkov
2014-11-24 19:48     ` Andy Lutomirski
2015-01-22 21:52   ` Sasha Levin
2015-01-23 17:58     ` Andy Lutomirski
2015-01-23 18:04       ` Borislav Petkov
2015-01-23 18:34         ` Andy Lutomirski
2015-01-23 20:48           ` Sasha Levin
2015-01-24  1:25             ` Andy Lutomirski
2015-01-28 16:33               ` Andy Lutomirski
2015-01-28 17:48                 ` Paul E. McKenney
2015-01-28 21:02                   ` Andy Lutomirski
2015-01-30 19:57                     ` Sasha Levin
2015-01-31  1:28                       ` Sasha Levin
2015-01-31  3:12                         ` Andy Lutomirski
2015-01-31 12:50                           ` Andy Lutomirski
2015-01-31 13:01                         ` [PATCH] x86, traps: Fix ist_enter from userspace Andy Lutomirski
2015-01-31 15:09                           ` Sasha Levin
2015-01-31 16:18                           ` Paul E. McKenney
2015-02-01  2:17                             ` Andy Lutomirski
2015-02-04  6:01                           ` [tip:x86/asm] " tip-bot for Andy Lutomirski
2014-11-21 21:26 ` [PATCH v4 3/5] x86, entry: Switch stacks on a paranoid entry " Andy Lutomirski
2014-11-24 15:55   ` Borislav Petkov
2014-11-21 21:26 ` [PATCH v4 4/5] x86: Clean up current_stack_pointer Andy Lutomirski
2014-11-24 11:39   ` Borislav Petkov
2014-11-21 21:26 ` [PATCH v4 5/5] x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic Andy Lutomirski
2014-11-24 15:54   ` Borislav Petkov
2014-11-24 19:52     ` Andy Lutomirski
