* [RFC][PATCH 0/3] x86: Find a way to allow breakpoints in NMIs
@ 2011-12-08 19:30 Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 1/3] x86: Do not schedule while still in NMI context Steven Rostedt
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-08 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers


As has been previously discussed on LKML, x86 has an issue with NMIs
and iret. If an NMI takes an exception or breakpoint, the iret that
those handlers execute will take the CPU out of NMI context, which
could allow another NMI to come in and corrupt the stack.

Linus has discussed a way to solve this and I tried to implement it.
What I came up with was a bit different, but I'm posting it now to get
some feedback on it. See patch 3. I wrote a very detailed change log
there and won't repeat it here.

Let me know what everyone thinks.

Thanks!

-- Steve

---


Linus Torvalds (1):
      x86: Do not schedule while still in NMI context

Steven Rostedt (2):
      x86: Document the NMI handler about not using paranoid_exit
      x86: Add workaround to NMI iret woes

----
 arch/x86/kernel/entry_64.S |  209 +++++++++++++++++++++++++++++++++++++-------
 1 files changed, 176 insertions(+), 33 deletions(-)


* [RFC][PATCH 1/3] x86: Do not schedule while still in NMI context
  2011-12-08 19:30 [RFC][PATCH 0/3] x86: Find a way to allow breakpoints in NMIs Steven Rostedt
@ 2011-12-08 19:30 ` Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 2/3] x86: Document the NMI handler about not using paranoid_exit Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes Steven Rostedt
  2 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-08 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, Andi Kleen, Ingo Molnar, H. Peter Anvin,
	Paul Turner


From: Linus Torvalds <torvalds@linux-foundation.org>

The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.

As NMIs cannot be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after setting NEED_RESCHED
would have to wait for the NMI to finish anyway, and after the IPI
finishes, schedule would be called as required.

There is no reason to do anything special when leaving an NMI. Remove
the call to paranoid_exit and do a simple return. This not only fixes
the bug of starved NMIs, but it also cleans up the code.
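
For reference, here is a rough C-level model of the control flow the
deleted assembly implemented (every helper and flag value below is an
illustrative stand-in, not the kernel's actual definition):

#define _TIF_WORK_MASK     0x0000ffff	/* illustrative values only */
#define _TIF_NEED_RESCHED  0x00000008

extern int returning_to_user(void);
extern unsigned int thread_flags(void);
extern void schedule(void);
extern void do_notify_resume(void);

static void old_nmi_exit_model(void)
{
	if (returning_to_user()) {
		unsigned int flags;

		while ((flags = thread_flags()) & _TIF_WORK_MASK) {
			/*
			 * The iret has not run yet, so we are still in
			 * NMI context here and new NMIs are starved.
			 */
			if (flags & _TIF_NEED_RESCHED)
				schedule();
			else
				do_notify_resume();
		}
	}
	/* the real code does the iret here */
}

The patch below deletes that loop entirely: the NMI exit path now just
restores registers and executes the iret.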

Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com

Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/kernel/entry_64.S |   32 --------------------------------
 1 files changed, 0 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index faf8d5e..3819ea9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1489,46 +1489,14 @@ ENTRY(nmi)
 	movq %rsp,%rdi
 	movq $-1,%rsi
 	call do_nmi
-#ifdef CONFIG_TRACE_IRQFLAGS
-	/* paranoidexit; without TRACE_IRQS_OFF */
-	/* ebx:	no swapgs flag */
-	DISABLE_INTERRUPTS(CLBR_NONE)
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
-	testl $3,CS(%rsp)
-	jnz nmi_userspace
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
 	jmp irq_return
-nmi_userspace:
-	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz nmi_swapgs
-	movq %rsp,%rdi			/* &pt_regs */
-	call sync_regs
-	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
-	jnz nmi_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	xorl %esi,%esi 			/* arg2: oldset */
-	movq %rsp,%rdi 			/* arg1: &pt_regs */
-	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	jmp nmi_userspace
-nmi_schedule:
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	call schedule
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	jmp nmi_userspace
 	CFI_ENDPROC
-#else
-	jmp paranoid_exit
-	CFI_ENDPROC
-#endif
 END(nmi)
 
 ENTRY(ignore_sysret)
-- 
1.7.7.3




* [RFC][PATCH 2/3] x86: Document the NMI handler about not using paranoid_exit
  2011-12-08 19:30 [RFC][PATCH 0/3] x86: Find a way to allow breakpoints in NMIs Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 1/3] x86: Do not schedule while still in NMI context Steven Rostedt
@ 2011-12-08 19:30 ` Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes Steven Rostedt
  2 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-08 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, Andi Kleen


From: Steven Rostedt <srostedt@redhat.com>

Linus cleaned up the NMI handler, but it still needs some comments to
explain why it uses save_paranoid but not paranoid_exit. To keep
others from adding paranoid_exit back in the future, document why it
is not used.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/entry_64.S |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3819ea9..d1d5434 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1480,9 +1480,16 @@ END(error_exit)
 ENTRY(nmi)
 	INTR_FRAME
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
-	pushq_cfi $-1
+	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	/*
+	 * Use save_paranoid to handle SWAPGS, but there is no need to use
+	 * paranoid_exit, as we should not be calling schedule in NMI
+	 * context, even with normal interrupts enabled. An NMI should not
+	 * be setting NEED_RESCHED or doing anything else that normal
+	 * interrupts and exceptions might do.
+	 */
 	call save_paranoid
 	DEFAULT_FRAME 0
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
-- 
1.7.7.3




* [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-08 19:30 [RFC][PATCH 0/3] x86: Find a way to allow breakpoints in NMIs Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 1/3] x86: Do not schedule while still in NMI context Steven Rostedt
  2011-12-08 19:30 ` [RFC][PATCH 2/3] x86: Document the NMI handler about not using paranoid_exit Steven Rostedt
@ 2011-12-08 19:30 ` Steven Rostedt
  2011-12-08 19:36   ` Steven Rostedt
  2 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2011-12-08 19:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner


From: Steven Rostedt <srostedt@redhat.com>

In x86, when an NMI goes off, the CPU goes into an NMI context that
prevents other NMIs from triggering on that CPU. If another NMI is
supposed to trigger, it has to wait until the previous NMI leaves NMI
context. At that point, the next NMI can trigger (note, only one more
NMI will trigger, as only one can be latched at a time).

The way x86 gets out of NMI context is by executing an iret. The
problem is that this causes trouble if the NMI handler either triggers
an exception or hits a breakpoint, because both the exception and the
breakpoint handlers finish with an iret. If that iret happens while
still in NMI context, the CPU will leave NMI context and a new NMI may
come in. As NMI handlers are not made to be re-entrant, this can cause
havoc with the system; not to mention, the nested NMI will write all
over the previous NMI's stack.

Linus Torvalds proposed the following workaround to this problem:

https://lkml.org/lkml/2010/7/14/264

"In fact, I wonder if we couldn't just do a software NMI disable
instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like

 NMI entry:
 - load percpu NMI stack frame pointer
 - if non-zero we know we're nested, and should ignore this NMI:
    - we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
    - before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
  - set the NMI stack pointer to the current stack pointer

 NMI exit (not the above "immediate exit because we nested"):
   clear the percpu NMI stack pointer
   Just do the iret.

Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.

And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy."
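
To make the quoted scheme concrete, here is a C-flavored sketch of it
(every name below is made up for illustration; this is not the code
that ended up in the patch):

struct frame;				/* the hardware iret frame */

extern struct frame *nmi_frame;		/* per-cpu in the proposal */
extern struct frame *current_frame(void);
extern void make_frame_fault(struct frame *f);	/* outer iret will fault */
extern void popf_ret(void);		/* return, keeping NMIs masked */
extern void handle_nmi(void);

void nmi_model(void)
{
	if (nmi_frame) {			/* non-zero: we are nested */
		make_frame_fault(nmi_frame);	/* redo the NMI via the fault */
		popf_ret();			/* no iret: NMIs stay masked */
		return;
	}
	nmi_frame = current_frame();
	handle_nmi();
	nmi_frame = 0;
	/* iret: unmasks NMIs, or faults and replays if we were nested */
}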

I first tried to follow this advice, but as I started implementing the
code, a few gotchas showed up.

One is accessing per-cpu variables in the NMI handler.

The problem is that per-cpu variables use the %gs register to get the
variable for the given CPU. But as the NMI may happen in userspace, we
must first perform a SWAPGS to get to it. The NMI handler already does
this later in the code, but that is too late: by then we have saved
off all the registers, and we don't want to do that for a disabled
NMI.

Peter Zijlstra suggested keeping all the variables on the stack. This
simplifies things greatly and has the added benefit of cache locality.
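
To make the gotcha concrete: a per-cpu read on x86_64 ends up as a
%gs-relative access, roughly like this sketch (this is not the actual
kernel percpu macro expansion):

static inline unsigned long gs_relative_read(unsigned long offset)
{
	unsigned long val;

	/* read 8 bytes at the %gs base plus offset */
	asm volatile("movq %%gs:(%1), %0" : "=r" (val) : "r" (offset));
	return val;
}

If something like this runs after an NMI that arrived in userspace but
before the SWAPGS, %gs still carries the user value and the read goes
through the wrong base.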

Two, faulting on the iret.

I really wanted to make this work, but it was becoming very hacky and
I never got it to be stable. The iret path already had a fault handler
for userspace faults with bad segment registers, and getting the NMI
to trigger a fault there and detect it was very tricky. For reasons I
never figured out, the system would usually take a double fault and
crash, so I decided to go with a simple "jmp" approach instead. The
new approach also simplified things.

Finally, the last problem with Linus's approach was having the nested
NMI handler do a ret instead of an iret, to give the first NMI its NMI
context back.

The problem is that a ret is much more limited than an iret. I
couldn't figure out how to get the stack back where it belonged. I
could have copied the current stack and pushed the return address onto
it, but my fear here is that there may be some place that writes data
below the stack pointer. I know that is not something code should
depend on, but I don't want to chance it. I may add this feature
later, but for now, an NMI handler that loses NMI context will not get
it back.

Here's what is done:

When an NMI comes in, the HW pushes the interrupt stack frame onto the
per-CPU NMI stack that is selected by the IST.

A special location on the NMI stack holds a variable that is set when
the first NMI handler runs. If this variable is set, then we know that
this is a nested NMI and we process the nested NMI code.

There is still a race when this variable is cleared and an NMI comes
in just before the first NMI does the return. For this case, if the
variable is cleared, we also check whether the interrupted stack is
the NMI stack. If it is, then we process the nested NMI code.

Why the two tests and not just test the interrupted stack?

Suppose the first NMI hits a breakpoint and loses NMI context, then
hits another breakpoint, and while processing that breakpoint we get a
nested NMI. When processing a breakpoint, the stack changes to the
breakpoint stack, so if another NMI comes in here we can't rely on the
interrupted stack being the NMI stack.
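
The combined check can be modeled in C like this (a minimal sketch;
the 4096-byte stack size matches the test_in_nmi macro in the patch
below, the names are made up):

static int is_nested_nmi(unsigned long interrupted_sp,
			 unsigned long nmi_stack_top, int nmi_executing)
{
	/* Test 1: the "NMI executing" variable set by the first NMI. */
	if (nmi_executing)
		return 1;

	/*
	 * Test 2: did we interrupt code already running on the NMI
	 * stack?  This catches the race where the first NMI has
	 * cleared the variable but not yet finished its iret.
	 */
	if (interrupted_sp <= nmi_stack_top &&
	    interrupted_sp > nmi_stack_top - 4096)
		return 1;

	return 0;
}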

If the variable is not set and the interrupted task's stack is not the
NMI stack, then we know this is the first NMI and we can process things
normally. But in order to do so, we need to do a few things first.

1) Set the stack variable that tells us that we are in an NMI handler

2) Make two copies of the interrupt stack frame.
   One copy is used to return on iret.
   The other is used to restore the first one if we have a nested NMI.

This is what the stack will look like:

	  +-------------------------+
	  | original SS             |
	  | original Return RSP     |
	  | original RFLAGS         |
	  | original CS             |
	  | original RIP            |
	  +-------------------------+
	  | temp storage for rdx    |
	  +-------------------------+
	  | NMI executing variable  |
	  +-------------------------+
	  | Saved SS                |
	  | Saved Return RSP        |
	  | Saved RFLAGS            |
	  | Saved CS                |
	  | Saved RIP               |
	  +-------------------------+
	  | copied SS               |
	  | copied Return RSP       |
	  | copied RFLAGS           |
	  | copied CS               |
	  | copied RIP              |
	  +-------------------------+
	  | pt_regs                 |
	  +-------------------------+

The original stack frame contains what the HW put in when we entered
the NMI.

We store %rdx as a temp variable to use. Both the original HW stack
frame and this %rdx storage will be clobbered by nested NMIs, so we
cannot rely on them.

The next item is the special stack variable that is set when we execute
the rest of the NMI handler.

Then we have two copies of the interrupt stack frame. The second copy
is modified by any nested NMIs to let the first NMI know that a second
NMI triggered (and was latched), and that we should repeat the NMI
handler.

If the first NMI hits an exception or breakpoint that takes it out of
NMI context, and a second NMI comes in before the first one finishes,
the second NMI will update the copied interrupt stack frame to point
to a fixup location that triggers another NMI.

When the first NMI executes its iret, it will instead jump to the
fixup location. This fixup location will copy the saved interrupt
stack frame back over the copy and execute the NMI handler again.

Note, the nested NMI knows enough to check whether it preempted a
previous NMI handler while that handler is in the fixup location. If
it has, it will not modify the copied interrupt stack frame and will
just leave as if nothing happened. As the NMI handler is about to
execute again, there's no reason to latch now.
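
A tiny C model of that fixup dance (the struct and enum here are
illustrative; the real code manipulates raw words on the NMI stack):

enum rip_target { REAL_RETURN, REPEAT_NMI };

struct nmi_slots {
	enum rip_target saved_rip;	/* pristine copy, never modified */
	enum rip_target copied_rip;	/* the frame the iret goes through */
	int nmi_executing;
};

/* Nested NMI: runs with NMIs masked, so editing the slots is safe. */
static void nested_nmi_model(struct nmi_slots *s)
{
	/* the first NMI's "iret" will now land in repeat_nmi */
	s->copied_rip = REPEAT_NMI;
}

/* repeat_nmi: the fixup location the first NMI jumps to. */
static void repeat_nmi_model(struct nmi_slots *s)
{
	s->nmi_executing = 1;		/* mark that we are still in NMI */
	s->copied_rip = s->saved_rip;	/* restore the copy from the save */
	/* ... re-run the NMI handler, then iret through the copy ... */
}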

To test all this, I forced the NMI handler to call iret and take
itself out of NMI context. I also added assembly code that writes to
the serial port to make sure that it hits the nested path as well as
the fixup path. Everything seems to be working fine.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/entry_64.S |  168 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 168 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index d1d5434..b02409e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1475,11 +1475,163 @@ ENTRY(error_exit)
 	CFI_ENDPROC
 END(error_exit)
 
+/*
+ * Test if a given stack is an NMI stack or not.
+ */
+	.macro test_in_nmi reg stack nmi_ret normal_ret
+	cmpq %\reg, \stack
+	ja \normal_ret
+	subq $4096, %\reg
+	cmpq %\reg, \stack
+	jb \normal_ret
+	jmp \nmi_ret
+	.endm
 
 	/* runs on exception stack */
 ENTRY(nmi)
 	INTR_FRAME
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	/*
+	 * We allow breakpoints in NMIs. If a breakpoint occurs, then
+	 * the iretq it performs will take us out of NMI context.
+	 * This means that we can have nested NMIs where the next
+	 * NMI is using the top of the stack of the previous NMI. We
+	 * can't let it execute because the nested NMI will corrupt the
+	 * stack of the previous NMI. NMI handlers are not re-entrant
+	 * anyway.
+	 *
+	 * To handle this case we do the following:
+	 *  Check a special location on the stack that contains
+	 *  a variable that is set when NMIs are executing.
+	 *  The interrupted task's stack is also checked to see if it
+	 *  is an NMI stack.
+	 *  If the variable is not set and the stack is not the NMI
+	 *  stack then:
+	 *    o Set the special variable on the stack
+	 *    o Copy the interrupt frame into a "saved" location on the stack
+	 *    o Copy the interrupt frame into a "copy" location on the stack
+	 *    o Continue processing the NMI
+	 *  If the variable is set or the previous stack is the NMI stack:
+	 *    o Modify the "copy" location to jump to repeat_nmi
+	 *    o Return back to the first NMI
+	 *
+	 * Now, on exit of the first NMI, we first clear the stack variable.
+	 * At that point, the NMI stack itself will tell any nested NMIs
+	 * that they are nested. Then we pop the stack normally with iret,
+	 * and if there was a nested NMI that updated the copied interrupt
+	 * stack frame, a jump will be made to the repeat_nmi code that will
+	 * handle the second NMI.
+	 */
+
+	/* Use %rdx as our temp variable throughout */
+	pushq_cfi %rdx
+
+	/*
+	 * Check the special variable on the stack to see if NMIs are
+	 * executing.
+	 */
+	cmp $1, -8(%rsp)
+	je nested_nmi
+
+	/*
+	 * Now test if the previous stack was an NMI stack.
+	 * We need the double check. We check the NMI stack to satisfy the
+	 * race when the first NMI clears the variable before returning.
+	 * We check the variable because the first NMI could be in a
+	 * breakpoint routine using a breakpoint stack.
+	 */
+	lea 6*8(%rsp), %rdx
+	test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
+
+nested_nmi:
+	/*
+	 * Do nothing if we interrupted the fixup in repeat_nmi.
+	 * It's about to repeat the NMI handler, so we are fine
+	 * with ignoring this one.
+	 */
+	leaq repeat_nmi, %rdx
+	cmpq 8(%rsp), %rdx
+	ja 1f
+	leaq end_repeat_nmi, %rdx
+	cmpq 8(%rsp), %rdx
+	ja nested_nmi_out
+
+1:
+	/* Set up the interrupted NMI's stack to jump to repeat_nmi */
+	leaq -6*8(%rsp), %rdi
+	movq %rdi, %rsp
+	pushq $__KERNEL_DS
+	pushq %rdi
+	pushfq
+	pushq $__KERNEL_CS
+	leaq repeat_nmi, %rdx
+	pushq_cfi %rdx
+
+	/* Put stack back */
+	addq $(11*8), %rsp
+
+nested_nmi_out:
+	popq_cfi %rdx
+
+	/* No need to check faults here */
+	INTERRUPT_RETURN
+
+first_nmi:
+	/*
+	 * Because nested NMIs will use the pushed location where we
+	 * stored rdx, we must keep that space available.
+	 * Here's what our stack frame will look like:
+	 * +-------------------------+
+	 * | original SS             |
+	 * | original Return RSP     |
+	 * | original RFLAGS         |
+	 * | original CS             |
+	 * | original RIP            |
+	 * +-------------------------+
+	 * | temp storage for rdx    |
+	 * +-------------------------+
+	 * | NMI executing variable  |
+	 * +-------------------------+
+	 * | Saved SS                |
+	 * | Saved Return RSP        |
+	 * | Saved RFLAGS            |
+	 * | Saved CS                |
+	 * | Saved RIP               |
+	 * +-------------------------+
+	 * | copied SS               |
+	 * | copied Return RSP       |
+	 * | copied RFLAGS           |
+	 * | copied CS               |
+	 * | copied RIP              |
+	 * +-------------------------+
+	 * | pt_regs                 |
+	 * +-------------------------+
+	 *
+	 * The saved RIP is used to fix up the copied RIP that a nested
+	 * NMI may zero out. The original stack frame and the temp storage
+	 * are also used by nested NMIs and cannot be trusted on exit.
+	 */
+	/* Set the NMI executing variable on the stack. */
+	pushq_cfi $1
+
+	/* Copy the stack frame to the Saved frame */
+	pushq_cfi 6*8(%rsp)
+	pushq_cfi 6*8(%rsp)
+	pushq_cfi 6*8(%rsp)
+	pushq_cfi 6*8(%rsp)
+	pushq_cfi 6*8(%rsp)
+
+	/* Make another copy, this one may be modified by nested NMIs */
+	pushq_cfi 4*8(%rsp)
+	pushq_cfi 4*8(%rsp)
+	pushq_cfi 4*8(%rsp)
+	pushq_cfi 4*8(%rsp)
+	pushq_cfi 4*8(%rsp)
+
+	/* Do not pop rdx, nested NMIs will corrupt it */
+	movq 11*8(%rsp), %rdx
+
+restart_nmi:
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
@@ -1502,10 +1654,26 @@ nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
+	/* Clear the NMI executing stack variable */
+	movq $0, 10*8(%rsp)
 	jmp irq_return
 	CFI_ENDPROC
 END(nmi)
 
+repeat_nmi:
+	/* Update the stack variable to say we are still in NMI */
+	movq $1, 5*8(%rsp)
+
+	/* copy the saved stack back to copy stack */
+	pushq 4*8(%rsp)
+	pushq 4*8(%rsp)
+	pushq 4*8(%rsp)
+	pushq 4*8(%rsp)
+	pushq 4*8(%rsp)
+
+	jmp restart_nmi
+end_repeat_nmi:
+
 ENTRY(ignore_sysret)
 	CFI_STARTPROC
 	mov $-ENOSYS,%eax
-- 
1.7.7.3




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-08 19:30 ` [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes Steven Rostedt
@ 2011-12-08 19:36   ` Steven Rostedt
  2011-12-09  2:43     ` Steven Rostedt
  2011-12-09 12:40     ` Mathieu Desnoyers
  0 siblings, 2 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-08 19:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Thu, 2011-12-08 at 14:30 -0500, Steven Rostedt wrote:

> If the first NMI hits a breakpoint and loses NMI context, and then it
> hits another breakpoint and while processing that breakpoint we get a
> nested NMI. When processing a breakpoint, the stack changes to the
> breakpoint stack. If another NMI comes in here we can't rely on the
> interrupted stack to be the NMI stack. 

As I wrote this part of the change log, I thought of another nasty
gotcha with breakpoints in NMIs.

If you have a breakpoint in both normal context and NMI context, then
while the breakpoint is being processed, an NMI may come in and it too
may trigger a breakpoint. That breakpoint processing has the same
problem as nested NMIs: the NMI's breakpoint handler will corrupt the
stack of the breakpoint that was being processed when the NMI
triggered.

I'm not sure how to handle this case. We could do something similar in
the breakpoint code to handle the same thing, but this just seems
really ugly.

Anyone with any better ideas?

-- Steve




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-08 19:36   ` Steven Rostedt
@ 2011-12-09  2:43     ` Steven Rostedt
  2011-12-09  9:22       ` Peter Zijlstra
                         ` (3 more replies)
  2011-12-09 12:40     ` Mathieu Desnoyers
  1 sibling, 4 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09  2:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Thu, 2011-12-08 at 14:36 -0500, Steven Rostedt wrote:

> I'm not sure how to handle this case. We could do something similar in
> the break point code to handle the same thing. But this just seems
> really ugly.
> 
> Anyone with any better ideas?

On IRC, Peter Zijlstra mentioned changing the IDT in NMI. I'm not sure
if he was joking or not. But I decided to try it out. It seems to
work :)


What's nice is that this change is in the C code, which makes things
much easier.

On NMI entry, I check whether the NMI preempted code running on the
debug stack. If it has, then I load a special "nmi_idt_table" that has
a zero IST for the debug stack, which makes a debug interrupt taken in
the NMI not change the stack pointer. Then on exit, the NMI restores
the old IDT table.

Just to see how bad this was, I made it do the switch on every NMI. I
didn't notice any difference, but I wasn't doing any real benchmark,
other than running a perf top -c 1000.

But in practice, it should seldom hit, if ever. It only updates the IDT
if it interrupted debug processing.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 41935fa..e95822d 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -35,6 +35,8 @@ static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *in
 
 extern struct desc_ptr idt_descr;
 extern gate_desc idt_table[];
+extern struct desc_ptr nmi_idt_descr;
+extern gate_desc nmi_idt_table[];
 
 struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
@@ -307,6 +309,16 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit)
 	desc->limit = (limit >> 16) & 0xf;
 }
 
+#ifdef CONFIG_X86_64
+static inline void set_nmi_gate(int gate, void *addr)
+{
+	gate_desc s;
+
+	pack_gate(&s, GATE_INTERRUPT, (unsigned long)addr, 0, 0, __KERNEL_CS);
+	write_idt_entry(nmi_idt_table, gate, &s);
+}
+#endif
+
 static inline void _set_gate(int gate, unsigned type, void *addr,
 			     unsigned dpl, unsigned ist, unsigned seg)
 {
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b650435..d748d1f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -402,6 +402,9 @@ DECLARE_PER_CPU(char *, irq_stack_ptr);
 DECLARE_PER_CPU(unsigned int, irq_count);
 extern unsigned long kernel_eflags;
 extern asmlinkage void ignore_sysret(void);
+int is_debug_stack(unsigned long addr);
+void zero_debug_stack(void);
+void reset_debug_stack(void);
 #else	/* X86_64 */
 #ifdef CONFIG_CC_STACKPROTECTOR
 /*
@@ -416,6 +419,9 @@ struct stack_canary {
 };
 DECLARE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
 #endif
+static inline int is_debug_stack(unsigned long addr) { return 0; }
+static inline void zero_debug_stack(void) { }
+static inline void reset_debug_stack(void) { }
 #endif	/* X86_64 */
 
 extern unsigned int xstate_size;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index aa003b1..11a86d5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1026,6 +1026,8 @@ __setup("clearcpuid=", setup_disablecpuid);
 
 #ifdef CONFIG_X86_64
 struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
+struct desc_ptr nmi_idt_descr = { NR_VECTORS * 16 - 1,
+				    (unsigned long) nmi_idt_table };
 
 DEFINE_PER_CPU_FIRST(union irq_stack_union,
 		     irq_stack_union) __aligned(PAGE_SIZE);
@@ -1090,6 +1092,24 @@ unsigned long kernel_eflags;
  */
 DEFINE_PER_CPU(struct orig_ist, orig_ist);
 
+static DEFINE_PER_CPU(unsigned long, debug_stack_addr);
+
+int is_debug_stack(unsigned long addr)
+{
+	return addr <= __get_cpu_var(debug_stack_addr) &&
+		addr > (__get_cpu_var(debug_stack_addr) - 4096);
+}
+
+void zero_debug_stack(void)
+{
+	load_idt((const struct desc_ptr *)&nmi_idt_descr);
+}
+
+void reset_debug_stack(void)
+{
+	load_idt((const struct desc_ptr *)&idt_descr);
+}
+
 #else	/* CONFIG_X86_64 */
 
 DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
@@ -1208,6 +1228,8 @@ void __cpuinit cpu_init(void)
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
+			if (v == DEBUG_STACK)
+				per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
 		}
 	}
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index e11e394..40f4eb3 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -417,6 +417,10 @@ ENTRY(phys_base)
 ENTRY(idt_table)
 	.skip IDT_ENTRIES * 16
 
+	.align L1_CACHE_BYTES
+ENTRY(nmi_idt_table)
+	.skip IDT_ENTRIES * 16
+
 	__PAGE_ALIGNED_BSS
 	.align PAGE_SIZE
 ENTRY(empty_zero_page)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index b9c8628..d35f554 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -407,13 +407,29 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 dotraplinkage notrace __kprobes void
 do_nmi(struct pt_regs *regs, long error_code)
 {
+	int update_debug_stack = 0;
+
 	nmi_enter();
 
+	/*
+	 * If we interrupted a breakpoint, it is possible that
+	 * the nmi handler will have breakpoints too. We need to
+	 * change the IDT such that breakpoints that happen here
+	 * continue to use the NMI stack.
+	 */
+	if (unlikely(is_debug_stack(regs->sp))) {
+		zero_debug_stack();
+		update_debug_stack = 1;
+	}
+
 	inc_irq_stat(__nmi_count);
 
 	if (!ignore_nmis)
 		default_do_nmi(regs);
 
+	if (unlikely(update_debug_stack))
+		reset_debug_stack();
+
 	nmi_exit();
 }
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a8e3eb8..906a02a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -723,4 +723,9 @@ void __init trap_init(void)
 	cpu_init();
 
 	x86_init.irqs.trap_init();
+
+#ifdef CONFIG_X86_64
+	memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
+	set_nmi_gate(1, &debug);
+#endif
 }




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09  2:43     ` Steven Rostedt
@ 2011-12-09  9:22       ` Peter Zijlstra
  2011-12-09 15:00         ` Steven Rostedt
  2011-12-09 15:20       ` Steven Rostedt
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2011-12-09  9:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Thu, 2011-12-08 at 21:43 -0500, Steven Rostedt wrote:
> On Thu, 2011-12-08 at 14:36 -0500, Steven Rostedt wrote:
> 
> > I'm not sure how to handle this case. We could do something similar in
> > the break point code to handle the same thing. But this just seems
> > really ugly.
> > 
> > Anyone with any better ideas?
> 
> On IRC, Peter Zijlstra mentioned changing the IDT in NMI. I'm not sure
> if he was joking or not. But I decided to try it out. It seems to
> work :)

It was definitely tongue in cheek; also, I did say this'll be a
massive pain with paravirt, since I doubt paravirt calls are NMI-safe.

But yeah, it might all be slightly less painful than trying to teach the
INT3 handler about this recursion.

This all started with wanting to do pagefaults from NMI context, those
too will have this recursion, although for faults we'll end up in the
double fault handler, which seems to allow a slightly saner way out.


* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-08 19:36   ` Steven Rostedt
  2011-12-09  2:43     ` Steven Rostedt
@ 2011-12-09 12:40     ` Mathieu Desnoyers
  2011-12-09 13:02       ` Mathieu Desnoyers
  1 sibling, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2011-12-09 12:40 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Jason Baron, H. Peter Anvin, Paul Turner

Hi Steven,

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Thu, 2011-12-08 at 14:30 -0500, Steven Rostedt wrote:
> 
> > If the first NMI hits a breakpoint and loses NMI context, and then it
> > hits another breakpoint and while processing that breakpoint we get a
> > nested NMI. When processing a breakpoint, the stack changes to the
> > breakpoint stack. If another NMI comes in here we can't rely on the
> > interrupted stack to be the NMI stack. 
> 
> As I wrote this part of the change log, I thought of another nasty
> gotcha with breakpoints in NMIs.
> 
> If you have a breakpoint in both normal context and NMI context. When
> the breakpoint is being processed, if an NMI comes in and it too
> triggers a breakpoint, this processing of the breakpoint has the same
> problem as nested NMIs. The NMI breakpoint handler will corrupt the
> stack of the breakpoint that was being processed when the NMI triggered.
> 
> I'm not sure how to handle this case. We could do something similar in
> the break point code to handle the same thing. But this just seems
> really ugly.
> 
> Anyone with any better ideas?

The nesting counters + code region address checks I proposed a few days
ago should handle this correctly. Here is a very slightly updated
version:

variables used:

cpu-local int nmi_nest_count;
cpu-local int nmi_latch;
__nmi_epilogue_begin (pointer to text)
__nmi_epilogue_end (pointer to text)
REAL_NMI_STACK: beginning of the stack used for real nmi handler
LATCHED_NMI_STACK: beginning of the stack used for latched nmi handler

int in_nmi_epilogue(void)
{
  return (instruction_pointer() >= __nmi_epilogue_begin
		&& instruction_pointer() < __nmi_epilogue_end);
}

int in_nmi(void)
{
  return nmi_nest_count > 0;
}

/* Use REAL_NMI_STACK */
real_nmi_handler: /* always running with nmis disabled */
  /*
   * We disable interrupts to ensure we don't have to deal with IRQs
   * when NMIs get re-enabled due to an iret from a fault/exception.
   */
  local_irq_disable();
  if (in_nmi_epilogue()) {
    nmi_latch = 0;
    /* set stack pointer to start of LATCHED_NMI_STACK */
    /* populate start of LATCHED_NMI_STACK with values for iret */
    goto latched_nmi_handler;
  }
  if (in_nmi()) {
     nmi_latch = 1;
     iret
  }
  nmi_nest_count++;
  /* set stack pointer to start of LATCHED_NMI_STACK */
  /* populate start of LATCHED_NMI_STACK with values for iret */
  goto latched_nmi_handler;


/* Use LATCHED_NMI_STACK */
latched_nmi_handler:	/* Can fault and reenable NMIs. */

  [ execute actual system NMI handler, including faults, int3, ... ]

  /*
   * note: test nmi_latch and iret instruction are within the epilogue
   * range to deal with latch test vs iret non-atomicity.  If a real nmi
   * nests over this range, it clears the nmi_latch flag and just
   * restarts the latched nmi handler.  No faults/exceptions/interrupts
   * are permitted in this region, except for the real NMI and MCEs
   * (TODO).
   */
__nmi_epilogue_begin:
  /*
   * here we are restarting the latched nmi handler if an nmi happened
   * while nested within the nmi nest count.
   */
  if (nmi_latch) {
    nmi_latch = 0;
    goto latched_nmi_handler;
  }
  nmi_nest_count--;
  iret  /* restores interrupts */
__nmi_epilogue_end:


Best regards,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 12:40     ` Mathieu Desnoyers
@ 2011-12-09 13:02       ` Mathieu Desnoyers
  2011-12-09 14:49         ` Steven Rostedt
  0 siblings, 1 reply; 21+ messages in thread
From: Mathieu Desnoyers @ 2011-12-09 13:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Jason Baron, H. Peter Anvin, Paul Turner

* Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> Hi Steven,
> 
> * Steven Rostedt (rostedt@goodmis.org) wrote:
> > On Thu, 2011-12-08 at 14:30 -0500, Steven Rostedt wrote:
> > 
> > > If the first NMI hits a breakpoint and loses NMI context, and then it
> > > hits another breakpoint and while processing that breakpoint we get a
> > > nested NMI. When processing a breakpoint, the stack changes to the
> > > breakpoint stack. If another NMI comes in here we can't rely on the
> > > interrupted stack to be the NMI stack. 
> > 
> > As I wrote this part of the change log, I thought of another nasty
> > gotcha with breakpoints in NMIs.
> > 
> > If you have a breakpoint in both normal context and NMI context. When
> > the breakpoint is being processed, if an NMI comes in and it too
> > triggers a breakpoint, this processing of the breakpoint has the same
> > problem as nested NMIs. The NMI breakpoint handler will corrupt the
> > stack of the breakpoint that was being processed when the NMI triggered.
> > 
> > I'm not sure how to handle this case. We could do something similar in
> > the break point code to handle the same thing. But this just seems
> > really ugly.
> > 
> > Anyone with any better ideas?
> 
> The nesting counters + code region address checks I proposed a few days
> ago should handle this correctly. Here is a very slightly updated
> version:

After a quick IRC discussion with Peter Zijlstra, one thing seems to
be missing here to handle the INT3->NMI->INT3 issue: this could be
achieved by splitting the DEBUG stack into 2 sub-stacks, and letting
the int3 handler keep track of its nesting within its own stack with
an extra "int3_nest_count". AFAIU, supporting 2 nested int3s should be
enough.

Thanks,

Mathieu 

> 
> variables used:
> 
> cpu-local int nmi_nest_count;
> cpu-local int nmi_latch;
> __nmi_epilogue_begin (pointer to text)
> __nmi_epilogue_end (pointer to text)
> REAL_NMI_STACK: beginning of the stack used for real nmi handler
> LATCHED_NMI_STACK: beginning of the stack used for latched nmi handler
> 
> int in_nmi_epilogue(void)
> {
>   return (instruction_pointer() >= __nmi_epilogue_begin
> 		&& instruction_pointer() < __nmi_epilogue_end);
> }
> 
> int in_nmi(void)
> {
>   return nmi_nest_count > 0;
> }
> 
> /* Use REAL_NMI_STACK */
> real_nmi_handler: /* always running with nmis disabled */
>   /*
>    * We disable interrupts to ensure we don't have to deal with IRQs
>    * when NMIs get re-enabled due to an iret from a fault/exception.
>    */
>   local_irq_disable();
>   if (in_nmi_epilogue()) {
>     nmi_latch = 0;
>     /* set stack pointer to start of LATCHED_NMI_STACK */
>     /* populate start of LATCHED_NMI_STACK with values for iret */
>     goto latched_nmi_handler;
>   }
>   if (in_nmi()) {
>      nmi_latch = 1;
>      iret
>   }
>   nmi_nest_count++;
>   /* set stack pointer to start of LATCHED_NMI_STACK */
>   /* populate start of LATCHED_NMI_STACK with values for iret */
>   goto latched_nmi_handler;
> 
> 
> /* Use LATCHED_NMI_STACK */
> latched_nmi_handler:	/* Can fault and reenable NMIs. */
> 
>   [ execute actual system NMI handler, including faults, int3, ... ]
> 
>   /*
>    * note: test nmi_latch and iret instruction are within the epilogue
>    * range to deal with latch test vs iret non-atomicity.  If a real nmi
>    * nests over this range, it clears the nmi_latch flag and just
>    * restarts the latched nmi handler.  No faults/exceptions/interrupts
>    * are permitted in this region, except for the real NMI and MCEs
>    * (TODO).
>    */
> __nmi_epilogue_begin:
>   /*
>    * here we are restarting the latched nmi handler if an nmi happened
>    * while nested within the nmi nest count.
>    */
>   if (nmi_latch) {
>     nmi_latch = 0;
>     goto latched_nmi_handler;
>   }
>   nmi_nest_count--;
>   iret  /* restores interrupts */
> __nmi_epilogue_end:
> 
> 
> Best regards,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 13:02       ` Mathieu Desnoyers
@ 2011-12-09 14:49         ` Steven Rostedt
  2011-12-09 15:02           ` Mathieu Desnoyers
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 14:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Jason Baron, H. Peter Anvin, Paul Turner

On Fri, 2011-12-09 at 08:02 -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:

> after a quick IRC discussion with Peter Zijlstra, one thing seems to be
> missing here to handle the INT3->NMI->INT3 issue: this could be achieved
> by splitting the DEBUG stack in 2 sub-stacks, and letting the int3
> handler keep track of its nesting within its own stack with an extra
> "int3_nest_count". AFAIU, supporting 2 nested int3 should be enough.

Here's the problem: when you take an int3, the hardware loads stuff
onto the stack for you. That's the SS, RSP, FLAGS, CS, RIP. If the NMI
comes in while we are processing a breakpoint, and the NMI hits an
int3 too, then the hardware will load the current SS, RSP, FLAGS, CS
and RIP onto the stack at the exact same place where the interrupted
breakpoint processing had its interrupt frame. IOW, it just corrupted
the stack.

To prevent this in the NMI code, I did ugly things like making copies
of the interrupt frame to keep a nested NMI from corrupting the first
NMI. Not only do I not want to do this ugly hack for the debug
exception, you *can't* do it. It won't work!

The reason the NMI case works is that while we are copying the stack
frame, NMIs are disabled, because we are currently in an NMI.

But with a normal int3, if an NMI triggers while it is doing the copy,
and you don't update the IDT, any int3 that the NMI hits will corrupt
the previous int3 processing's stack. The hardware does it; there's
nothing a "split stack" can do to fix that.

-- Steve




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09  9:22       ` Peter Zijlstra
@ 2011-12-09 15:00         ` Steven Rostedt
  2011-12-09 15:10           ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Fri, 2011-12-09 at 10:22 +0100, Peter Zijlstra wrote:

> Its was definitely tongue in cheek, also I did say this'll be a massive
> pain with paravirt since I doubt paravirt calls are NMI safe. 

But does paravirt simulate NMIs? Does a guest ever take an NMI? I can
enable paravirt to see if it breaks.

> 
> But yeah, it might all be slightly less painful than trying to teach the
> INT3 handler about this recursion.

That's my goal. Yes, I agree this is all just an ugly hack, but I'm
trying hard to keep the ugly hack contained in the NMI handling. Once
you start letting this hack spread, it will grow like a virus, and
resistance will be futile. And we will end up with nasty dependencies
that will become a horror show to maintain.

Right now, I've quarantined the NMI code, and I'm keeping the infection
at bay. The NMI code will grow warts and nasty appendages, but the NMI
code was ugly to begin with, and maybe these mutations might even make
it prettier. That's the nature of NMI. It's an ugly beast in all its
incarnations.

This is why I like my patches. They contain the damage to only the NMI
code. If something breaks from this code, it will be easy to find it.
If anything, a bisect should show exactly what change caused the
problem. But if we add hacks in other places, it will be much more
difficult to figure out.

> 
> This all started with wanting to do pagefaults from NMI context, those
> too will have this recursion, although for faults we'll end up in the
> double fault handler, which seems to allow a slightly saner way out.

It didn't start with the pagefaults. That discussion only brought the
issue to LKML. This has been a pain in our side for breakpoints from
day one.

-- Steve




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 14:49         ` Steven Rostedt
@ 2011-12-09 15:02           ` Mathieu Desnoyers
  0 siblings, 0 replies; 21+ messages in thread
From: Mathieu Desnoyers @ 2011-12-09 15:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Jason Baron, H. Peter Anvin, Paul Turner

* Steven Rostedt (rostedt@goodmis.org) wrote:
> On Fri, 2011-12-09 at 08:02 -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> 
> > after a quick IRC discussion with Peter Zijlstra, one thing seems to be
> > missing here to handle the INT3->NMI->INT3 issue: this could be achieved
> > by splitting the DEBUG stack in 2 sub-stacks, and letting the int3
> > handler keep track of its nesting within its own stack with an extra
> > "int3_nest_count". AFAIU, supporting 2 nested int3 should be enough.
> 
> Here's the problem. When you take an int3, the hardware loads stuff onto
> the stack for you. That's the SS, RSP, FLAGS, CS, RIP. If the NMI comes
> in while we are processing a breakpoint, and the NMI hits an int3 too,
> then the hardware will load the current SS, RSP, FLAGS, CS and RIP onto
> the stack at the exact same place as the breakpoint processing that was
> interrupted had it's interrupt frame. IOW, it just corrupted the stack.
> 
> To prevent this in the NMI code, I did ugly things like making copies of
> the interrupt frame to keep a nested NMI from corrupting the first NMI.
> Not only do I not want to do this ugly hack for debug exception, you
> *can't* do it. It wont work!
> 
> The reason the NMI works is because while we are copying the stack
> frame, NMIs are disabled because we are currently in an NMI.
> 
> But a normal int3, as it tries to do the copy and an NMI triggers, if
> you don't update the IDT, any int3 that the NMI hits will corrupt the
> previous int3 processing's stack. The hardware does it, there's nothing
> a "split stack" will do to fix that.

Yep, doing that in the "real" nmi handler (with NMIs disabled at that
point) makes tons of sense.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 15:00         ` Steven Rostedt
@ 2011-12-09 15:10           ` Peter Zijlstra
  2011-12-09 15:25             ` Steven Rostedt
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2011-12-09 15:10 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Fri, 2011-12-09 at 10:00 -0500, Steven Rostedt wrote:
> On Fri, 2011-12-09 at 10:22 +0100, Peter Zijlstra wrote:
> 
> > Its was definitely tongue in cheek, also I did say this'll be a massive
> > pain with paravirt since I doubt paravirt calls are NMI safe. 
> 
> But does paravirt simulate NMIs? Does a guest ever take an NMI? I can
> enable paravirt to see if it breaks. 

KVM does NMIs, and the write_idt thing is a paravirt call. Then again,
a vcpu is a single thread of execution; there can only ever be one
hypercall at a time.

The only thing that remains is whether the hypercall interface is
NMI-safe: what if the NMI comes in just as another context is also
starting a hypercall? I really don't know enough about all that crap
to know.




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09  2:43     ` Steven Rostedt
  2011-12-09  9:22       ` Peter Zijlstra
@ 2011-12-09 15:20       ` Steven Rostedt
  2011-12-09 16:34       ` Steven Rostedt
  2011-12-09 16:49       ` Jason Baron
  3 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 15:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Thu, 2011-12-08 at 21:43 -0500, Steven Rostedt wrote:

> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index b9c8628..d35f554 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -407,13 +407,29 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
>  dotraplinkage notrace __kprobes void
>  do_nmi(struct pt_regs *regs, long error_code)
>  {
> +	int update_debug_stack = 0;
> +
>  	nmi_enter();
>  
> +	/*
> +	 * If we interrupted a breakpoint, it is possible that
> +	 * the nmi handler will have breakpoints too. We need to
> +	 * change the IDT such that breakpoints that happen here
> +	 * continue to use the NMI stack.
> +	 */
> +	if (unlikely(is_debug_stack(regs->sp))) {

For kicks, I made the update happen on every NMI, but also changed the
IDT entry of the NMI itself to not change the stack. Then I was able
to move a lot of the hacks from assembly into C code, just before the
nmi_enter() above. It seems that can work too. I didn't finish it; I
just wanted to see if it was doable. But I'm not sure we want to
change the IDT (twice) for all NMIs.

-- Steve


> +		zero_debug_stack();
> +		update_debug_stack = 1;
> +	}
> +	
>  	inc_irq_stat(__nmi_count);
>  
>  	if (!ignore_nmis)
>  		default_do_nmi(regs);
>  
> +	if (unlikely(update_debug_stack))
> +		reset_debug_stack();
> +
>  	nmi_exit();
>  }




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 15:10           ` Peter Zijlstra
@ 2011-12-09 15:25             ` Steven Rostedt
  0 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Fri, 2011-12-09 at 16:10 +0100, Peter Zijlstra wrote:

> KVM does NMIs, the write_idt thing is a paravirt call. Then again, a
> vcpu is a single thread of execution, there can only ever be 1 hypercall
> at the same time.
> 
> The only thing that remains is if the hypercall interface is NMI-safe,
> what if the NMI comes in just as another context is also starting a
> hypercall. I really don't know enough about all that crap to know.

Hmm, well, we can detect whether we are a paravirt guest or not
(that's the nature of paravirt). We can keep the current methods
around, and for paravirt guests, it will fall back to the old
stop_machine() method. Heck, guests have issues with latencies anyway.

For running on bare metal, we can switch to this method.

-- Steve





* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09  2:43     ` Steven Rostedt
  2011-12-09  9:22       ` Peter Zijlstra
  2011-12-09 15:20       ` Steven Rostedt
@ 2011-12-09 16:34       ` Steven Rostedt
  2011-12-09 17:19         ` Steven Rostedt
  2011-12-09 16:49       ` Jason Baron
  3 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 16:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner

On Thu, 2011-12-08 at 21:43 -0500, Steven Rostedt wrote:

> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a8e3eb8..906a02a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -723,4 +723,9 @@ void __init trap_init(void)
>  	cpu_init();
>  
>  	x86_init.irqs.trap_init();
> +
> +#ifdef CONFIG_X86_64
> +	memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
> +	set_nmi_gate(1, &debug);

Frederic Weisbecker told me on IRC that int3 is vector 3 (#BP), not
vector 1 (#DB). I need to also add:

	set_nmi_gate(3, &int3);

Mathieu says we need to worry about MCEs, so maybe we can add that stack
as well.

	set_nmi_gate(18, &machine_check);


If we make NMIs not modify any stack, then we can remove the "NMI
executing variable" on the stack, as any nested NMI will see that it
preempted an NMI just by checking the stack. We have to check the
stack anyway, and removing the extra check may be a good thing to do.
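
Putting those pieces together, the trap_init() hunk would grow to
roughly this (a sketch only; whether the #MC gate is needed at all
depends on whether MCEs can really nest with NMIs):

#ifdef CONFIG_X86_64
	memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
	set_nmi_gate(1, &debug);		/* #DB */
	set_nmi_gate(3, &int3);			/* #BP */
	set_nmi_gate(18, &machine_check);	/* #MC, if needed */
#endif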

-- Steve

> +#endif
>  }
> 




* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09  2:43     ` Steven Rostedt
                         ` (2 preceding siblings ...)
  2011-12-09 16:34       ` Steven Rostedt
@ 2011-12-09 16:49       ` Jason Baron
  2011-12-09 17:14         ` Steven Rostedt
  3 siblings, 1 reply; 21+ messages in thread
From: Jason Baron @ 2011-12-09 16:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Mathieu Desnoyers, H. Peter Anvin,
	Paul Turner

On Thu, Dec 08, 2011 at 09:43:36PM -0500, Steven Rostedt wrote:
> On Thu, 2011-12-08 at 14:36 -0500, Steven Rostedt wrote:
> 
> > I'm not sure how to handle this case. We could do something similar in
> > the break point code to handle the same thing. But this just seems
> > really ugly.
> > 
> > Anyone with any better ideas?
> 
> On IRC, Peter Zijlstra mentioned changing the IDT in NMI. I'm not sure
> if he was joking or not. But I decided to try it out. It seems to
> work :)
> 
> 
> What's nice is that this change is in the C code, which makes things
> much easier.
> 
> I check on nmi entry, if the nmi preempted code running with the debug
> stack. If it has, then I load a special "nmi_idt_table" that has a zero
> IST for the debug stack, which will make the debug interrupt in the NMI
> not change the stack pointer. Then on exit, it returns it to the old IDT
> table.
> 
> Just to see how bad this was, I made it make the change on every NMI. I
> didn't notice any difference, but I wasn't doing any real benchmark,
> except doing a perf top -c 1000.
> 
> But in practice, it should seldom hit, if ever. It only updates the IDT
> if it interrupted debug processing.
> 

Then, I'm wondering if the same technique can be used for the original
nmi->int3->nmi case. That is, switch the IDT when the int3 comes in,
so that the subsequent nmi will be handled on the debug stack. As you
pointed out, this nesting, and thus the IDT switching, would be rare
in practice. (I know you don't want to touch any code outside of
nmi :))

Thanks,

-Jason


* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 16:49       ` Jason Baron
@ 2011-12-09 17:14         ` Steven Rostedt
  0 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 17:14 UTC (permalink / raw)
  To: Jason Baron
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Mathieu Desnoyers, H. Peter Anvin,
	Paul Turner

On Fri, 2011-12-09 at 11:49 -0500, Jason Baron wrote:
> > 
> 
> Then, I'm wondering if the same technique can be used for the original
> nmi->int3->nmi case. That is, switch the IDT when the int3 comes in, so
> that the subsequent nmi will be handled on the debug stack. As you pointed out,
> these nesting and thus the IDT switching would be rare in
> practice. (I know you don't want to touch any code outside of nmi :))

Right, NMIs are ugly and we shouldn't uglify other code because of them.

Anyway, when could we do the switch in the int3 handler? As I said, the
NMI could come in at the moment the int3 is being processed, before the
handler has done anything. And if that NMI hits an int3, it has just
stomped over the previous int3's interrupt frame.
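
Roughly, the interleaving I'm worried about:

	int3 #1 fires          -> the #BP handler starts on its IST stack
	NMI comes in           -> before the handler could switch the IDT
	the NMI hits int3 #2   -> #BP reuses the same IST stack
	                       -> int3 #1's interrupt frame is clobbered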

The more I look at this, the more I like the original idea.

-- Steve



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 16:34       ` Steven Rostedt
@ 2011-12-09 17:19         ` Steven Rostedt
  2011-12-09 17:49           ` Borislav Petkov
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 17:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Thomas Gleixner, Peter Zijlstra,
	Linus Torvalds, H. Peter Anvin, Frederic Weisbecker, Jason Baron,
	Mathieu Desnoyers, H. Peter Anvin, Paul Turner, Borislav Petkov

[ added Boris as he's my AMD guy ]

On Fri, 2011-12-09 at 11:34 -0500, Steven Rostedt wrote:
> On Thu, 2011-12-08 at 21:43 -0500, Steven Rostedt wrote:
> 
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index a8e3eb8..906a02a 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -723,4 +723,9 @@ void __init trap_init(void)
> >  	cpu_init();
> >  
> >  	x86_init.irqs.trap_init();
> > +
> > +#ifdef CONFIG_X86_64
> > +	memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
> > +	set_nmi_gate(1, &debug);
> 
> Frederic Weisbecker told me on IRC that int3 is vector 3 (#BP), not
> vector 1 (#DB). I need to also add:
> 
> 	set_nmi_gate(3, &int3);
> 
> Mathieu says we need to worry about MCEs, so maybe we can add that stack
> as well.
> 
> 	set_nmi_gate(18, &machine_check);

Looking at the documentation, I'm not sure NMIs can interrupt an MCE.
The MCE is higher up on the exception priority chart (thanks Peter for
pointing that out). But the documentation is vague at best.

Boris, H. Peter,

Could you shed some light on this. Can an NMI interrupt an MCE in
progress?

Of course if it can, we have the NMI->MCE->NMI case to worry about too,
and that problem exists today. Actually, just having an MCE happen
inside an NMI can cause the NMI->NMI problem as well.

-- Steve
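
[ set_nmi_gate() itself isn't shown in this thread. A sketch of what it
  could look like, modeled on pack_gate()/write_idt_entry() and assuming
  nmi_idt_table is the IDT copy made in trap_init() above: ]

	#include <asm/desc.h>	/* gate_desc, pack_gate(), write_idt_entry() */

	extern gate_desc nmi_idt_table[];	/* copy made in trap_init() */

	static inline void set_nmi_gate(int gate, void *addr)
	{
		gate_desc s;

		/*
		 * Same as a normal interrupt gate, but with ist=0, so
		 * taking this vector does not switch to an IST stack.
		 */
		pack_gate(&s, GATE_INTERRUPT, (unsigned long)addr, 0, 0,
			  __KERNEL_CS);
		write_idt_entry(nmi_idt_table, gate, &s);
	}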






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 17:19         ` Steven Rostedt
@ 2011-12-09 17:49           ` Borislav Petkov
  2011-12-09 18:20             ` Steven Rostedt
  0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2011-12-09 17:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Jason Baron, Mathieu Desnoyers,
	H. Peter Anvin, Paul Turner, Borislav Petkov

Hey Steve,

On Fri, Dec 09, 2011 at 12:19:31PM -0500, Steven Rostedt wrote:
> Could you shed some light on this. Can an NMI interrupt an MCE in
> progress?

Easy, http://support.amd.com/us/Processor_TechDocs/APM_V2_24593.pdf,
section 8.5.

On amd64, #MC is, along with processor reset, the highest priority.
Judging from the text, an NMI occurring during an #MC is held pending
until we return from the #MC handler:

"When simultaneous interrupts occur, the processor transfers control
to the highest-priority interrupt handler. Lower-priority interrupts
from external sources are held pending by the processor, and they are
handled after the higher-priority interrupt is handled. Lower-priority
interrupts that result from internal sources are discarded. Those
interrupts reoccur when the high-priority interrupt handler completes
and transfers control back to the interrupted instruction."

HTH.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes
  2011-12-09 17:49           ` Borislav Petkov
@ 2011-12-09 18:20             ` Steven Rostedt
  0 siblings, 0 replies; 21+ messages in thread
From: Steven Rostedt @ 2011-12-09 18:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Thomas Gleixner,
	Peter Zijlstra, Linus Torvalds, H. Peter Anvin,
	Frederic Weisbecker, Jason Baron, Mathieu Desnoyers,
	H. Peter Anvin, Paul Turner

On Fri, 2011-12-09 at 18:49 +0100, Borislav Petkov wrote:
> Hey Steve,
> 
> On Fri, Dec 09, 2011 at 12:19:31PM -0500, Steven Rostedt wrote:
> > Could you shed some light on this. Can an NMI interrupt an MCE in
> > progress?
> 
> Easy, http://support.amd.com/us/Processor_TechDocs/APM_V2_24593.pdf,
> section 8.5.
> 
> On amd64, #MC is, along with processor reset, the highest priority.
> Judging from the text, an NMI occurring during an #MC is held pending
> until we return from the #MC handler:
> 
> "When simultaneous interrupts occur, the processor transfers control
> to the highest-priority interrupt handler. Lower-priority interrupts
> from external sources are held pending by the processor, and they are
> handled after the higher-priority interrupt is handled. Lower-priority
> interrupts that result from internal sources are discarded. Those
> interrupts reoccur when the high-priority interrupt handler completes
> and transfers control back to the interrupted instruction."

Yeah, I read that too. It's just a little confusing because of how the
SMI section talks about disabling NMIs when nothing else does. And
according to that same table, SMIs are higher priority. So why mention
that it disables NMIs?

Ah, looking at that section again, I see:

"SMM is entered using the system-management interrupt (SMI).
SMI is an external non-maskable interrupt that operates
differently from and independently of other interrupts. SMI has
priority over all other external interrupts, including NMI (see
“Priorities” on page 269 for a list of the interrupt priorities)."

That 'See "Priorities" on page 269' helps verify that this is indeed
the case.

Thanks!

-- Steve



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2011-12-09 18:20 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-08 19:30 [RFC][PATCH 0/3] x86: Find a way to allow breakpoints in NMIs Steven Rostedt
2011-12-08 19:30 ` [RFC][PATCH 1/3] x86: Do not schedule while still in NMI context Steven Rostedt
2011-12-08 19:30 ` [RFC][PATCH 2/3] x86: Document the NMI handler about not using paranoid_exit Steven Rostedt
2011-12-08 19:30 ` [RFC][PATCH 3/3] x86: Add workaround to NMI iret woes Steven Rostedt
2011-12-08 19:36   ` Steven Rostedt
2011-12-09  2:43     ` Steven Rostedt
2011-12-09  9:22       ` Peter Zijlstra
2011-12-09 15:00         ` Steven Rostedt
2011-12-09 15:10           ` Peter Zijlstra
2011-12-09 15:25             ` Steven Rostedt
2011-12-09 15:20       ` Steven Rostedt
2011-12-09 16:34       ` Steven Rostedt
2011-12-09 17:19         ` Steven Rostedt
2011-12-09 17:49           ` Borislav Petkov
2011-12-09 18:20             ` Steven Rostedt
2011-12-09 16:49       ` Jason Baron
2011-12-09 17:14         ` Steven Rostedt
2011-12-09 12:40     ` Mathieu Desnoyers
2011-12-09 13:02       ` Mathieu Desnoyers
2011-12-09 14:49         ` Steven Rostedt
2011-12-09 15:02           ` Mathieu Desnoyers
