All of lore.kernel.org
 help / color / mirror / Atom feed
* Perhaps a side effect regarding NMI returns
@ 2011-11-29  4:07 Steven Rostedt
  2011-11-29  4:53 ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-11-29  4:07 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Mathieu Desnoyers,
	Paul Turner

I was looking at the return sequence of NMIs in x86_64 and I came across
this in entry_64.S:

	jz paranoid_swapgs
	movq %rsp,%rdi			/* &pt_regs */
	call sync_regs
	movq %rax,%rsp			/* switch stack for scheduling */
	testl $_TIF_NEED_RESCHED,%ebx
	jnz paranoid_schedule
	movl %ebx,%edx			/* arg3: thread flags */
	TRACE_IRQS_ON
	ENABLE_INTERRUPTS(CLBR_NONE)
	xorl %esi,%esi 			/* arg2: oldset */
	movq %rsp,%rdi 			/* arg1: &pt_regs */
	call do_notify_resume
	DISABLE_INTERRUPTS(CLBR_NONE)
	TRACE_IRQS_OFF
	jmp paranoid_userspace
paranoid_schedule:
	TRACE_IRQS_ON
	ENABLE_INTERRUPTS(CLBR_ANY)
	call schedule
	DISABLE_INTERRUPTS(CLBR_ANY)
	TRACE_IRQS_OFF
	jmp paranoid_userspace
	CFI_ENDPROC

Specifically the code after jnz paranoid_schedule.

Just before that jnz, we swap the stack back to the task's stack (no
more NMI stack). If NEED_RESCHED is set, we jump to paranoid_schedule
and enable interrupts and call schedule.

Is there a bit of a side effect here? What happens when you enable
interrupts in NMI context? Can more NMIs come in? If not, we just went
into schedule and went off and running, and NMIs will have to wait till
the next interrupt comes in and calls iretq to re-enable NMIs. If we
lock up here, don't expect NMI watchdog to help you out.

If enabling interrupts also enables NMIs, then there's no side effect.

This email is more of an FYI than anything else. Maybe there's an issue
here, and maybe there isn't. But this is so subtle that I figured I
would bring it to other people's attention. I'll let others do the hard
work to figure out if we should worry about this or not ;-)

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29  4:07 Perhaps a side effect regarding NMI returns Steven Rostedt
@ 2011-11-29  4:53 ` Linus Torvalds
  2011-11-29  7:33   ` Paul Turner
  2011-11-29 20:09   ` Andi Kleen
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2011-11-29  4:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Mathieu Desnoyers,
	Paul Turner

On Mon, Nov 28, 2011 at 8:07 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> If enabling interrupts also enables NMIs, then there's no side effect.

The only thing that enables NMI's again is 'iret', afaik. Of course,
it can be *any* iret, so once you enable interrupts, you can get a
timer interrupt, and the timer interrupt returning with iret will
re-enable NMI's then too.

I would suggest that the actual NMI handler itself should probably
never use that paranoid exit at all, and just always use a regular
iret. Screw scheduling and TIF checks.

Probably only the exceptions that can happen *during* NMI (eg debug,
stack exception, double-fault etc) should use the paranoid versions
that try to avoid using iret.

Because I think you're right - we shouldn't call schedule() from
within the NMI handler, even if we do enable interrupts and switch to
the normal stack. Even if it probably does happen to work normally.

But I have to admit to not necessarily thinking it through a lot.

                            Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29  4:53 ` Linus Torvalds
@ 2011-11-29  7:33   ` Paul Turner
  2011-11-29 20:09   ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Paul Turner @ 2011-11-29  7:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers

On Mon, Nov 28, 2011 at 8:53 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Nov 28, 2011 at 8:07 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> If enabling interrupts also enables NMIs, then there's no side effect.
>
> The only thing that enables NMI's again is 'iret', afaik. Of course,
> it can be *any* iret, so once you enable interrupts, you can get a
> timer interrupt, and the timer interrupt returning with iret will
> re-enable NMI's then too.
>

I double checked versus the manuals tonight after seeing this and I
believe this correct -- an iret is required; some of the verbage is
reasonably explicit to this fact.

> I would suggest that the actual NMI handler itself should probably
> never use that paranoid exit at all, and just always use a regular
> iret. Screw scheduling and TIF checks.
>
> Probably only the exceptions that can happen *during* NMI (eg debug,
> stack exception, double-fault etc) should use the paranoid versions
> that try to avoid using iret.
>
> Because I think you're right - we shouldn't call schedule() from
> within the NMI handler, even if we do enable interrupts and switch to
> the normal stack. Even if it probably does happen to work normally.
>
> But I have to admit to not necessarily thinking it through a lot.
>

At a high level I think we could apply the same logic from the first
paragraph; namely that we will see a timer tick within a jiffy (where
we will have an opportunity to process TIF flags) and it's better to
push schedule() out than the NMI handler.

That said, if TIF_RESCHED has been set by a higher priority RT task
this temporary inversion may not be appreciated.  It's probably worth
fixing properly.

- Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29  4:53 ` Linus Torvalds
  2011-11-29  7:33   ` Paul Turner
@ 2011-11-29 20:09   ` Andi Kleen
  2011-11-29 20:12     ` Linus Torvalds
  2011-11-29 20:35     ` Steven Rostedt
  1 sibling, 2 replies; 22+ messages in thread
From: Andi Kleen @ 2011-11-29 20:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

Linus Torvalds <torvalds@linux-foundation.org> writes:

Not sure what problem Steven thinks is there. Once you 
switched the stack any nesting is fine.

> I would suggest that the actual NMI handler itself should probably
> never use that paranoid exit at all, and just always use a regular
> iret. Screw scheduling and TIF checks.

The reason I added them originally is to prevent the race of remote
kernel events being delayed for a long time. With Frederic's nohz work
this will be more important in the future. Today it would be eventually
picked up by the regular timer interrupts.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:09   ` Andi Kleen
@ 2011-11-29 20:12     ` Linus Torvalds
  2011-11-29 20:31       ` Andi Kleen
  2011-11-29 20:35     ` Steven Rostedt
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2011-11-29 20:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, Nov 29, 2011 at 12:09 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Not sure what problem Steven thinks is there. Once you
> switched the stack any nesting is fine.

The problem is that NMI's are blocked!

We're potentially switching to another process, there is no guarantee
there will be an "iret" *anywhere* in any patch for a long time. So
any subsequent NMI's will not be able to come in.

> The reason I added them originally is to prevent the race of remote
> kernel events being delayed for a long time. With Frederic's nohz work
> this will be more important in the future. Today it would be eventually
> picked up by the regular timer interrupts.

You're just delaying another kind of event: the next NMI.

And NMI's really shouldn't be triggering any scheduling-related events afaik.

                Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:12     ` Linus Torvalds
@ 2011-11-29 20:31       ` Andi Kleen
  2011-11-29 20:36         ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2011-11-29 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, Nov 29, 2011 at 12:12:44PM -0800, Linus Torvalds wrote:
> The problem is that NMI's are blocked!
> 
> We're potentially switching to another process, there is no guarantee
> there will be an "iret" *anywhere* in any patch for a long time. So
> any subsequent NMI's will not be able to come in.

True. That's a bug that needs to be fixed.

As a simple fix your proposal of forcing IRET sounds good.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:09   ` Andi Kleen
  2011-11-29 20:12     ` Linus Torvalds
@ 2011-11-29 20:35     ` Steven Rostedt
  2011-11-29 20:44       ` Linus Torvalds
  2011-11-29 21:28       ` Perhaps a side effect regarding NMI returns Andi Kleen
  1 sibling, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2011-11-29 20:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, 2011-11-29 at 12:09 -0800, Andi Kleen wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> Not sure what problem Steven thinks is there. Once you 
> switched the stack any nesting is fine.

Problem? I never said there was a problem. I just said there was a
subtle side effect, that may or may not be a problem. But I never said
there was an actually problem.

There may be a problem where you are profiling at a high frequency, and
then hit this and suddenly, your high frequency NMIs are stopped till
the next interrupt. If you are running at 100HZ, the next interrupt may
not happen for 10ms, where we don't have an NMI triggered until then.

Perhaps for some reason NEED_RESCHED is set, but the task is no longer
going to run. Not sure if that could even happen, but for kicks, lets
assume it can.  Then we schedule into the idle task that goes to sleep.
With NO_HZ, we may not even get another interrupt for a very long time.
But we keep NMIs disabled.

The difference here is that we call schedule without ever enabling NMIs.
Now is that a problem? It may be.


> 
> > I would suggest that the actual NMI handler itself should probably
> > never use that paranoid exit at all, and just always use a regular
> > iret. Screw scheduling and TIF checks.
> 
> The reason I added them originally is to prevent the race of remote
> kernel events being delayed for a long time. With Frederic's nohz work
> this will be more important in the future. Today it would be eventually
> picked up by the regular timer interrupts.

I'm curious to what remote events can be set by an NMI that wont take
affect in what the NMI interrupted. I would think that NMIs should be
treated as if they didn't exist, because they should not be calling
anything that sets NEED_RESCHED or grabbing locks and such.

If a process on another CPU sets NEED_RESCHED on the task for this CPU,
it doesn't matter if it is in an NMI or not. The other process still
needs to send an IPI or something (that will be blocked until the NMI
finishes) in which case, the return from the NMI will trigger the
scheduling.

I don't see the problem where we can't just ignore the flags for NMIs.
What exactly is the race that causes issues when NMIs are not checking
for scheduling and such?

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:31       ` Andi Kleen
@ 2011-11-29 20:36         ` Linus Torvalds
  2011-11-29 20:58           ` Steven Rostedt
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2011-11-29 20:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, Nov 29, 2011 at 12:31 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> As a simple fix your proposal of forcing IRET sounds good.

We could of course use iret to return to the regular kernel stack, and
do the schedule from there.

So instead of doing the manual stack switch, just build a fake iret
stack on our exception stack. Subtle and somewhat complicated. I'd
almost rather just do a blind iret, and leave the 'iret to regular
stack' as a possible future option.

                       Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:35     ` Steven Rostedt
@ 2011-11-29 20:44       ` Linus Torvalds
  2011-12-07 16:36         ` Steven Rostedt
  2012-01-08  8:55         ` [tip:perf/core] x86: Do not schedule while still in NMI context tip-bot for Linus Torvalds
  2011-11-29 21:28       ` Perhaps a side effect regarding NMI returns Andi Kleen
  1 sibling, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2011-11-29 20:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Paul Turner

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

On Tue, Nov 29, 2011 at 12:35 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> There may be a problem where you are profiling at a high frequency, and
> then hit this and suddenly, your high frequency NMIs are stopped till
> the next interrupt. If you are running at 100HZ, the next interrupt may
> not happen for 10ms, where we don't have an NMI triggered until then.

It could be much worse than that with no-HZ and some CPU-intensive
load with no timers. I don't think "sysret" enables NMI's either, so
return-to-user-mode may well leave NMI blocked.

So I really think this might be a real issue. And the simplest
approach seems to be to just remove the code. Something like the
attached (TOTALLY UNTESTED!) patch.

Hmm?

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1400 bytes --]

 arch/x86/kernel/entry_64.S |   32 --------------------------------
 1 files changed, 0 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index faf8d5e74b0b..3819ea907339 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1489,46 +1489,14 @@ ENTRY(nmi)
 	movq %rsp,%rdi
 	movq $-1,%rsi
 	call do_nmi
-#ifdef CONFIG_TRACE_IRQFLAGS
-	/* paranoidexit; without TRACE_IRQS_OFF */
-	/* ebx:	no swapgs flag */
-	DISABLE_INTERRUPTS(CLBR_NONE)
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
-	testl $3,CS(%rsp)
-	jnz nmi_userspace
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
 	jmp irq_return
-nmi_userspace:
-	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz nmi_swapgs
-	movq %rsp,%rdi			/* &pt_regs */
-	call sync_regs
-	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
-	jnz nmi_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	xorl %esi,%esi 			/* arg2: oldset */
-	movq %rsp,%rdi 			/* arg1: &pt_regs */
-	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	jmp nmi_userspace
-nmi_schedule:
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	call schedule
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	jmp nmi_userspace
 	CFI_ENDPROC
-#else
-	jmp paranoid_exit
-	CFI_ENDPROC
-#endif
 END(nmi)
 
 ENTRY(ignore_sysret)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:36         ` Linus Torvalds
@ 2011-11-29 20:58           ` Steven Rostedt
  2011-11-29 21:05             ` Linus Torvalds
  2011-11-29 22:14             ` Jason Baron
  0 siblings, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2011-11-29 20:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Mathieu Desnoyers,
	Paul Turner

On Tue, 2011-11-29 at 12:36 -0800, Linus Torvalds wrote:
> On Tue, Nov 29, 2011 at 12:31 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > As a simple fix your proposal of forcing IRET sounds good.
> 
> We could of course use iret to return to the regular kernel stack, and
> do the schedule from there.
> 
> So instead of doing the manual stack switch, just build a fake iret
> stack on our exception stack. Subtle and somewhat complicated. I'd
> almost rather just do a blind iret, and leave the 'iret to regular
> stack' as a possible future option.

Note, the reason that I've been looking at this code, is because I'm
looking at implementing your idea to handle irets in NMIs, caused by
faults, exceptions, and the reason I really care about: debugging.

Your proposal is here:

  https://lkml.org/lkml/2010/7/14/264

But to make this work, it would be really nice if the NMI routine wasn't
convoluted with the paranoid_exit code.

For things like static_branch()/jump_label and modifying ftrace nops to
calls and back, we currently use the big hammer approach stop_machine().
This keeps another CPU from executing code that is being modified.
There's also tricks to handle NMIs that may be running on the stopped
CPUs.

But people don't like the overhead that stop_machine() causes, and I
have code that can make the modifications for ftrace with break points.
By adding a break point, syncing, then modifying the code and break
point to a new op will greatly reduce the overhead. At least the latency
will be much less.

The problem is that ftrace affects code in NMIs. We tried to not trace
NMIs, but there's so many functions that NMIs call, it ended up being a
losing battle. But if we can fix the NMI enabled on iret, we can then
use the break point scheme for both static_branch() and ftrace, and
remove the overhead of stop_machine. I think there's a possibility to
use kprobes in NMIs too, with this fix.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:58           ` Steven Rostedt
@ 2011-11-29 21:05             ` Linus Torvalds
  2011-11-29 21:22               ` Steven Rostedt
  2011-11-29 22:14             ` Jason Baron
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2011-11-29 21:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Mathieu Desnoyers,
	Paul Turner

On Tue, Nov 29, 2011 at 12:58 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Note, the reason that I've been looking at this code, is because I'm
> looking at implementing your idea to handle irets in NMIs, caused by
> faults, exceptions, and the reason I really care about: debugging.
>
> Your proposal is here:
>
>  https://lkml.org/lkml/2010/7/14/264

Ahh, good that you pointed to it, I'd completely forgotten about this one.

Yeah. Simplifying NMI and not mixing up the paranoid stuff sounds like
a good idea, and then if we do the nice NMI counting thing and avoid
the whole problem with NMI and iret, that would be a nice cleanup in
itself.

So if that patch I posted works for you (with some NMI-heavy workload
like non-PEBS tracing) I think it's the way to go.

Too late for 3.2 obviously, since I don't think anybody has actually
reported the "delayed NMI's" as a real problem - so even if it's a
bug, it's not a bug we should try to fix at this stage.

                      Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 21:05             ` Linus Torvalds
@ 2011-11-29 21:22               ` Steven Rostedt
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Rostedt @ 2011-11-29 21:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Mathieu Desnoyers,
	Paul Turner

On Tue, 2011-11-29 at 13:05 -0800, Linus Torvalds wrote:

> Too late for 3.2 obviously, since I don't think anybody has actually
> reported the "delayed NMI's" as a real problem - so even if it's a
> bug, it's not a bug we should try to fix at this stage.

Oh no, I had no plans to get this into 3.2. I'm optimistic just to get
this into 3.3.

Thanks!

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:35     ` Steven Rostedt
  2011-11-29 20:44       ` Linus Torvalds
@ 2011-11-29 21:28       ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2011-11-29 21:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Linus Torvalds, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

> I'm curious to what remote events can be set by an NMI that wont take
> affect in what the NMI interrupted. I would think that NMIs should be
> treated as if they didn't exist, because they should not be calling
> anything that sets NEED_RESCHED or grabbing locks and such.

Hmm, i thought there were cases where we checked if it was in
kernel mode instead of IPIng, and assume the check is done when
returning.

But cannot come up with a concrete example right now. It may have
been wrong. Or I forgot it :)

You're right anything with interrupts should be fine because
it's just blocked.

-Andi


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:58           ` Steven Rostedt
  2011-11-29 21:05             ` Linus Torvalds
@ 2011-11-29 22:14             ` Jason Baron
  2011-11-29 22:51               ` Steven Rostedt
  1 sibling, 1 reply; 22+ messages in thread
From: Jason Baron @ 2011-11-29 22:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, Nov 29, 2011 at 03:58:21PM -0500, Steven Rostedt wrote:
> On Tue, 2011-11-29 at 12:36 -0800, Linus Torvalds wrote:
> > On Tue, Nov 29, 2011 at 12:31 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > >
> > > As a simple fix your proposal of forcing IRET sounds good.
> > 
> > We could of course use iret to return to the regular kernel stack, and
> > do the schedule from there.
> > 
> > So instead of doing the manual stack switch, just build a fake iret
> > stack on our exception stack. Subtle and somewhat complicated. I'd
> > almost rather just do a blind iret, and leave the 'iret to regular
> > stack' as a possible future option.
> 
> Note, the reason that I've been looking at this code, is because I'm
> looking at implementing your idea to handle irets in NMIs, caused by
> faults, exceptions, and the reason I really care about: debugging.
> 
> Your proposal is here:
> 
>   https://lkml.org/lkml/2010/7/14/264
> 
> But to make this work, it would be really nice if the NMI routine wasn't
> convoluted with the paranoid_exit code.
> 
> For things like static_branch()/jump_label and modifying ftrace nops to
> calls and back, we currently use the big hammer approach stop_machine().
> This keeps another CPU from executing code that is being modified.
> There's also tricks to handle NMIs that may be running on the stopped
> CPUs.
> 
> But people don't like the overhead that stop_machine() causes, and I
> have code that can make the modifications for ftrace with break points.
> By adding a break point, syncing, then modifying the code and break

But if there's still has to be some sort of 'syncing' after we add a break
point, how much are we going to save? Or I guess your're using an IPI? 

Thanks,

-Jason

> point to a new op will greatly reduce the overhead. At least the latency
> will be much less.
> 
> The problem is that ftrace affects code in NMIs. We tried to not trace
> NMIs, but there's so many functions that NMIs call, it ended up being a
> losing battle. But if we can fix the NMI enabled on iret, we can then
> use the break point scheme for both static_branch() and ftrace, and
> remove the overhead of stop_machine. I think there's a possibility to
> use kprobes in NMIs too, with this fix.
> 
> -- Steve
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 22:14             ` Jason Baron
@ 2011-11-29 22:51               ` Steven Rostedt
  2011-11-30 11:56                 ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-11-29 22:51 UTC (permalink / raw)
  To: Jason Baron
  Cc: Linus Torvalds, Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, 2011-11-29 at 17:14 -0500, Jason Baron wrote:
> On Tue, Nov 29, 2011 at 03:58:21PM -0500, Steven Rostedt wrote:

> > But people don't like the overhead that stop_machine() causes, and I
> > have code that can make the modifications for ftrace with break points.
> > By adding a break point, syncing, then modifying the code and break
> 
> But if there's still has to be some sort of 'syncing' after we add a break
> point, how much are we going to save? Or I guess your're using an IPI? 

Well, anything is better than stop machine, event synchronize_sched() ;)

But the code I have in ftrace does bulk changes. It adds a break point
to all functions, then it does the sync, then it updates all the points
to the new code.

Looking at my code, here's what I did after setting up the breakpoints:

static void do_sync_core(void *data)
{
        sync_core();
}

static void run_sync(void)
{
        int enable_irqs = irqs_disabled();

        /* We may be called with interrupts disbled. */
        if (enable_irqs)
                local_irq_enable();
        on_each_cpu(do_sync_core, NULL, 1);
        if (enable_irqs)
                local_irq_disable();
}

Note, it's fine to enable interrupts here, it's only used by ftrace.

-- Steve



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 22:51               ` Steven Rostedt
@ 2011-11-30 11:56                 ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2011-11-30 11:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jason Baron, Linus Torvalds, Andi Kleen, LKML, Ingo Molnar,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Mathieu Desnoyers, Paul Turner

On Tue, 2011-11-29 at 17:51 -0500, Steven Rostedt wrote:
> static void run_sync(void)
> {
>         int enable_irqs = irqs_disabled();
> 
>         /* We may be called with interrupts disbled. */
>         if (enable_irqs)
>                 local_irq_enable();
>         on_each_cpu(do_sync_core, NULL, 1);
>         if (enable_irqs)
>                 local_irq_disable();
> }
> 
> Note, it's fine to enable interrupts here, it's only used by ftrace. 

Still ugly as hell though, why not push that pain to the caller?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-11-29 20:44       ` Linus Torvalds
@ 2011-12-07 16:36         ` Steven Rostedt
  2011-12-07 16:44           ` Linus Torvalds
  2011-12-07 17:51           ` Andi Kleen
  2012-01-08  8:55         ` [tip:perf/core] x86: Do not schedule while still in NMI context tip-bot for Linus Torvalds
  1 sibling, 2 replies; 22+ messages in thread
From: Steven Rostedt @ 2011-12-07 16:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Paul Turner

On Tue, 2011-11-29 at 12:44 -0800, Linus Torvalds wrote:
> On Tue, Nov 29, 2011 at 12:35 PM, Steven Rostedt <rostedt@goodmis.org> wrote:

> So I really think this might be a real issue. And the simplest
> approach seems to be to just remove the code. Something like the
> attached (TOTALLY UNTESTED!) patch.
> 

Linus,

I've tested this patch quite a bit with 'perf top -C 1000' and it holds
up well. Could you give me your Signed-off-by so I can include it into
my patch queue?

Thanks,

-- Steve

 arch/x86/kernel/entry_64.S |   32 --------------------------------
 1 files changed, 0 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index faf8d5e74b0b..3819ea907339 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1489,46 +1489,14 @@ ENTRY(nmi)
 	movq %rsp,%rdi
 	movq $-1,%rsi
 	call do_nmi
-#ifdef CONFIG_TRACE_IRQFLAGS
-	/* paranoidexit; without TRACE_IRQS_OFF */
-	/* ebx:	no swapgs flag */
-	DISABLE_INTERRUPTS(CLBR_NONE)
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
-	testl $3,CS(%rsp)
-	jnz nmi_userspace
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
 	jmp irq_return
-nmi_userspace:
-	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz nmi_swapgs
-	movq %rsp,%rdi			/* &pt_regs */
-	call sync_regs
-	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
-	jnz nmi_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	xorl %esi,%esi 			/* arg2: oldset */
-	movq %rsp,%rdi 			/* arg1: &pt_regs */
-	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	jmp nmi_userspace
-nmi_schedule:
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	call schedule
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	jmp nmi_userspace
 	CFI_ENDPROC
-#else
-	jmp paranoid_exit
-	CFI_ENDPROC
-#endif
 END(nmi)
 
 ENTRY(ignore_sysret)





^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-12-07 16:36         ` Steven Rostedt
@ 2011-12-07 16:44           ` Linus Torvalds
  2011-12-07 17:31             ` Steven Rostedt
  2011-12-07 17:51           ` Andi Kleen
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2011-12-07 16:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Paul Turner

On Wed, Dec 7, 2011 at 8:36 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I've tested this patch quite a bit with 'perf top -C 1000' and it holds
> up well. Could you give me your Signed-off-by so I can include it into
> my patch queue?

Sure. If it's tested, I have no problem with signing off on it, so
just add it. Do you have some kind of commit message that makes sense
and explains the NMI shadow issue?

                     Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-12-07 16:44           ` Linus Torvalds
@ 2011-12-07 17:31             ` Steven Rostedt
  2011-12-07 17:48               ` Linus Torvalds
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Rostedt @ 2011-12-07 17:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Paul Turner

On Wed, 2011-12-07 at 08:44 -0800, Linus Torvalds wrote:

> Sure. If it's tested, I have no problem with signing off on it, so
> just add it. Do you have some kind of commit message that makes sense
> and explains the NMI shadow issue?

How about something like this:

The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.

As NMIs can not be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after sending the
NEED_RESCHED would have to wait for the NMI anyway, and after the IPI
finishes the schedule would be called as required.

There is no reason to do anything special leaving an NMI. Remove the
call to paranoid_exit and do a simple return. This not only fixes the
bug of starved NMIs, but it also cleans up the code.

-- Steve


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-12-07 17:31             ` Steven Rostedt
@ 2011-12-07 17:48               ` Linus Torvalds
  0 siblings, 0 replies; 22+ messages in thread
From: Linus Torvalds @ 2011-12-07 17:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra, H. Peter Anvin,
	Frederic Weisbecker, Thomas Gleixner, Paul Turner

On Wed, Dec 7, 2011 at 9:31 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> How about something like this:

Ack, looks good to me,

                   Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Perhaps a side effect regarding NMI returns
  2011-12-07 16:36         ` Steven Rostedt
  2011-12-07 16:44           ` Linus Torvalds
@ 2011-12-07 17:51           ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2011-12-07 17:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andi Kleen, LKML, Ingo Molnar, Peter Zijlstra,
	H. Peter Anvin, Frederic Weisbecker, Thomas Gleixner,
	Paul Turner

> I've tested this patch quite a bit with 'perf top -C 1000' and it holds
> up well. Could you give me your Signed-off-by so I can include it into
> my patch queue?

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:perf/core] x86: Do not schedule while still in NMI context
  2011-11-29 20:44       ` Linus Torvalds
  2011-12-07 16:36         ` Steven Rostedt
@ 2012-01-08  8:55         ` tip-bot for Linus Torvalds
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Linus Torvalds @ 2012-01-08  8:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, pjt, peterz, fweisbec,
	rostedt, ak, tglx, hpa

Commit-ID:  549c89b98c4530b278dde1a3f68ce5ebbb1e6304
Gitweb:     http://git.kernel.org/tip/549c89b98c4530b278dde1a3f68ce5ebbb1e6304
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Tue, 29 Nov 2011 12:44:55 -0800
Committer:  Steven Rostedt <rostedt@goodmis.org>
CommitDate: Wed, 21 Dec 2011 15:38:52 -0500

x86: Do not schedule while still in NMI context

The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.

As NMIs can not be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after sending the
NEED_RESCHED would have to wait for the NMI anyway, and after the IPI
finishes the schedule would be called as required.

There is no reason to do anything special leaving an NMI. Remove the
call to paranoid_exit and do a simple return. This not only fixes the
bug of starved NMIs, but it also cleans up the code.

Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com

Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/entry_64.S |   32 --------------------------------
 1 files changed, 0 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index faf8d5e..3819ea9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1489,46 +1489,14 @@ ENTRY(nmi)
 	movq %rsp,%rdi
 	movq $-1,%rsi
 	call do_nmi
-#ifdef CONFIG_TRACE_IRQFLAGS
-	/* paranoidexit; without TRACE_IRQS_OFF */
-	/* ebx:	no swapgs flag */
-	DISABLE_INTERRUPTS(CLBR_NONE)
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
-	testl $3,CS(%rsp)
-	jnz nmi_userspace
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
 	jmp irq_return
-nmi_userspace:
-	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz nmi_swapgs
-	movq %rsp,%rdi			/* &pt_regs */
-	call sync_regs
-	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
-	jnz nmi_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	xorl %esi,%esi 			/* arg2: oldset */
-	movq %rsp,%rdi 			/* arg1: &pt_regs */
-	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	jmp nmi_userspace
-nmi_schedule:
-	ENABLE_INTERRUPTS(CLBR_ANY)
-	call schedule
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	jmp nmi_userspace
 	CFI_ENDPROC
-#else
-	jmp paranoid_exit
-	CFI_ENDPROC
-#endif
 END(nmi)
 
 ENTRY(ignore_sysret)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-01-08  8:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-29  4:07 Perhaps a side effect regarding NMI returns Steven Rostedt
2011-11-29  4:53 ` Linus Torvalds
2011-11-29  7:33   ` Paul Turner
2011-11-29 20:09   ` Andi Kleen
2011-11-29 20:12     ` Linus Torvalds
2011-11-29 20:31       ` Andi Kleen
2011-11-29 20:36         ` Linus Torvalds
2011-11-29 20:58           ` Steven Rostedt
2011-11-29 21:05             ` Linus Torvalds
2011-11-29 21:22               ` Steven Rostedt
2011-11-29 22:14             ` Jason Baron
2011-11-29 22:51               ` Steven Rostedt
2011-11-30 11:56                 ` Peter Zijlstra
2011-11-29 20:35     ` Steven Rostedt
2011-11-29 20:44       ` Linus Torvalds
2011-12-07 16:36         ` Steven Rostedt
2011-12-07 16:44           ` Linus Torvalds
2011-12-07 17:31             ` Steven Rostedt
2011-12-07 17:48               ` Linus Torvalds
2011-12-07 17:51           ` Andi Kleen
2012-01-08  8:55         ` [tip:perf/core] x86: Do not schedule while still in NMI context tip-bot for Linus Torvalds
2011-11-29 21:28       ` Perhaps a side effect regarding NMI returns Andi Kleen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.