linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
@ 2004-12-17 23:35 Chuck Ebbert
  0 siblings, 0 replies; 21+ messages in thread
From: Chuck Ebbert @ 2004-12-17 23:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, Andi Kleen, Andrew Morton

On Tue, 14 Dec 2004 at 15:00:56 -0800 (PST) Linus Torvalds wrote:

> Checking for kernel CS also requires checking that it's not vm86 mode, 
> btw. So that's not just a "regs->xcs & 0xffff == __KERNEL_CS" either.
>
> But something like
> 
>       static inline int kernel_mode(struct pt_regs *regs)
>       {
>               return !((regs->eflags & VM_MASK) | (regs->xcs & 3));
>       }
>
> should DTRT.
>
> Can you pls double-check my thinking, and test?


 There is already a user_mode() macro in asm-i386/ptrace.h but it's slow.

 x86_64's macro is ugly but at least there's no logical-or in it.

 Your i386 code is better, so how about applying this patch (boots/runs/is faster
on my tests):

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>

--- linux-2.6.9.1/include/asm-i386/ptrace.h     2004-10-19 15:28:18.000000000 -0400
+++ linux-2.6.9.2/include/asm-i386/ptrace.h     2004-12-17 16:59:39.956099664 -0500
@@ -55,7 +55,11 @@ struct pt_regs {
 #define PTRACE_SET_THREAD_AREA    26
 
 #ifdef __KERNEL__
-#define user_mode(regs) ((VM_MASK & (regs)->eflags) || (3 & (regs)->xcs))
+static inline int kernel_mode(struct pt_regs *regs)
+{
+       return !((3 & (regs)->xcs) | (VM_MASK & (regs)->eflags));
+}
+#define user_mode(regs) (!kernel_mode(regs))
 #define instruction_pointer(regs) ((regs)->eip)
 #if defined(CONFIG_SMP) && defined(CONFIG_FRAME_POINTER)
 extern unsigned long profile_pc(struct pt_regs *regs);

--
Please take it as a sign of my infinite respect for you,
that I insist on you doing all the work.
                                        -- Rusty Russell

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16 14:51                             ` Ingo Molnar
  2004-12-16 15:08                               ` Maciej W. Rozycki
@ 2004-12-16 15:54                               ` Linus Torvalds
  1 sibling, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-12-16 15:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul,
	Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen



On Thu, 16 Dec 2004, Ingo Molnar wrote:
> 
> i also played a bit with the %ss instructions, and combined them with
> the cli/sti instructions and other instructions in various ways, and
> with a bit of experimenting found the following, somewhat surprising
> results:
>
> [ snip ]
> 
> it shows a number of interesting effects:
> 
> - "mov %eax,%ss" followed by the _same_ instruction cancels the 
>   black-hole. This i suspect is done to prevent the lockup in vm86
>   mode.

I don't think it's the "same instruction". Looking at the pattern, I think
that a "mov->ss" always checks interrupts _before_ it executes, and never 
checks interrupts _after_ it executes.

So I think the pattern is (for your athlon64):

 - regular instructions check interrupts before they execute, _except_ if 
   the "dontcheck" flag was set. They clear "dontcheck" after execution.
 - "mov->ss" always checks interrupts before it executes, regardless of
   "dontcheck". It always sets "dontcheck".
 - "sti" sets "dontcheck" if interrupts were disabled before.

So you can get two-instruction holes by doing the sequence

	/* interrupts disabled */
	mov->ss
	sti
	/* any instruction except cli/mov->ss */

but no other combination (series of "mov->ss" will always check _before_
each "mov->ss", and series of "sti" will obviously only have interrupts 
disabled for the _first_ sti).

And I suspect this is very much micro-architecture-dependent, although the 
Athlon64 rules seem very simple and straightforward.

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16 15:11                                 ` Ingo Molnar
@ 2004-12-16 15:42                                   ` Maciej W. Rozycki
  0 siblings, 0 replies; 21+ messages in thread
From: Maciej W. Rozycki @ 2004-12-16 15:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Alan Cox, Lee Revell, Andrea Arcangeli,
	Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar,
	ganzinger, lkml, Andrew Morton, Andi Kleen

On Thu, 16 Dec 2004, Ingo Molnar wrote:

> The 'sti' "shadows" the cli, i.e. we'll never get an interrupt that gets
> inbetween 'sti;cli'. I.e. sti is the black-hole generator, and 'cli' is
> in the black hole. In that sense the 'cli' is in a black hole to the
> NMI: the NMI will never see cli as the 'next to be executed'
> instruction.

 That's what I meant indeed, but I'd like to emphasise, for readers to be
aware, the black hole is not tied to the 'cli' instruction itself in any
way.  The black-holed instruction needs not be a 'cli' -- it can be an
arbitrary one except from ones creating black holes as you've observed
with your test.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16 15:08                               ` Maciej W. Rozycki
@ 2004-12-16 15:11                                 ` Ingo Molnar
  2004-12-16 15:42                                   ` Maciej W. Rozycki
  0 siblings, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2004-12-16 15:11 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Linus Torvalds, Alan Cox, Lee Revell, Andrea Arcangeli,
	Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar,
	ganzinger, lkml, Andrew Morton, Andi Kleen


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

> On Thu, 16 Dec 2004, Ingo Molnar wrote:
> 
> > c0125ee9:     1529 	fa                   	cli    
> >                  ^---------------------------------- # of profiler hits
> > c0125eea:      507 	fb                   	sti    
> > c0125eeb:        0 	fa                   	cli    
> > c0125eec:     3719 	fb                   	sti    
> > c0125eed:        0 	fa                   	cli    
> > c0125eee:     1579 	fb                   	sti    
> > c0125eef:        0 	fa                   	cli    
> > c0125ef0:     3317 	fb                   	sti    
> > c0125ef1:        0 	fa                   	cli    
> > c0125ef2:     3030 	fb                   	sti    
> > c0125ef3:        0 	fa                   	cli    
> > c0125ef4:     2497 	fa                   	cli    
> > c0125ef5:     1055 	fb                   	sti    
> > c0125ef6:        0 	fa                   	cli    
> [...]
> > the 'cli' is always a 'black hole' to the NMI, while the second of two
> > consecutive cli's are not.
> 
>  It looks like the 'sti' is actually the black hole -- remember
> interrupts are traps, that is they are probed for and taken after
> instruction execution.

The 'sti' "shadows" the cli, i.e. we'll never get an interrupt that gets
inbetween 'sti;cli'. I.e. sti is the black-hole generator, and 'cli' is
in the black hole. In that sense the 'cli' is in a black hole to the
NMI: the NMI will never see cli as the 'next to be executed'
instruction.

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16 14:51                             ` Ingo Molnar
@ 2004-12-16 15:08                               ` Maciej W. Rozycki
  2004-12-16 15:11                                 ` Ingo Molnar
  2004-12-16 15:54                               ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Maciej W. Rozycki @ 2004-12-16 15:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Alan Cox, Lee Revell, Andrea Arcangeli,
	Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar,
	ganzinger, lkml, Andrew Morton, Andi Kleen

On Thu, 16 Dec 2004, Ingo Molnar wrote:

> c0125ee9:     1529 	fa                   	cli    
>                  ^---------------------------------- # of profiler hits
> c0125eea:      507 	fb                   	sti    
> c0125eeb:        0 	fa                   	cli    
> c0125eec:     3719 	fb                   	sti    
> c0125eed:        0 	fa                   	cli    
> c0125eee:     1579 	fb                   	sti    
> c0125eef:        0 	fa                   	cli    
> c0125ef0:     3317 	fb                   	sti    
> c0125ef1:        0 	fa                   	cli    
> c0125ef2:     3030 	fb                   	sti    
> c0125ef3:        0 	fa                   	cli    
> c0125ef4:     2497 	fa                   	cli    
> c0125ef5:     1055 	fb                   	sti    
> c0125ef6:        0 	fa                   	cli    
[...]
> the 'cli' is always a 'black hole' to the NMI, while the second of two
> consecutive cli's are not.

 It looks like the 'sti' is actually the black hole -- remember interrupts
are traps, that is they are probed for and taken after instruction
execution.

  Maciej

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16  1:58                           ` Linus Torvalds
@ 2004-12-16 14:51                             ` Ingo Molnar
  2004-12-16 15:08                               ` Maciej W. Rozycki
  2004-12-16 15:54                               ` Linus Torvalds
  0 siblings, 2 replies; 21+ messages in thread
From: Ingo Molnar @ 2004-12-16 14:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul,
	Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen


* Linus Torvalds <torvalds@osdl.org> wrote:

> The irq window should actually be open every alternate instruction, I
> think. Although it's not actually architected, and I thought that
> there was some errata for some CPU about this..

i have generated an instruction-granularity profile of kernel code
executing the following sequence, driven by the NMI watchdog interrupt:

 asm ("cli; cli; sti; cli; sti; cli; sti; cli; sti; cli; sti; ");
 asm ("cli; cli; sti; cli; sti; cli; sti; cli; sti; cli; sti; ");
 asm ("cli; cli; sti; cli; sti; cli; sti; cli; sti; cli; sti; ");

the first CLI is done twice, to prove that the NMI profiling works and
that the kernel can be interrupted in those places. Then i called this
kernel code in a loop. Here's the result:

c0125ee9:     1529 	fa                   	cli    
                 ^---------------------------------- # of profiler hits
c0125eea:      507 	fb                   	sti    
c0125eeb:        0 	fa                   	cli    
c0125eec:     3719 	fb                   	sti    
c0125eed:        0 	fa                   	cli    
c0125eee:     1579 	fb                   	sti    
c0125eef:        0 	fa                   	cli    
c0125ef0:     3317 	fb                   	sti    
c0125ef1:        0 	fa                   	cli    
c0125ef2:     3030 	fb                   	sti    
c0125ef3:        0 	fa                   	cli    
c0125ef4:     2497 	fa                   	cli    
c0125ef5:     1055 	fb                   	sti    
c0125ef6:        0 	fa                   	cli    
c0125ef7:     4674 	fb                   	sti    
c0125ef8:        0 	fa                   	cli    
c0125ef9:     3827 	fb                   	sti    
c0125efa:        0 	fa                   	cli    
c0125efb:     1622 	fb                   	sti    
c0125efc:        0 	fa                   	cli    
c0125efd:     3155 	fb                   	sti    
c0125efe:        0 	fa                   	cli    
c0125eff:     1273 	fa                   	cli    
c0125f00:      512 	fb                   	sti    
c0125f01:        0 	fa                   	cli    
c0125f02:     1312 	fb                   	sti    
c0125f03:        0 	fa                   	cli    
c0125f04:     1426 	fb                   	sti    
c0125f05:        0 	fa                   	cli    
c0125f06:     1507 	fb                   	sti    
c0125f07:        0 	fa                   	cli    
c0125f08:     2720 	fb                   	sti    
c0125f09:        0 	fa                   	cli    
c0125f0a:     2469 	fa                   	cli    
c0125f0b:      787 	fb                   	sti    
c0125f0c:        0 	fa                   	cli    
c0125f0d:     2085 	fb                   	sti    
c0125f0e:        0 	fa                   	cli    

the 'cli' is always a 'black hole' to the NMI, while the second of two
consecutive cli's are not.

i also played a bit with the %ss instructions, and combined them with
the cli/sti instructions and other instructions in various ways, and
with a bit of experimenting found the following, somewhat surprising
results:

c0125f33:     1016 	66 8c d0             	mov    %ss,%ax
c0125f36:     6626 	8e d0                	mov    %eax,%ss
c0125f38:    34715 	8e d0                	mov    %eax,%ss
c0125f3a:    14682 	8e d0                	mov    %eax,%ss
c0125f3c:     4521 	8e d0                	mov    %eax,%ss
c0125f3e:     7564 	8e d0                	mov    %eax,%ss
c0125f40:     3861 	66 8e d0             	mov    %ax,%ss
c0125f43:        0 	66 8c d1             	mov    %ss,%cx
c0125f46:     1061 	66 8c da             	mov    %ds,%dx
c0125f49:     7660 	8e d1                	mov    %ecx,%ss
c0125f4b:    11322 	17                   	pop    %ss
c0125f4c:        0 	fb                   	sti    
c0125f4d:     8935 	8e d1                	mov    %ecx,%ss
c0125f4f:        0 	fa                   	cli    
c0125f50:     2198 	66 8c d1             	mov    %ss,%cx
c0125f53:      735 	66 8c da             	mov    %ds,%dx
c0125f56:        0 	8e da                	mov    %edx,%ds
c0125f58:     6400 	8e d0                	mov    %eax,%ss
c0125f5a:     3062 	8e d0                	mov    %eax,%ss
c0125f5c:     3552 	8e d0                	mov    %eax,%ss
c0125f5e:     4818 	8e d0                	mov    %eax,%ss
c0125f60:        0 	fb                   	sti    
c0125f61:        0 	66 8c da             	mov    %ds,%dx
c0125f64:    17788 	8e d0                	mov    %eax,%ss
c0125f66:    64694 	8e d0                	mov    %eax,%ss
c0125f68:    12837 	8e d0                	mov    %eax,%ss
c0125f6a:     9859 	8e d0                	mov    %eax,%ss
c0125f6c:        0 	fb                   	sti    
c0125f6d:    74506 	8e d0                	mov    %eax,%ss
c0125f6f:        0 	fb                   	sti    
c0125f70:     8589 	fa                   	cli    
c0125f71:    10248 	8e d0                	mov    %eax,%ss
c0125f73:     3825 	8e d0                	mov    %eax,%ss
c0125f75:     4903 	8e d0                	mov    %eax,%ss
c0125f77:    71134 	8e d0                	mov    %eax,%ss
c0125f79:        0 	fb                   	sti    
c0125f7a:        0 	fa                   	cli    
c0125f7b:     7461 	8e d0                	mov    %eax,%ss
c0125f7d:        0 	66 8c d0             	mov    %ss,%ax
c0125f80:    39387 	8e d0                	mov    %eax,%ss
c0125f82:        0 	fa                   	cli    
c0125f83:    41484 	8e d0                	mov    %eax,%ss
c0125f85:        0 	fa                   	cli    
c0125f86:     4490 	8e d0                	mov    %eax,%ss
c0125f88:        0 	fa                   	cli    
c0125f89:     6024 	8e d0                	mov    %eax,%ss
c0125f8b:    15454 	8e d0                	mov    %eax,%ss
c0125f8d:        0 	fb                   	sti    
c0125f8e:        0 	fb                   	sti    
c0125f8f:   115104 	fb                   	sti    
c0125f90:    39061 	fb                   	sti    

it shows a number of interesting effects:

- "mov %eax,%ss" followed by the _same_ instruction cancels the 
  black-hole. This i suspect is done to prevent the lockup in vm86
  mode.

- an %ss black-hole instruction followed by 'sti' cancels sti's
  black-hole. This is unlikely to occur in real kernel code, but we
  might want to add a 'nop' in front of safe_halt()'s sti, to make sure
  the black-hole takes effect.

- in one case a two-instruction blackhole was created - but this might 
  be some prefetch effect.

i played around with the instructions a bit to manufacture combinations
that enlengthen the black-hole but failed :) This was on an Athlon64.

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16  2:10                           ` Zwane Mwaikambo
@ 2004-12-16 13:26                             ` Alan Cox
  0 siblings, 0 replies; 21+ messages in thread
From: Alan Cox @ 2004-12-16 13:26 UTC (permalink / raw)
  To: Zwane Mwaikambo
  Cc: Linus Torvalds, Ingo Molnar, Lee Revell, Andrea Arcangeli,
	Manfred Spraul, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen

On Iau, 2004-12-16 at 02:10, Zwane Mwaikambo wrote:
> Might this be because you can't rely on interrupt suppression for back to 
> back suppressing instructions?

The documentation seems to have little to say on this. I've also not
tried things like interleaved mov->ss, sti to see how the interlocking
is done. It would make sense given the original 8086 reason was to allow
ss/sp to be loaded cleanly.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16  0:37                         ` Alan Cox
  2004-12-16  1:58                           ` Linus Torvalds
@ 2004-12-16  2:10                           ` Zwane Mwaikambo
  2004-12-16 13:26                             ` Alan Cox
  1 sibling, 1 reply; 21+ messages in thread
From: Zwane Mwaikambo @ 2004-12-16  2:10 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, Ingo Molnar, Lee Revell, Andrea Arcangeli,
	Manfred Spraul, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen

On Thu, 16 Dec 2004, Alan Cox wrote:

> On Maw, 2004-12-14 at 23:09, Linus Torvalds wrote:
> > On Tue, 14 Dec 2004, Ingo Molnar wrote:
> > Now that you mention it, I have this dim memory of the one-instruction
> > "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU 
> > literally doesn't test for async events following "sti". 
> > 
> > Or maybe that was "mov->ss". That one also has that strange "black hole"  
> > for one instruction.
> 
> The mov to ss one is a bit more magic than that however. If you write
> 3Gb of mov->ss into memory (ie about 64 pages to thrash the cache and
> slow it plus mmap repeatedly) and run it you don't get a vastly long irq
> delay at least on intel, not tried the others. 

Might this be because you can't rely on interrupt suppression for back to 
back suppressing instructions?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-16  0:37                         ` Alan Cox
@ 2004-12-16  1:58                           ` Linus Torvalds
  2004-12-16 14:51                             ` Ingo Molnar
  2004-12-16  2:10                           ` Zwane Mwaikambo
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2004-12-16  1:58 UTC (permalink / raw)
  To: Alan Cox
  Cc: Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul,
	Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen



On Thu, 16 Dec 2004, Alan Cox wrote:

> On Maw, 2004-12-14 at 23:09, Linus Torvalds wrote:
> > On Tue, 14 Dec 2004, Ingo Molnar wrote:
> > Now that you mention it, I have this dim memory of the one-instruction
> > "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU 
> > literally doesn't test for async events following "sti". 
> > 
> > Or maybe that was "mov->ss". That one also has that strange "black hole"  
> > for one instruction.
> 
> The mov to ss one is a bit more magic than that however. If you write
> 3Gb of mov->ss into memory (ie about 64 pages to thrash the cache and
> slow it plus mmap repeatedly) and run it you don't get a vastly long irq
> delay at least on intel, not tried the others. 

The irq window should actually be open every alternate instruction, I 
think. Although it's not actually architected, and I thought that there 
was some errata for some CPU about this..

			Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 23:09                       ` Linus Torvalds
  2004-12-15  8:52                         ` Ingo Molnar
@ 2004-12-16  0:37                         ` Alan Cox
  2004-12-16  1:58                           ` Linus Torvalds
  2004-12-16  2:10                           ` Zwane Mwaikambo
  1 sibling, 2 replies; 21+ messages in thread
From: Alan Cox @ 2004-12-16  0:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul,
	Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen

On Maw, 2004-12-14 at 23:09, Linus Torvalds wrote:
> On Tue, 14 Dec 2004, Ingo Molnar wrote:
> Now that you mention it, I have this dim memory of the one-instruction
> "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU 
> literally doesn't test for async events following "sti". 
> 
> Or maybe that was "mov->ss". That one also has that strange "black hole"  
> for one instruction.

The mov to ss one is a bit more magic than that however. If you write
3Gb of mov->ss into memory (ie about 64 pages to thrash the cache and
slow it plus mmap repeatedly) and run it you don't get a vastly long irq
delay at least on intel, not tried the others. 

Alan


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-15 15:44                           ` Linus Torvalds
@ 2004-12-15 16:35                             ` Ingo Molnar
  0 siblings, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2004-12-15 16:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Andi Kleen


* Linus Torvalds <torvalds@osdl.org> wrote:

> 
> 
> On Wed, 15 Dec 2004, Ingo Molnar wrote:
> > 
> > i ran the stresstest overnight with the 10 KHz NMI, and not a single
> > time did the new branch trigger, out of hundreds of millions of IRQs and
> > NMIs. I think this suggests that the race doesnt exist in current CPUs.
> 
> That may well be true, but I'm not convinced your test is meaningful
> or shows anything.
> 
> The thing is, either the CPU is busy, or it's idle. If it's busy,
> you'll never see this. And if it's idle, it will always be _in_ the
> "halt" instruction.

i deliberately started a test where there was roughly 50% idle time.

> The only way to see the case is in the borderline cases, and if/when
> there are multiple different interrupts (first non-NMI interrupt takes
> it out of the hlt, and then the NMI happens to catch the sti). And
> quite frankly, I don't see how you would stress-test it. A 1kHz timer
> interrupt with a 10kHz NMI interrupt is still very infrequent
> interrupts...

i started an infinite loop that generated disk IRQs, and started a
network test that generated network IRQs. The IRQ rate was roughly
10K/sec - this combined with the 10K/sec NMI rate should be an adequate
mix. (I also made sure that it's really default_idle that is used.)

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-15  8:52                         ` Ingo Molnar
@ 2004-12-15 15:44                           ` Linus Torvalds
  2004-12-15 16:35                             ` Ingo Molnar
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2004-12-15 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Andi Kleen



On Wed, 15 Dec 2004, Ingo Molnar wrote:
> 
> i ran the stresstest overnight with the 10 KHz NMI, and not a single
> time did the new branch trigger, out of hundreds of millions of IRQs and
> NMIs. I think this suggests that the race doesnt exist in current CPUs.

That may well be true, but I'm not convinced your test is meaningful or 
shows anything.

The thing is, either the CPU is busy, or it's idle. If it's busy, you'll
never see this. And if it's idle, it will always be _in_ the "halt"  
instruction.

The only way to see the case is in the borderline cases, and if/when there
are multiple different interrupts (first non-NMI interrupt takes it out of
the hlt, and then the NMI happens to catch the sti). And quite frankly, I
don't see how you would stress-test it. A 1kHz timer interrupt with a
10kHz NMI interrupt is still very infrequent interrupts...

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 23:09                       ` Linus Torvalds
@ 2004-12-15  8:52                         ` Ingo Molnar
  2004-12-15 15:44                           ` Linus Torvalds
  2004-12-16  0:37                         ` Alan Cox
  1 sibling, 1 reply; 21+ messages in thread
From: Ingo Molnar @ 2004-12-15  8:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Andi Kleen


* Linus Torvalds <torvalds@osdl.org> wrote:

> > find the correct patch below. I've tested it with an NMI watchdog
> > frequency artificially increased to 10 KHz, and i've instrumented the
> > new branch in the NMI handler, but even under heavy IRQ load i was not
> > able to trigger the branch. Maybe newer CPUs handle this case somehow
> > and make sti;hlt truly atomic?
> 
> Now that you mention it, I have this dim memory of the one-instruction
> "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU
> literally doesn't test for async events following "sti". 

i ran the stresstest overnight with the 10 KHz NMI, and not a single
time did the new branch trigger, out of hundreds of millions of IRQs and
NMIs. I think this suggests that the race doesnt exist in current CPUs.

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-15  6:27                     ` Avi Kivity
@ 2004-12-15  8:51                       ` Ingo Molnar
  0 siblings, 0 replies; 21+ messages in thread
From: Ingo Molnar @ 2004-12-15  8:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Linus Torvalds, Andi Kleen


* Avi Kivity <avi@argo.co.il> wrote:

> Ingo Molnar wrote:
> 
> >+	if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4)
> > 
> >
> shouldn't that cast be (unsigned char *), otherwise the test will
> always fail?

yes, i fixed this in the second patch. (the compiler warned about it
too)

	Ingo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 22:23                   ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar
  2004-12-14 22:47                     ` Ingo Molnar
  2004-12-14 23:00                     ` Linus Torvalds
@ 2004-12-15  6:27                     ` Avi Kivity
  2004-12-15  8:51                       ` Ingo Molnar
  2 siblings, 1 reply; 21+ messages in thread
From: Avi Kivity @ 2004-12-15  6:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Linus Torvalds, Andi Kleen

Ingo Molnar wrote:

>+	if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4)
>  
>
shouldn't that cast be (unsigned char *), otherwise the test will always 
fail?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 23:00                     ` Linus Torvalds
@ 2004-12-15  5:04                       ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2004-12-15  5:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul,
	Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml,
	Andrew Morton, Andi Kleen

> But something like
> 
> 	static inline int kernel_mode(struct pt_regs *regs)
> 	{
> 		return !((regs->eflags & VM_MASK) | (regs->xcs & 3));
> 	}
> 
> should DTRT.
> 
> Can you pls double-check my thinking, and test?

Reasoning looks correct to me. 

-Andi

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 22:47                     ` Ingo Molnar
  2004-12-14 23:09                       ` Linus Torvalds
@ 2004-12-14 23:41                       ` Andrea Arcangeli
  1 sibling, 0 replies; 21+ messages in thread
From: Andrea Arcangeli @ 2004-12-14 23:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Manfred Spraul, Zwane Mwaikambo, George Anzinger,
	dipankar, ganzinger, lkml, Andrew Morton, Linus Torvalds,
	Andi Kleen

On Tue, Dec 14, 2004 at 11:47:06PM +0100, Ingo Molnar wrote:
> find the correct patch below. I've tested it with an NMI watchdog
> frequency artificially increased to 10 KHz, and i've instrumented the

Nice test, it'd be nice to trigger it in real life.

on the lines of the 64k movl ss, I wonder if we could create an huge
piece of memory like this:

new_htl:
cli
sti
htl
cli
sti
htl
[..]
jmp original_hlt


and to call new_htl from original_hlt instead of sti;hlt. A dozen megs
of the above should boost the probability of getting interrupted in
"hlt" quite a bit.

However even if the nmi can execute on top of the "hlt" instruction, it
doesn't necessairly mean the next pending irq will execute before
executing 'hlt' too, so it'd need a bit more of instrumentation to as
well track down the race as happening (it's not enough to see the branch
in the nmi handler to be taken). The additional instrumentation should
be quite easy though, just copying the same nmi code to the irq handler
should do the trick.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 22:47                     ` Ingo Molnar
@ 2004-12-14 23:09                       ` Linus Torvalds
  2004-12-15  8:52                         ` Ingo Molnar
  2004-12-16  0:37                         ` Alan Cox
  2004-12-14 23:41                       ` Andrea Arcangeli
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2004-12-14 23:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Andi Kleen



On Tue, 14 Dec 2004, Ingo Molnar wrote:
> 
> find the correct patch below. I've tested it with an NMI watchdog
> frequency artificially increased to 10 KHz, and i've instrumented the
> new branch in the NMI handler, but even under heavy IRQ load i was not
> able to trigger the branch. Maybe newer CPUs handle this case somehow
> and make sti;hlt truly atomic?

Now that you mention it, I have this dim memory of the one-instruction
"sti-shadow" actually disabling NMI's (and debug traps) too. The CPU 
literally doesn't test for async events following "sti". 

Or maybe that was "mov->ss". That one also has that strange "black hole"  
for one instruction.

Hmm.. You could be evil and try to fill up 64kB worth of memory with a 
"mov %ax,%ss", and jump to it in vm86 mode and see what happens. The eip 
will just keep wrapping around...

		Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 22:23                   ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar
  2004-12-14 22:47                     ` Ingo Molnar
@ 2004-12-14 23:00                     ` Linus Torvalds
  2004-12-15  5:04                       ` Andi Kleen
  2004-12-15  6:27                     ` Avi Kivity
  2 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2004-12-14 23:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Andi Kleen



On Tue, 14 Dec 2004, Ingo Molnar wrote:
> 
> indeed, there could be a connection, and it's certainly a fun race. The
> proper fix is Manfred's suggestion: check whether the EIP is a kernel
> text address, and if yes, whether it's a HLT instruction - and if yes
> then increase EIP by 1.

You do it the wrong way, though. This is not safe:

	if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4)

does _entirely_ the wrong thing if CS is not the kernel CS. 

It can trigger with a regular use CS if you were to run the 4G:4G patches, 
but more realistically, I think you can make ii trigger even with a 
standard kernel by creating a local code segment in your LDT, and then 
trying to confuse the kernel that way.

Now, as long as the _only_ thing it does is increment the eip, the worst 
that can happen is that it screws over the user program that must have 
worked at this a bit, but the basic point is that you shouldn't do this. 
In _theory_ you could confuse a real program that wasn't doing anything 
bad.

Checking for kernel CS also requires checking that it's not vm86 mode, 
btw. So that's not just a "regs->xcs & 0xffff == __KERNEL_CS" either.

But something like

	static inline int kernel_mode(struct pt_regs *regs)
	{
		return !((regs->eflags & VM_MASK) | (regs->xcs & 3));
	}

should DTRT.

Can you pls double-check my thinking, and test?

			Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 22:23                   ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar
@ 2004-12-14 22:47                     ` Ingo Molnar
  2004-12-14 23:09                       ` Linus Torvalds
  2004-12-14 23:41                       ` Andrea Arcangeli
  2004-12-14 23:00                     ` Linus Torvalds
  2004-12-15  6:27                     ` Avi Kivity
  2 siblings, 2 replies; 21+ messages in thread
From: Ingo Molnar @ 2004-12-14 22:47 UTC (permalink / raw)
  To: Lee Revell
  Cc: Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Linus Torvalds, Andi Kleen


* Ingo Molnar <mingo@elte.hu> wrote:

> indeed, there could be a connection, and it's certainly a fun race.
> The proper fix is Manfred's suggestion: check whether the EIP is a
> kernel text address, and if yes, whether it's a HLT instruction - and
> if yes then increase EIP by 1. I've included the fix in the -33-02 -RT
> patch. Andrew, Linus: upstream fix is below - i think it's post-2.6.10
> stuff. Tested it on SMP and UP x86, using both the IO-APIC and the
> local-APIC based NMI watchdog.
> 
> i think x64 needs a similar fix as well.

find the correct patch below. I've tested it with an NMI watchdog
frequency artificially increased to 10 KHz, and i've instrumented the
new branch in the NMI handler, but even under heavy IRQ load i was not
able to trigger the branch. Maybe newer CPUs handle this case somehow
and make sti;hlt truly atomic? I tried this on an old Celeron
(Mendocino) and on an Athlon64.

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/arch/i386/kernel/traps.c.orig
+++ linux/arch/i386/kernel/traps.c
@@ -670,6 +670,18 @@ fastcall void do_nmi(struct pt_regs * re
 
 	cpu = smp_processor_id();
 
+	/*
+	 * Fix up obscure CPU behavior: if we interrupt safe_hlt() via
+	 * the NMI then we might miss a reschedule if an interrupt is
+	 * posted to the CPU and executes before the HLT instruction.
+	 *
+	 * We check whether the EIP is kernelspace, and if yes, whether
+	 * the instruction is HLT:
+	 */
+	if (__kernel_text_address(regs->eip) &&
+					*(unsigned char *)regs->eip == 0xf4)
+		regs->eip++;
+
 #ifdef CONFIG_HOTPLUG_CPU
 	if (!cpu_online(cpu)) {
 		nmi_exit();

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch, 2.6.10-rc3] safe_hlt() & NMIs
  2004-12-14 21:40                 ` Lee Revell
@ 2004-12-14 22:23                   ` Ingo Molnar
  2004-12-14 22:47                     ` Ingo Molnar
                                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Ingo Molnar @ 2004-12-14 22:23 UTC (permalink / raw)
  To: Lee Revell
  Cc: Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo,
	George Anzinger, dipankar, ganzinger, lkml, Andrew Morton,
	Linus Torvalds, Andi Kleen


* Lee Revell <rlrevell@joe-job.com> wrote:

> On Sun, 2004-12-12 at 13:15 +0100, Andrea Arcangeli wrote:
> > Overall this is a very minor issue (unless HZ is 0), it would only
> > introduce a 1/HZ latency to the irq that get posted while the nmi
> > handler is running, and the nmi handlers never runs in production.
> 
> Ingo, couldn't this account for some of the inexplicable outliers some
> people were seeing in latency tests?

indeed, there could be a connection, and it's certainly a fun race. The
proper fix is Manfred's suggestion: check whether the EIP is a kernel
text address, and if yes, whether it's a HLT instruction - and if yes
then increase EIP by 1. I've included the fix in the -33-02 -RT patch.
Andrew, Linus: upstream fix is below - i think it's post-2.6.10 stuff.
Tested it on SMP and UP x86, using both the IO-APIC and the local-APIC
based NMI watchdog.

i think x64 needs a similar fix as well.

	Ingo

--- linux/arch/i386/kernel/traps.c.orig
+++ linux/arch/i386/kernel/traps.c
@@ -670,6 +670,17 @@ fastcall void do_nmi(struct pt_regs * re
 
 	cpu = smp_processor_id();
 
+	/*
+	 * Fix up obscure CPU behavior: if we interrupt safe_hlt() via
+	 * the NMI then we might miss a reschedule if an interrupt is
+	 * posted to the CPU and executes before the HLT instruction.
+	 *
+	 * We check whether the EIP is kernelspace, and if yes, whether
+	 * the instruction is HLT:
+	 */
+	if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4)
+		regs->eip++;
+
 #ifdef CONFIG_HOTPLUG_CPU
 	if (!cpu_online(cpu)) {
 		nmi_exit();

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2004-12-17 23:38 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-12-17 23:35 [patch, 2.6.10-rc3] safe_hlt() & NMIs Chuck Ebbert
  -- strict thread matches above, loose matches on Subject: below --
2004-12-11  3:29 RCU question George Anzinger
2004-12-11 14:52 ` Zwane Mwaikambo
2004-12-11 16:32   ` Manfred Spraul
2004-12-11 16:52     ` George Anzinger
2004-12-12  2:53       ` Zwane Mwaikambo
2004-12-12  8:59         ` Manfred Spraul
2004-12-12  9:37           ` Andrea Arcangeli
2004-12-12 10:22             ` Manfred Spraul
2004-12-12 12:15               ` Andrea Arcangeli
2004-12-14 21:40                 ` Lee Revell
2004-12-14 22:23                   ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar
2004-12-14 22:47                     ` Ingo Molnar
2004-12-14 23:09                       ` Linus Torvalds
2004-12-15  8:52                         ` Ingo Molnar
2004-12-15 15:44                           ` Linus Torvalds
2004-12-15 16:35                             ` Ingo Molnar
2004-12-16  0:37                         ` Alan Cox
2004-12-16  1:58                           ` Linus Torvalds
2004-12-16 14:51                             ` Ingo Molnar
2004-12-16 15:08                               ` Maciej W. Rozycki
2004-12-16 15:11                                 ` Ingo Molnar
2004-12-16 15:42                                   ` Maciej W. Rozycki
2004-12-16 15:54                               ` Linus Torvalds
2004-12-16  2:10                           ` Zwane Mwaikambo
2004-12-16 13:26                             ` Alan Cox
2004-12-14 23:41                       ` Andrea Arcangeli
2004-12-14 23:00                     ` Linus Torvalds
2004-12-15  5:04                       ` Andi Kleen
2004-12-15  6:27                     ` Avi Kivity
2004-12-15  8:51                       ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).