* Request for help: what did I do wrong with idtentry?
From: Andy Lutomirski @ 2014-11-14 22:25 UTC
  To: H. Peter Anvin, Steven Rostedt, Andi Kleen, Ingo Molnar
  Cc: Tony Luck, Borislav Petkov, linux-kernel

This patch:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=5b58c7ac3034a8d62e42a6ca91c7a95e887542e7

which is functionally identical to:

http://lkml.kernel.org/g/76efedcd3622e61feb2982eabe52a6bf531396a9.1415917623.git.luto@amacapital.net

causes Tony's MCE stress test to fail, presumably when some CPU either
becomes permanently non-interruptible or otherwise wanders off into
the weeds.

Could any of you take a quick look and see if anything stands out?
I'm a bit stuck, since I don't have hardware that can run the stress
test, and I can't reproduce any problems with this patch at all.

Thanks,
Andy

* RE: Request for help: what did I do wrong with idtentry?
From: Luck, Tony @ 2014-11-15  0:52 UTC
  To: Andy Lutomirski, H. Peter Anvin, Steven Rostedt, Andi Kleen, Ingo Molnar
  Cc: Borislav Petkov, linux-kernel

> causes Tony's MCE stress test to fail, presumably when some CPU either
> becomes permanently non-interruptible or otherwise wanders off into
> the weeds.

It might be that recent "improvements" I made to my test harness have
messed things up.  I trimmed one delay (between injection and consumption),
but it turns out the other delay in the code never gets executed (because we
take a SIGBUS on consumption and then longjmp).  So my test, which used
to pause a bit between iterations, was running consumption and injection
of the next error almost back to back.

This meant the serial console was a huge bottleneck (especially as my
development BIOS is also kicking its own debug junk onto the same port).
Some of the errors pointed obliquely at console.

I've slowed things back down to where they used to be, and things are
ticking along nicely (with 0.6 second delay between iterations).  Just
passed the 2800 mark and still going.  I'm leaving it running over the
weekend - if it makes it into the 50k level I'm willing to call it good.

-Tony

* Re: Request for help: what did I do wrong with idtentry?
From: Andy Lutomirski @ 2014-11-15  1:21 UTC
  To: Luck, Tony
  Cc: H. Peter Anvin, Steven Rostedt, Andi Kleen, Ingo Molnar,
	Borislav Petkov, linux-kernel

On Fri, Nov 14, 2014 at 4:52 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> causes Tony's MCE stress test to fail, presumably when some CPU either
>> becomes permanently non-interruptible or otherwise wanders off into
>> the weeds.
>
> It might be that recent "improvements" I made to my test harness have
> messed things up.  I trimmed one delay (between injection and consumption),
> but it turns out the other delay in the code never gets executed (because we
> take a SIGBUS on consumption and then longjmp).  So my test, which used
> to pause a bit between iterations, was running consumption and injection
> of the next error almost back to back.

Hmm.

Am I right that the timeout code in mce.c is overly aggressive, too?

>
> This meant the serial console was a huge bottleneck (especially as my
> development BIOS is also kicking its own debug junk onto the same port).
> Some of the errors pointed obliquely at console.
>
> I've slowed things back down to where they used to be, and things are
> ticking along nicely (with 0.6 second delay between iterations).  Just
> passed the 2800 mark and still going.  I'm leaving it running over the
> weekend - if it makes it into the 50k level I'm willing to call it good.
>

Phew :)

FWIW, I've confirmed that my code survives int3 from userspace, int3
from normal kernel code, and int3 from kernel with user gs.  I'm not
completely thrilled with what it does to double_fault, though.  If we
somehow get a double fault caused by an interrupt hitting userspace
with a bad kernel_stack, then we'll end up page faulting in the
double_fault prologue.  I'm not convinced that this is worth worrying
about.  It would be easy enough to fix, though, even if it would
further uglify the code.

--Andy

* Re: Request for help: what did I do wrong with idtentry?
From: Andi Kleen @ 2014-11-15 18:28 UTC
  To: Andy Lutomirski
  Cc: Luck, Tony, H. Peter Anvin, Steven Rostedt, Andi Kleen,
	Ingo Molnar, Borislav Petkov, linux-kernel

> I'm not
> completely thrilled with what it does to double_fault, though.  If we
> somehow get a double fault caused by an interrupt hitting userspace
> with a bad kernel_stack, then we'll end up page faulting in the
> double_fault prologue.  I'm not convinced that this is worth worrying
> about.  It would be easy enough to fix, though, even if it would
> further uglify the code.

If you're "cleaning up" good and working code the functionality should
be the same as before. The old code handled this situation fine. 
So your new code should handle this too.

In general, yes, handling all the corner cases makes code ugly.
That is how the existing code became what it is.

-Andi


* Re: Request for help: what did I do wrong with idtentry?
From: Andy Lutomirski @ 2014-11-15 18:45 UTC
  To: Andi Kleen
  Cc: Luck, Tony, H. Peter Anvin, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, linux-kernel

On Sat, Nov 15, 2014 at 10:28 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> I'm not
>> completely thrilled with what it does to double_fault, though.  If we
>> somehow get a double fault caused by an interrupt hitting userspace
>> with a bad kernel_stack, then we'll end up page faulting in the
>> double_fault prologue.  I'm not convinced that this is worth worrying
>> about.  It would be easy enough to fix, though, even if it would
>> further uglify the code.
>
> If you're "cleaning up" good and working code the functionality should
> be the same as before. The old code handled this situation fine.
> So your new code should handle this too.

First, this failure mode should be almost impossible.  We'd really
have to screw up to have the kernel stack point to a bad address.
(This isn't the stack *pointer* being bad -- it's the value in the
TSS.)

If this happens, the existing code will die (no recovery possible,
unlike with normal OOPSes).  The new code will log a kernel-mode page
fault on the DF stack (as shown in the stack trace, assuming that
logic works), complain some more in do_exit, and make some sort of
effort to recover, which might even work.

In other words, I'd be happy to "fix" it, but I'm not entirely
convinced that this change should count as a regression in the first
place.

If we go for the fix-it approach, we could either add a fixup in
sync_regs and probe the kernel_stack, or add a paranoid=2 mode for
double_fault.

>
> In general, yes, handling all the corner cases makes code ugly.
> That is how the existing code became what it is.

Most of those corner cases are at least in code paths that are
supposed to work.  This particular corner case is in a handler that's
just trying to print something useful rather than silently rebooting,
and it should still work well enough to print something useful.

--Andy
