STI architectural question (and lretq -- I'm not even kidding)

linux-kernel.vger.kernel.org archive mirror

* STI architectural question (and lretq -- I'm not even kidding)
@ 2014-07-23  0:10 Andy Lutomirski
  2014-07-23  1:03 ` Linus Torvalds
  2014-07-23 21:18 ` Andi Kleen
  0 siblings, 2 replies; 11+ messages in thread
From: Andy Lutomirski @ 2014-07-23  0:10 UTC (permalink / raw)
  To: H. Peter Anvin, Borislav Petkov, linux-kernel, Linus Torvalds, X86 ML

It turns out that lretq-to-outer-privilege-level is about 100 cycles
faster than iretq on Sandy Bridge.  This may be enough to be worth
using for returns to userspace, despite the added complexity and
scariness.

Here's where it gets nasty.  Before using lretq, we have to have
interrupts on (lretq, unlike iretq, doesn't restore RFLAGS), and we
have to have gs == usergs.  If an asynchronous non-paranoid interrupt
hits in that window, we're screwed, and I don't really want to teach
the IRQ code to handle this special case.

There's an easy "solution": do sti;lretq.  This even works in my
limited testing (whereas sti;nop;lretq blows up very quickly).
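
As a rough illustration of that difference, here is a toy user-space C
model of the one-instruction delay that STI applies to maskable
interrupts.  It is only a sketch: the enum, the function name and the
"program" encoding are all made up for this example, and it models
nothing beyond IF and the STI shadow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical encoding: one enum value per instruction. */
enum insn { INSN_STI, INSN_NOP, INSN_LRETQ };

/*
 * Returns true if a pending maskable IRQ could be delivered at some
 * instruction boundary before LRETQ executes, i.e. while still in
 * kernel mode with the user GS loaded.
 */
bool irq_can_hit_before_lretq(const enum insn *prog, size_t n)
{
    bool iflag = false;   /* RFLAGS.IF, cleared on entry */
    bool shadow = false;  /* STI shadow covering one boundary */

    assert(prog != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Boundary before instruction i: is an IRQ deliverable? */
        if (iflag && !shadow)
            return true;
        shadow = false;

        switch (prog[i]) {
        case INSN_STI:
            if (!iflag)
                shadow = true;  /* delay applies when IF was 0 */
            iflag = true;
            break;
        case INSN_NOP:
            break;
        case INSN_LRETQ:
            return false;       /* reached user mode first */
        }
    }
    return false;
}
```

In this model {STI, LRETQ} never exposes a boundary to a maskable IRQ,
while {STI, NOP, LRETQ} does, at the boundary before the lretq.  It
says nothing about NMI or #MC.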

But here's the problem: what happens if an NMI or MCE happens between
the sti and the lretq?  I think an MCE just might be okay -- it's not
really recoverable anyway.  (Except for the absurd MCE broadcast crap,
which may cause this to be a problem.)  But what about an NMI between
sti and lretq?

The NMI itself won't cause any problem.  But the NMI will return to
the lretq with interrupts *on*, and we lose.
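
The failure mode can be sketched with a toy C model, assuming the NMI
is *not* held off by the STI shadow (the very thing that is in
question).  The struct and function names are hypothetical, and the
model tracks only IF and the shadow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU state: just IF and the STI shadow. */
struct toy_cpu {
    bool iflag;       /* RFLAGS.IF */
    bool sti_shadow;  /* maskable IRQs held off for one boundary */
};

/* STI: sets IF; the shadow applies when IF was previously clear. */
void toy_sti(struct toy_cpu *c)
{
    assert(c != NULL);
    if (!c->iflag)
        c->sti_shadow = true;
    c->iflag = true;
}

/*
 * An NMI taken right after STI, followed by the handler's IRET.
 * IRET restores the saved IF but does not recreate the STI shadow.
 */
void toy_nmi_and_iret(struct toy_cpu *c)
{
    assert(c != NULL);
    bool saved_if = c->iflag;  /* saved in the NMI frame's RFLAGS */
    c->iflag = false;          /* interrupt gate clears IF */
    c->sti_shadow = false;     /* the shadow is gone after the NMI */
    /* ... NMI handler runs here ... */
    c->iflag = saved_if;       /* IRET restores IF = 1 */
}

/* Can a pending maskable IRQ be delivered right now? */
bool toy_irq_deliverable(const struct toy_cpu *c)
{
    assert(c != NULL);
    return c->iflag && !c->sti_shadow;
}
```

After toy_sti() alone an IRQ is not deliverable; after toy_sti()
followed by toy_nmi_and_iret() it is, with the lretq still to come.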

The Intel SDM helpfully says "The IF flag and the STI and CLI
instructions do not prohibit the generation of exceptions and NMI
interrupts. NMI interrupts (and SMIs) may be blocked for one
macroinstruction following an STI."  Does that mean that this isn't a
problem?  What about on AMD?

An alternative would be to do a manual fixup in the NMI and MCE code.  Yuck.

The implementation is here, in case you want to play with it:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/tag/?id=lretq-to-userspace

--Andy

P.S. I'm sure there will be any number of CPU errata here, especially
since lretq-from-long-mode-to-outer-privilege-level is involved, which
might be completely unused in any major OS.

P.P.S. At least on Sandy Bridge, lretq has the same 16-bit SS problem as iret.

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23  0:10 STI architectural question (and lretq -- I'm not even kidding) Andy Lutomirski
@ 2014-07-23  1:03 ` Linus Torvalds
  2014-07-23  1:33   ` Andy Lutomirski
  2014-07-23  9:40   ` Borislav Petkov
  2014-07-23 21:18 ` Andi Kleen
  1 sibling, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2014-07-23  1:03 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: H. Peter Anvin, Borislav Petkov, linux-kernel, X86 ML

On Tue, Jul 22, 2014 at 5:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> But here's the problem: what happens if an NMI or MCE happens between
> the sti and the lretq?  I think an MCE just might be okay -- it's not
> really recoverable anyway.  (Except for the absurd MCE broadcast crap,
> which may cause this to be a problem.)  But what about an NMI between
> sti and lretq?

Sadly, it's not architected.

The "mov ss" and "pop ss" do indeed suppress even NMI. And that *has*
to be true, because in legacy real mode - where there is no protection
domain change, and the "lss" instruction didn't originally exist - the
"pop/mov ss" and "mov sp" instruction sequence had to be entirely
atomic. And this is even very officially documented. From the intel
system manual:

    "A POP SS instruction inhibits all interrupts, including the NMI
interrupt, until after execution of the next instruction. This action
allows sequential execution of POP SS and MOV ESP, EBP instructions
without the danger of having an invalid stack during an interrupt.
However, use of the LSS instruction is the preferred method of loading
the SS and ESP registers"

However, while "sti" has conceptually the same one-instruction
interrupt window disable as mov/pop ss, it looks like Intel broke it
for NMI. The documentation only talks about "external, maskable
interrupts", and while I suspect *many* micro-architectures also end
up disabling NMI for the next instruction, there are many reasons to
think not all do.

See for example

    http://www.sandpile.org/x86/inter.htm

and note #5 under external interrupt suppression.

Now, sandpile is pretty old, but Christian Ludloff used to get things
like that right.

So I'm afraid that "sti; lret" is not guaranteed to be architecturally
NMI-safe. But it *might* be safe on certain micro-architectures, and
maybe somebody inside Intel or AMD can give us a hint about when it is
safe and when it isn't.

                Linus

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23  1:03 ` Linus Torvalds
@ 2014-07-23  1:33   ` Andy Lutomirski
  2014-07-23 10:49     ` Borislav Petkov
  2014-07-23  9:40   ` Borislav Petkov
  1 sibling, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2014-07-23  1:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: H. Peter Anvin, Borislav Petkov, linux-kernel, X86 ML

On Tue, Jul 22, 2014 at 6:03 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jul 22, 2014 at 5:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> But here's the problem: what happens if an NMI or MCE happens between
>> the sti and the lretq?  I think an MCE just might be okay -- it's not
>> really recoverable anyway.  (Except for the absurd MCE broadcast crap,
>> which may cause this to be a problem.)  But what about an NMI between
>> sti and lretq?
>
> Sadly, it's not architected.
>
> The "mov ss" and "pop ss" do indeed suppress even NMI. And that *has*
> to be true, because in legacy real mode - where there is no protection
> domain change, and the "lss" instruction didn't originally exist - the
> "pop/mov ss" and "mov sp" instruction sequence had to be entirely
> atomic. And this is even very officially documented. From the intel
> system manual:
>
>     "A POP SS instruction inhibits all interrupts, including the NMI
> interrupt, until after execution of the next instruction. This action
> allows sequential execution of POP SS and MOV ESP, EBP instructions
> without the danger of having an invalid stack during an interrupt.
> However, use of the LSS instruction is the preferred method of loading
> the SS and ESP registers"
>
> However, while "sti" has conceptually the same one-instruction
> interrupt window disable as mov/pop ss, it looks like Intel broke it
> for NMI. The documentation only talks about "external, maskable
> interrupts", and while I suspect *many* micro-architectures also end
> up disabling NMI for the next instruction, there are many reasons to
> think not all do.
>
> See for example
>
>     http://www.sandpile.org/x86/inter.htm
>
> and note #5 under external interrupt suppression.
>
> Now, sandpile is pretty old, but Christian Ludloff used to get things
> like that right.
>
> So I'm afraid that "sti; lret" is not guaranteed to be architecturally
> NMI-safe. But it *might* be safe on certain micro-architectures, and
> maybe somebody inside Intel or AMD can give us a hint about when it is
> safe and when it isn't.

:)  I'm apparently not the only one who finds playing with evil
architectural stuff to be unreasonably fun.

FWIW, both the VMX and SVM code in kvm seem to explicitly implement
NMI suppression in the STI window.  I can't figure out how #MC
broadcast delivery works.  Grr.

In any event, at least the fixup would be straightforward: just do
something like:

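extern char native_sti_before_lret_to_userspace[], native_lret_to_userspace[];

void fixup_lret_nmi(struct pt_regs *regs)
{
	/* Hit in kernel mode exactly at the lretq?  Rewind to the sti
	 * with IF clear so the sti shadow covers the lretq again. */
	if (regs->ip == (unsigned long)native_lret_to_userspace &&
	    !user_mode_vm(regs)) {
		regs->ip = (unsigned long)native_sti_before_lret_to_userspace;
		regs->flags &= ~X86_EFLAGS_IF;
	}
}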

and call it from the NMI and MCE code.  This is probably preferable to
relying on special friendly microarchitectures.
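
The same logic can be exercised outside the kernel with a small mock.
This is only a sketch: struct mock_regs, the two addresses and the
helper names are invented for the example and stand in for pt_regs,
the real entry labels and user_mode_vm().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_EFLAGS_IF UINT64_C(0x200)  /* RFLAGS.IF is bit 9 */

/* Hypothetical stand-in for the few pt_regs fields used here. */
struct mock_regs {
    uint64_t ip;
    uint64_t flags;
    uint64_t cs;
};

/* Made-up addresses: sti is one byte, so the lretq sits at +1. */
#define MOCK_STI_ADDR   UINT64_C(0x1000)
#define MOCK_LRETQ_ADDR UINT64_C(0x1001)

/* Kernel mode means CPL 0, i.e. the low two bits of CS are clear. */
bool mock_kernel_mode(const struct mock_regs *regs)
{
    assert(regs != NULL);
    return (regs->cs & 3) == 0;
}

/* Rewind to the sti with IF clear if we were hit at the lretq. */
void mock_fixup_lret_nmi(struct mock_regs *regs)
{
    assert(regs != NULL);
    if (regs->ip == MOCK_LRETQ_ADDR && mock_kernel_mode(regs)) {
        regs->ip = MOCK_STI_ADDR;
        regs->flags &= ~MOCK_EFLAGS_IF;
    }
}
```

A frame with ip at the lretq and a kernel CS gets rewound to the sti
with IF cleared; a frame with a user CS is left untouched.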

Of course, this does nothing at all to protect us from #MC after sti
on return from #MC to userspace, but I think we're screwed regardless
-- we could just as easily get a second #MC before the sti.  Machine
check broadcast was the worst idea ever.

Anyway, I updated the tag.

--Andy

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23  1:33   ` Andy Lutomirski
@ 2014-07-23 10:49     ` Borislav Petkov
  2014-07-23 15:12       ` Andy Lutomirski
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2014-07-23 10:49 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Linus Torvalds, H. Peter Anvin, linux-kernel, X86 ML

On Tue, Jul 22, 2014 at 06:33:02PM -0700, Andy Lutomirski wrote:
> Of course, this does nothing at all to protect us from #MC after sti
> on return from #MC to userspace, but I think we're screwed regardless
> -- we could just as easily get a second #MC before the sti. Machine
> check broadcast was the worst idea ever.

Please do not think that a raised #MC means the machine is gone. There
are MC errors which are reported with the exception mechanism and from
which we can and do recover, regardless of broadcasting or not.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23 10:49     ` Borislav Petkov
@ 2014-07-23 15:12       ` Andy Lutomirski
  2014-07-23 15:23         ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2014-07-23 15:12 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: H. Peter Anvin, Linus Torvalds, linux-kernel, X86 ML

On Jul 23, 2014 3:49 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Tue, Jul 22, 2014 at 06:33:02PM -0700, Andy Lutomirski wrote:
> > Of course, this does nothing at all to protect us from #MC after sti
> > on return from #MC to userspace, but I think we're screwed regardless
> > -- we could just as easily get a second #MC before the sti. Machine
> > check broadcast was the worst idea ever.
>
> Please do not think that a raised #MC means the machine is gone. There
> are MC errors which are reported with the exception mechanism and from
> which we can and do recover, regardless of broadcasting or not.
>

How are we supposed to survive two machine checks in rapid succession?
The second will fire as soon as the first one is acked, I imagine.
Unless we switch stacks before acking the MCE, the return address of
the first one will be lost.
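
A toy C model of that concern, assuming #MC is delivered on a fixed
IST stack and that "acked" means clearing MCG_STATUS.MCIP.  All names
are hypothetical, and the whole exception frame is reduced to a single
saved return address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the #MC-relevant CPU state. */
struct mc_cpu {
    bool mcip;              /* models MCG_STATUS.MCIP */
    bool shutdown;          /* #MC arrived while MCIP was still set */
    uint64_t ist_frame_ip;  /* return IP at the fixed IST stack top */
};

/* Deliver a #MC that interrupted code at interrupted_ip. */
void mc_deliver(struct mc_cpu *cpu, uint64_t interrupted_ip)
{
    assert(cpu != NULL);
    if (cpu->mcip) {
        cpu->shutdown = true;  /* nested #MC with MCIP set: shutdown */
        return;
    }
    cpu->mcip = true;
    /* IST delivery always starts at the same stack top, so the
     * frame lands in the same slot every time. */
    cpu->ist_frame_ip = interrupted_ip;
}

/* The handler acks the #MC by clearing MCIP. */
void mc_ack(struct mc_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->mcip = false;
}
```

Delivering one #MC, acking it, and delivering a second before the
handler has left the IST stack leaves only the second return address
in ist_frame_ip; the first is overwritten.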

In any event, I'll do a manual fixup for this in my patch.


--Andy

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23 15:12       ` Andy Lutomirski
@ 2014-07-23 15:23         ` Borislav Petkov
  0 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2014-07-23 15:23 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: H. Peter Anvin, Linus Torvalds, linux-kernel, X86 ML

On Wed, Jul 23, 2014 at 08:12:32AM -0700, Andy Lutomirski wrote:
> How are we supposed to survive two machine checks in rapid succession?
> The second will fire as soon as the first one is acked, I imagine.
> Unless we switch stacks before acking the MCE, the return address of
> the first one will be lost.

Oh, that might not fly but in that case the box probably deserves to die
anyway.

I was addressing what you said earlier: "But here's the problem: what
happens if an NMI or MCE happens between the sti and the lretq? I think
an MCE just might be okay -- it's not really recoverable anyway."

An MC Exception can be recoverable and we can recover. The fact that we
raise an exception doesn't always mean we die.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23  1:03 ` Linus Torvalds
  2014-07-23  1:33   ` Andy Lutomirski
@ 2014-07-23  9:40   ` Borislav Petkov
  1 sibling, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2014-07-23  9:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andy Lutomirski, H. Peter Anvin, linux-kernel, X86 ML

On Tue, Jul 22, 2014 at 06:03:35PM -0700, Linus Torvalds wrote:
> So I'm afraid that "sti; lret" is not guaranteed to be architecturally
> NMI-safe. But it *might* be safe on certain micro-architectures, and
> maybe somebody inside Intel or AMD can give us a hint about when it is
> safe and when it isn't.

From AMD's APM, STI section:

"Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby
allowing external interrupts received on the INTR input. Interrupts
received on the non-maskable interrupt (NMI) input are not affected by
this instruction."

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23  0:10 STI architectural question (and lretq -- I'm not even kidding) Andy Lutomirski
  2014-07-23  1:03 ` Linus Torvalds
@ 2014-07-23 21:18 ` Andi Kleen
  2014-07-23 21:52   ` Andy Lutomirski
  1 sibling, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2014-07-23 21:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Borislav Petkov, linux-kernel, Linus Torvalds, X86 ML

Andy Lutomirski <luto@amacapital.net> writes:

> I think an MCE just might be okay -- it's not
> really recoverable anyway. 

That's wrong.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23 21:18 ` Andi Kleen
@ 2014-07-23 21:52   ` Andy Lutomirski
  2014-07-23 23:10     ` Andi Kleen
  0 siblings, 1 reply; 11+ messages in thread
From: Andy Lutomirski @ 2014-07-23 21:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Borislav Petkov, linux-kernel, Linus Torvalds, X86 ML

On Wed, Jul 23, 2014 at 2:18 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> I think an MCE just might be okay -- it's not
>> really recoverable anyway.
>
> That's wrong.

I think that, other than broadcast MCEs, #MC that hits in kernel mode
is non-recoverable, or at least can't safely be recovered.  (There's a
separate APIC interrupt for recoverable errors, I think, but that's a
much saner interface.)

Regardless, I put in a fixup in the patches I sent out -- they should
be just as safe as existing code if a #MC hits right after sti.  I
have no idea how to test that, though...

--Andy

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only

-- 
Andy Lutomirski
AMA Capital Management, LLC

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23 21:52   ` Andy Lutomirski
@ 2014-07-23 23:10     ` Andi Kleen
  2014-07-24 22:15       ` H. Peter Anvin
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2014-07-23 23:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, H. Peter Anvin, Borislav Petkov, linux-kernel,
	Linus Torvalds, X86 ML

> I think that, other than broadcast MCEs, #MC that hits in kernel mode
> is non-recoverable, or at least can't safely be recovered.  (There's a
> separate APIC interrupt for recoverable errors, I think, but that's a
> much saner interface.)

There are multiple valid cases where #MC can return.

-Andi

* Re: STI architectural question (and lretq -- I'm not even kidding)
  2014-07-23 23:10     ` Andi Kleen
@ 2014-07-24 22:15       ` H. Peter Anvin
  0 siblings, 0 replies; 11+ messages in thread
From: H. Peter Anvin @ 2014-07-24 22:15 UTC (permalink / raw)
  To: Andi Kleen, Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Linus Torvalds, X86 ML

On 07/23/2014 04:10 PM, Andi Kleen wrote:
>> I think that, other than broadcast MCEs, #MC that hits in kernel mode
>> is non-recoverable, or at least can't safely be recovered.  (There's a
>> separate APIC interrupt for recoverable errors, I think, but that's a
>> much saner interface.)
> 
> There are multiple valid cases where #MC can return.
> 

Indeed, not just broadcast #MC.

	-hpa


