linux-kernel.vger.kernel.org archive mirror
* [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
@ 2014-04-11 17:36 tip-bot for H. Peter Anvin
  2014-04-11 18:12 ` Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 136+ messages in thread
From: tip-bot for H. Peter Anvin @ 2014-04-11 17:36 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, torvalds, stable, tglx, hpa

Commit-ID:  b3b42ac2cbae1f3cecbb6229964a4d48af31d382
Gitweb:     http://git.kernel.org/tip/b3b42ac2cbae1f3cecbb6229964a4d48af31d382
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Sun, 16 Mar 2014 15:31:54 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 11 Apr 2014 10:10:09 -0700

x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

The IRET instruction, when returning to a 16-bit segment, only
restores the bottom 16 bits of the user space stack pointer.  We have
a software workaround for that ("espfix") for the 32-bit kernel, but
it relies on a nonzero stack segment base which is not available in
64-bit mode.

Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
(no V86 mode), and most (if not quite all) 64-bit processors support
virtualization for the users who really need it, simply reject
attempts at creating a 16-bit segment when running on top of a 64-bit
kernel.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
Cc: <stable@vger.kernel.org>
---
 arch/x86/kernel/ldt.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ebc9873..af1d14a 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -229,6 +229,17 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 		}
 	}
 
+	/*
+	 * On x86-64 we do not support 16-bit segments due to
+	 * IRET leaking the high bits of the kernel stack address.
+	 */
+#ifdef CONFIG_X86_64
+	if (!ldt_info.seg_32bit) {
+		error = -EINVAL;
+		goto out_unlock;
+	}
+#endif
+
 	fill_ldt(&ldt, &ldt_info);
 	if (oldmode)
 		ldt.avl = 0;

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt:  Ban 16-bit segments on 64-bit kernels
  2014-04-11 17:36 [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels tip-bot for H. Peter Anvin
@ 2014-04-11 18:12 ` Andy Lutomirski
  2014-04-11 18:20   ` H. Peter Anvin
  2014-04-11 18:27 ` Brian Gerst
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
  2 siblings, 1 reply; 136+ messages in thread
From: Andy Lutomirski @ 2014-04-11 18:12 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, torvalds, tglx, stable, hpa, linux-tip-commits

On 04/11/2014 10:36 AM, tip-bot for H. Peter Anvin wrote:
> Commit-ID:  b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Gitweb:     http://git.kernel.org/tip/b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Author:     H. Peter Anvin <hpa@linux.intel.com>
> AuthorDate: Sun, 16 Mar 2014 15:31:54 -0700
> Committer:  H. Peter Anvin <hpa@linux.intel.com>
> CommitDate: Fri, 11 Apr 2014 10:10:09 -0700
> 
> x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
> 
> The IRET instruction, when returning to a 16-bit segment, only
> restores the bottom 16 bits of the user space stack pointer.  We have
> a software workaround for that ("espfix") for the 32-bit kernel, but
> it relies on a nonzero stack segment base which is not available in
> 64-bit mode.
> 
> Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
> (no V86 mode), and most (if not quite all) 64-bit processors support
> virtualization for the users who really need it, simply reject
> attempts at creating a 16-bit segment when running on top of a 64-bit
> kernel.
> 
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
> Link: http://lkml.kernel.org/n/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
> Cc: <stable@vger.kernel.org>

If this is what I think it is (hi, Spender), then it is probably only
useful for 3.14.y and not earlier kernels.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt:  Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:12 ` Andy Lutomirski
@ 2014-04-11 18:20   ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-11 18:20 UTC (permalink / raw)
  To: Andy Lutomirski, mingo, linux-kernel, torvalds, tglx, stable,
	hpa, linux-tip-commits

On 04/11/2014 11:12 AM, Andy Lutomirski wrote:
> 
> If this is what I think it is (hi, Spender), then it is probably only
> useful for 3.14.y and not earlier kernels.
> 

Not really.  The kernel stack address is sensitive regardless of kASLR;
in fact, it is completely orthogonal to kASLR.

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 17:36 [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels tip-bot for H. Peter Anvin
  2014-04-11 18:12 ` Andy Lutomirski
@ 2014-04-11 18:27 ` Brian Gerst
  2014-04-11 18:29   ` H. Peter Anvin
  2014-04-11 18:41   ` Linus Torvalds
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
  2 siblings, 2 replies; 136+ messages in thread
From: Brian Gerst @ 2014-04-11 18:27 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, stable, H. Peter Anvin

Is this bug really still present in modern CPUs?  This change breaks
running 16-bit apps in Wine.  I have a few really old games I like to
play on occasion, and I don't have a copy of Win 3.11 to put in a VM.

On Fri, Apr 11, 2014 at 1:36 PM, tip-bot for H. Peter Anvin
<tipbot@zytor.com> wrote:
> Commit-ID:  b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Gitweb:     http://git.kernel.org/tip/b3b42ac2cbae1f3cecbb6229964a4d48af31d382
> Author:     H. Peter Anvin <hpa@linux.intel.com>
> AuthorDate: Sun, 16 Mar 2014 15:31:54 -0700
> Committer:  H. Peter Anvin <hpa@linux.intel.com>
> CommitDate: Fri, 11 Apr 2014 10:10:09 -0700
>
> x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
>
> The IRET instruction, when returning to a 16-bit segment, only
> restores the bottom 16 bits of the user space stack pointer.  We have
> a software workaround for that ("espfix") for the 32-bit kernel, but
> it relies on a nonzero stack segment base which is not available in
> 64-bit mode.
>
> Since 16-bit support is somewhat crippled anyway on a 64-bit kernel
> (no V86 mode), and most (if not quite all) 64-bit processors support
> virtualization for the users who really need it, simply reject
> attempts at creating a 16-bit segment when running on top of a 64-bit
> kernel.
>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
> Link: http://lkml.kernel.org/n/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
> Cc: <stable@vger.kernel.org>
> ---
>  arch/x86/kernel/ldt.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index ebc9873..af1d14a 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -229,6 +229,17 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>                 }
>         }
>
> +       /*
> +        * On x86-64 we do not support 16-bit segments due to
> +        * IRET leaking the high bits of the kernel stack address.
> +        */
> +#ifdef CONFIG_X86_64
> +       if (!ldt_info.seg_32bit) {
> +               error = -EINVAL;
> +               goto out_unlock;
> +       }
> +#endif
> +
>         fill_ldt(&ldt, &ldt_info);
>         if (oldmode)
>                 ldt.avl = 0;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:27 ` Brian Gerst
@ 2014-04-11 18:29   ` H. Peter Anvin
  2014-04-11 18:35     ` Brian Gerst
  2014-04-11 21:16     ` Andy Lutomirski
  2014-04-11 18:41   ` Linus Torvalds
  1 sibling, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-11 18:29 UTC (permalink / raw)
  To: Brian Gerst, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, stable, H. Peter Anvin

On 04/11/2014 11:27 AM, Brian Gerst wrote:
> Is this bug really still present in modern CPUs?  This change breaks
> running 16-bit apps in Wine.  I have a few really old games I like to
> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.

It is not a bug, per se, but an architectural definition issue, and it
is present in all x86 processors from all vendors.

Yes, it does break running 16-bit apps in Wine, although Wine could be
modified to put 16-bit apps in a container.  However, this is at best a
marginal use case.

	-hpa


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:29   ` H. Peter Anvin
@ 2014-04-11 18:35     ` Brian Gerst
  2014-04-11 21:16     ` Andy Lutomirski
  1 sibling, 0 replies; 136+ messages in thread
From: Brian Gerst @ 2014-04-11 18:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, stable, H. Peter Anvin

On Fri, Apr 11, 2014 at 2:29 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>> Is this bug really still present in modern CPUs?  This change breaks
>> running 16-bit apps in Wine.  I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> It is not a bug, per se, but an architectural definition issue, and it
> is present in all x86 processors from all vendors.
>
> Yes, it does break running 16-bit apps in Wine, although Wine could be
> modified to put 16-bit apps in a container.  However, this is at best a
> marginal use case.


Marginal or not, it is still userspace breakage.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:27 ` Brian Gerst
  2014-04-11 18:29   ` H. Peter Anvin
@ 2014-04-11 18:41   ` Linus Torvalds
  2014-04-11 18:45     ` Brian Gerst
                       ` (2 more replies)
  1 sibling, 3 replies; 136+ messages in thread
From: Linus Torvalds @ 2014-04-11 18:41 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Thomas Gleixner, stable, H. Peter Anvin

On Fri, Apr 11, 2014 at 11:27 AM, Brian Gerst <brgerst@gmail.com> wrote:
> Is this bug really still present in modern CPUs?  This change breaks
> running 16-bit apps in Wine.  I have a few really old games I like to
> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.

Ok, so you actually do this on x86-64, and it currently works? For
some reason I thought that 16-bit windows apps already didn't work.

Because if we have working users of this, then I don't think we can do
the "we don't support 16-bit segments", or at least we need to make it
runtime configurable.

             Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:41   ` Linus Torvalds
@ 2014-04-11 18:45     ` Brian Gerst
  2014-04-11 18:50       ` Linus Torvalds
  2014-04-11 18:46     ` [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels H. Peter Anvin
  2014-04-13  2:54     ` Andi Kleen
  2 siblings, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-11 18:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Thomas Gleixner, stable, H. Peter Anvin

On Fri, Apr 11, 2014 at 2:41 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 11, 2014 at 11:27 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> Is this bug really still present in modern CPUs?  This change breaks
>> running 16-bit apps in Wine.  I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> Ok, so you actually do this on x86-64, and it currently works? For
> some reason I thought that 16-bit windows apps already didn't work.
>
> Because if we have working users of this, then I don't think we can do
> the "we don't support 16-bit segments", or at least we need to make it
> runtime configurable.
>
>              Linus

I haven't tested it recently but I do know it has worked on 64-bit
kernels.  There is no reason for it not to; the only thing not
supported in long mode is vm86.  16-bit protected mode is unchanged.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:41   ` Linus Torvalds
  2014-04-11 18:45     ` Brian Gerst
@ 2014-04-11 18:46     ` H. Peter Anvin
  2014-04-14  7:27       ` Ingo Molnar
  2014-04-13  2:54     ` Andi Kleen
  2 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-11 18:46 UTC (permalink / raw)
  To: Linus Torvalds, Brian Gerst
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Thomas Gleixner, stable

On 04/11/2014 11:41 AM, Linus Torvalds wrote:
> 
> Ok, so you actually do this on x86-64, and it currently works? For
> some reason I thought that 16-bit windows apps already didn't work.
> 

Some will work, because not all 16-bit software cares about the upper
half of ESP getting randomly corrupted.

That is the "functionality bit" of the problem.  The other bit, of
course, is that that random corruption is the address of the kernel stack.

> Because if we have working users of this, then I don't think we can do
> the "we don't support 16-bit segments", or at least we need to make it
> runtime configurable.

I'll let you pick what the policy should be here.  I personally think
that we have to be able to draw a line somewhere sometimes (Microsoft
themselves haven't supported running 16-bit binaries for several Windows
generations now), but it is your policy, not mine.

	-hpa


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:45     ` Brian Gerst
@ 2014-04-11 18:50       ` Linus Torvalds
  2014-04-12  4:44         ` Brian Gerst
  2014-04-14  7:48         ` Alexandre Julliard
  0 siblings, 2 replies; 136+ messages in thread
From: Linus Torvalds @ 2014-04-11 18:50 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Thomas Gleixner, stable, H. Peter Anvin

On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <brgerst@gmail.com> wrote:
>
> I haven't tested it recently but I do know it has worked on 64-bit
> kernels.  There is no reason for it not to, the only thing not
> supported in long mode is vm86.  16-bit protected mode is unchanged.

Afaik 64-bit windows doesn't support 16-bit binaries, so I just
assumed Wine wouldn't do it either on x86-64. Not for any real
technical reasons, though.

HOWEVER. I'd like to hear something more definitive than "I haven't
tested recently". The "we don't break user space" is about having
actual real *users*, not about test programs.

Are there people actually using 16-bit old windows programs under
wine? That's what matters.

                Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:29   ` H. Peter Anvin
  2014-04-11 18:35     ` Brian Gerst
@ 2014-04-11 21:16     ` Andy Lutomirski
  2014-04-11 21:24       ` H. Peter Anvin
  2014-04-11 21:34       ` Linus Torvalds
  1 sibling, 2 replies; 136+ messages in thread
From: Andy Lutomirski @ 2014-04-11 21:16 UTC (permalink / raw)
  To: H. Peter Anvin, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable, H. Peter Anvin

On 04/11/2014 11:29 AM, H. Peter Anvin wrote:
> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>> Is this bug really still present in modern CPUs?  This change breaks
>> running 16-bit apps in Wine.  I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
> 
> It is not a bug, per se, but an architectural definition issue, and it
> is present in all x86 processors from all vendors.
> 
> Yes, it does break running 16-bit apps in Wine, although Wine could be
> modified to put 16-bit apps in a container.  However, this is at best a
> marginal use case.

I wonder if there's an easy-ish good-enough fix:

Allocate some percpu space in the fixmap.  (OK, this is ugly, but
kvmclock already does it, so it's possible.)  To return to 16-bit
userspace, make sure interrupts are off, copy the whole iret descriptor
to the current cpu's fixmap space, change rsp to point to that space,
and then do the iret.

This won't restore the correct value to the high bits of [er]sp, but it
will at least stop leaking anything interesting to userspace.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:16     ` Andy Lutomirski
@ 2014-04-11 21:24       ` H. Peter Anvin
  2014-04-11 21:53         ` Andy Lutomirski
  2014-04-12 23:26         ` Alexander van Heukelum
  2014-04-11 21:34       ` Linus Torvalds
  1 sibling, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-11 21:24 UTC (permalink / raw)
  To: Andy Lutomirski, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable, H. Peter Anvin

On 04/11/2014 02:16 PM, Andy Lutomirski wrote:
> On 04/11/2014 11:29 AM, H. Peter Anvin wrote:
>> On 04/11/2014 11:27 AM, Brian Gerst wrote:
>>> Is this bug really still present in modern CPUs?  This change breaks
>>> running 16-bit apps in Wine.  I have a few really old games I like to
>>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>>
>> It is not a bug, per se, but an architectural definition issue, and it
>> is present in all x86 processors from all vendors.
>>
>> Yes, it does break running 16-bit apps in Wine, although Wine could be
>> modified to put 16-bit apps in a container.  However, this is at best a
>> marginal use case.
> 
> I wonder if there's an easy-ish good-enough fix:
> 
> Allocate some percpu space in the fixmap.  (OK, this is ugly, but
> kvmclock already does it, so it's possible.)  To return to 16-bit
> userspace, make sure interrupts are off, copy the whole iret descriptor
> to the current cpu's fixmap space, change rsp to point to that space,
> and then do the iret.
> 
> This won't restore the correct value to the high bits of [er]sp, but it
> will at least stop leaking anything interesting to userspace.
> 

This would fix the infoleak, at the cost of allocating a chunk of memory
for each CPU.  It doesn't fix the functionality problem.

If we're going to do a workaround I would prefer to do something that
fixes both, but it is highly nontrivial.

This is a writeup I did to a select audience before this was public:

> Hello,
> 
> It appears we have an information leak on x86-64 by which at least bits
> [31:16] of the kernel stack address leaks to user space (some silicon
> including the 64-bit Pentium 4 leaks [63:16]).  This is due to the
> behavior of IRET when returning to a 16-bit segment: IRET restores only
> the bottom 16 bits of the stack pointer.
> 
> This is known on 32 bits and we, in fact, have a workaround for it
> ("espfix") there.  We do not, however, have the equivalent on 64 bits,
> nor does it seem that it is very easy to construct a workaround (see below.)
> 
> This is both a functionality problem (16-bit code gets the upper bits of
> %esp corrupted when the kernel is invoked) and an information leak.  The
> 32-bit workaround was labeled as a fix for the functionality problem,
> but it of course also addresses the leak.
> 
> On 64 bits, the easiest mitigation seems to be to make modify_ldt()
> refuse to install a 16-bit segment when running on a 64-bit kernel.
> 16-bit support is already somewhat crippled on 64 bits since there is no
> V86 support; obviously, for "full service" support we can always set up
> a virtual machine -- most (but sadly, not all) 64-bit parts are also
> virtualization capable.
> 
> I would have suggested rejecting modify_ldt() entirely, to reduce attack
> surface, except that some early versions of 32-bit NPTL glibc use
> modify_ldt() to the exclusion of all other methods of establishing the
> thread pointer, so in order to stay compatible with those we would need
> to allow 32-bit segments via modify_ldt() still.
> 
> However, there is no doubt this will break some legitimate users of
> 16-bit segments, e.g. Wine for 16-bit Windows apps (which don't work on
> 64-bit Windows either, for what it is worth.)
> 
> We may very well have other infoleaks that dwarf this, but the kernel
> stack address is a relatively high value item for exploits.
> 
> Some workarounds I have considered:
> 
> a. Using paging in a similar way to the 32-bit segment base workaround
> 
> This one requires a very large swath of virtual user space (depending on
> allocation policy, as much as 4 GiB per CPU.)  The "per CPU" requirement
> comes in as locking is not feasible -- as we return to user space there
> is nowhere to release the lock.
> 
> b. Return to user space via compatibility mode
> 
> As the kernel lives above the 4 GiB virtual mark, a transition through
> compatibility mode is not practical.  This would require the kernel to
> reserve virtual address space below the 4 GiB mark, which may interfere
> with the application, especially an application launched as a 64-bit
> application.
> 
> c. Trampoline in kernel space
> 
> A trampoline in kernel space is not feasible since all ring transition
> instructions capable of returning to 16-bit mode require the use of the
> stack.
> 
> d. Trampoline in user space
> 
> A return to the vdso with values set up in registers r8-r15 would enable
> a trampoline in user space.  Unfortunately there is no way
> to do a far JMP entirely with register state so this would require
> touching user space memory, possibly in an unsafe manner.
> 
> The most likely variant is to use the address of the 16-bit user stack
> and simply hope that this is a safe thing to do.
> 
> This appears to be the most feasible workaround if a workaround is
> deemed necessary.
> 
> e. Transparently run 16-bit code segments inside a lightweight VMM
> 
> The complexity of this solution versus the realized value is staggering.
> It also doesn't work on non-virtualization-capable hardware (including
> running on top of a VMM which doesn't support nested virtualization.)
> 
> 	-hpa


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:16     ` Andy Lutomirski
  2014-04-11 21:24       ` H. Peter Anvin
@ 2014-04-11 21:34       ` Linus Torvalds
  1 sibling, 0 replies; 136+ messages in thread
From: Linus Torvalds @ 2014-04-11 21:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Fri, Apr 11, 2014 at 2:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I wonder if there's an easy-ish good-enough fix:

Heh. Yes. Check the thread on lkml about three weeks ago under the
subject "x86-64: Information leak: kernel stack address leaks to user
space". It had exactly that as a suggestion.

Anyway, I ended up pulling the current change - let's see if anybody even cares.

And if somebody *does* care, maybe we can just do a trivial sysctl. If
you are running 16-bit apps under wine, the default kernel setup
already stops you: the 'mmap_min_addr' being non-zero means that that
already will not run.

But yeah, I personally don't care about the high bits of rsp one whit,
since that has never worked on x86-64. But the information leak needs
to be plugged, and a percpu stack can fix that.

I'm a bit worried that a percpu stack can cause issues with NMI's,
which already have too much complexity in them, so I don't think it's
*entirely* trivial to do. And the exception that the 'iretq' can take
adds more complexity wrt kernel stack pointer games. Which is why I'm
not at all sure it's worth it.

              Linus

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:24       ` H. Peter Anvin
@ 2014-04-11 21:53         ` Andy Lutomirski
  2014-04-11 21:59           ` H. Peter Anvin
  2014-04-13  4:20           ` H. Peter Anvin
  2014-04-12 23:26         ` Alexander van Heukelum
  1 sibling, 2 replies; 136+ messages in thread
From: Andy Lutomirski @ 2014-04-11 21:53 UTC (permalink / raw)
  To: H. Peter Anvin, Andy Lutomirski, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable, H. Peter Anvin

On 04/11/2014 02:24 PM, H. Peter Anvin wrote:
> On 04/11/2014 02:16 PM, Andy Lutomirski wrote:
>> I wonder if there's an easy-ish good-enough fix:
>>
>> Allocate some percpu space in the fixmap.  (OK, this is ugly, but
>> kvmclock already does it, so it's possible.)  To return to 16-bit
>> userspace, make sure interrupts are off, copy the whole iret descriptor
>> to the current cpu's fixmap space, change rsp to point to that space,
>> and then do the iret.
>>
>> This won't restore the correct value to the high bits of [er]sp, but it
>> will at least stop leaking anything interesting to userspace.
>>
> 
> This would fix the infoleak, at the cost of allocating a chunk of memory
> for each CPU.  It doesn't fix the functionality problem.
> 
> If we're going to do a workaround I would prefer to do something that
> fixes both, but it is highly nontrivial.
> 
> This is a writeup I did to a select audience before this was public:
> 
>> Hello,
>>
>> This is both a functionality problem (16-bit code gets the upper bits of
>> %esp corrupted when the kernel is invoked) and an information leak.  The
>> 32-bit workaround was labeled as a fix for the functionality problem,
>> but it of course also addresses the leak.

How big of a functionality problem is it?  Apparently it doesn't break
16-bit code on wine.

Since the high bits of esp have been corrupted on x86_64 since the
beginning, there's no regression issue here if an eventual fix writes
less meaningful crap to those bits -- I see no real reason to try to put
the correct values in there.


>> I would have suggested rejecting modify_ldt() entirely, to reduce attack
>> surface, except that some early versions of 32-bit NPTL glibc use
>> modify_ldt() to the exclusion of all other methods of establishing the
>> thread pointer, so in order to stay compatible with those we would need
>> to allow 32-bit segments via modify_ldt() still.

I actually use modify_ldt for amusement: it's the only way I know of to
issue real 32-bit syscalls from 64-bit userspace.  Yes, this isn't
really a legitimate use case.

>>
>> a. Using paging in a similar way to the 32-bit segment base workaround
>>
>> This one requires a very large swath of virtual user space (depending on
>> allocation policy, as much as 4 GiB per CPU.)  The "per CPU" requirement
>> comes in as locking is not feasible -- as we return to user space there
>> is nowhere to release the lock.

Why not just 4k per CPU?  Write the pfn to the pte, invlpg, update rsp,
iret.  This leaks the CPU number, but that's all.

To me, this sounds like the easiest solution, so long as rsp is known to
be sufficiently far from a page boundary.

These ptes could even be read-only to limit the extra exposure to
known-address attacks.

If you want a fully correct solution, you can use a fancier allocation
policy that can fit quite a few cpus per 4G :)

>>
>> d. Trampoline in user space
>>
>> A return to the vdso with values set up in registers r8-r15 would enable
>> a trampoline in user space.  Unfortunately there is no way
>> to do a far JMP entirely with register state so this would require
>> touching user space memory, possibly in an unsafe manner.
>>
>> The most likely variant is to use the address of the 16-bit user stack
>> and simply hope that this is a safe thing to do.
>>
>> This appears to be the most feasible workaround if a workaround is
>> deemed necessary.

Eww.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:53         ` Andy Lutomirski
@ 2014-04-11 21:59           ` H. Peter Anvin
  2014-04-11 22:15             ` Andy Lutomirski
  2014-04-13  4:20           ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-11 21:59 UTC (permalink / raw)
  To: Andy Lutomirski, H. Peter Anvin, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

On 04/11/2014 02:53 PM, Andy Lutomirski wrote:
> 
> How big of a functionality problem is it?  Apparently it doesn't break
> 16-bit code on wine.
> 

It breaks *some* 16-bit code.  This is actually the reason that 32 bits
has the espfix workaround - it wasn't identified as an infoleak at the time.

> Since the high bits of esp have been corrupted on x86_64 since the
> beginning, there's no regression issue here if an eventual fix writes
> less meaningful crap to those bits -- I see no real reason to try to put
> the correct values in there.

It is a regression vs. the 32-bit kernel, and if we're going to support
16-bit code we should arguably support 16-bit code correctly.

This is actually how I stumbled onto this problem in the first place: it
broke a compiler test suite for gcc -m16 I was working on.  The
workaround for *that* was to run in a VM instead.

>>> I would have suggested rejecting modify_ldt() entirely, to reduce attack
>>> surface, except that some early versions of 32-bit NPTL glibc use
>>> modify_ldt() to the exclusion of all other methods of establishing the
>>> thread pointer, so in order to stay compatible with those we would need
>>> to allow 32-bit segments via modify_ldt() still.
> 
> I actually use modify_ldt for amusement: it's the only way I know of to
> issue real 32-bit syscalls from 64-bit userspace.  Yes, this isn't
> really a legitimate use case.

That's actually wrong on no less than two levels:

1. You can issue real 32-bit system calls from 64-bit user space simply
   by invoking int $0x80; it works in 64-bit mode as well.

2. Even if you want to be in 32-bit mode you can simply call via
   __USER32_CS, you don't need an LDT entry.

> Why not just 4k per CPU?  Write the pfn to the pte, invlpg, update rsp,
> iret.  This leaks the CPU number, but that's all.
> 
> To me, this sounds like the easiest solution, so long as rsp is known to
> be sufficiently far from a page boundary.
> 
> These ptes could even be read-only to limit the extra exposure to
> known-address attacks.
> 
> If you want a fully correct solution, you can use a fancier allocation
> policy that can fit quite a few cpus per 4G :)

It's damned hard, because you don't have a logical place to
*deallocate*.  That's what ends up killing you.

Also, you will need to port over the equivalent to the espfix recovery
code from 32 bits (what happens if IRET takes an exception), so it is
nontrivial.

>>> d. Trampoline in user space
>>>
>>> A return to the vdso with values set up in registers r8-r15 would enable
>>> a trampoline in user space.  Unfortunately there is no way
>>> to do a far JMP entirely with register state so this would require
>>> touching user space memory, possibly in an unsafe manner.
>>>
>>> The most likely variant is to use the address of the 16-bit user stack
>>> and simply hope that this is a safe thing to do.
>>>
>>> This appears to be the most feasible workaround if a workaround is
>>> deemed necessary.
> 
> Eww.

I don't think any of the options are anything but.

	-hpa





^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:59           ` H. Peter Anvin
@ 2014-04-11 22:15             ` Andy Lutomirski
  2014-04-11 22:18               ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andy Lutomirski @ 2014-04-11 22:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

On Fri, Apr 11, 2014 at 2:59 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
> On 04/11/2014 02:53 PM, Andy Lutomirski wrote:
>>
>> How big of a functionality problem is it?  Apparently it doesn't break
>> 16-bit code on wine.
>>
>
> It breaks *some* 16-bit code.  This is actually the reason that 32 bits
> has the espfix workaround - it wasn't identified as an infoleak at the time.
>
>> Since the high bits of esp have been corrupted on x86_64 since the
>> beginning, there's no regression issue here if an eventual fix writes
>> less meaningful crap to those bits -- I see no real reason to try to put
>> the correct values in there.
>
> It is a regression vs. the 32-bit kernel, and if we're going to support
> 16-bit code we should arguably support 16-bit code correctly.
>
> This is actually how I stumbled onto this problem in the first place: it
> broke a compiler test suite for gcc -m16 I was working on.  The
> workaround for *that* was to run in a VM instead.
>
>>>> I would have suggested rejecting modify_ldt() entirely, to reduce attack
>>>> surface, except that some early versions of 32-bit NPTL glibc use
>>>> modify_ldt() to exclusion of all other methods of establishing the
>>>> thread pointer, so in order to stay compatible with those we would need
>>>> to allow 32-bit segments via modify_ldt() still.
>>
>> I actually use modify_ldt for amusement: it's the only way I know of to
>> issue real 32-bit syscalls from 64-bit userspace.  Yes, this isn't
>> really a legitimate use case.
>
> That's actually wrong on no less than two levels:
>
> 1. You can issue real 32-bit system calls from 64-bit user space simply
>    by invoking int $0x80; it works in 64-bit mode as well.
>
> 2. Even if you want to be in 32-bit mode you can simply call via
>    __USER32_CS, you don't need an LDT entry.

I just looked up my hideous code.  I was doing this to test the
now-deleted int 0xcc vsyscall stuff.  I used modify_ldt because either
I didn't realize that __USER32_CS was usable or I didn't think it was
ABI.  Or I was just being silly.

But yes, breaking my hack would not matter. :)

--Andy


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 22:15             ` Andy Lutomirski
@ 2014-04-11 22:18               ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-11 22:18 UTC (permalink / raw)
  To: Andy Lutomirski, H. Peter Anvin
  Cc: Brian Gerst, Ingo Molnar, Linux Kernel Mailing List,
	Linus Torvalds, Thomas Gleixner, stable

On 04/11/2014 03:15 PM, Andy Lutomirski wrote:
> 
> I just looked up my hideous code.  I was doing this to test the
> now-deleted int 0xcc vsyscall stuff.  I used modify_ldt because either
> I didn't realize that __USER32_CS was usable or I didn't think it was
> ABI.  Or I was just being silly.
> 
> But yes, breaking my hack would not matter. :)
> 

Either way, it wouldn't break it.

	-hpa




* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:50       ` Linus Torvalds
@ 2014-04-12  4:44         ` Brian Gerst
  2014-04-12 17:18           ` H. Peter Anvin
  2014-04-14  7:48         ` Alexandre Julliard
  1 sibling, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-12  4:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Linux Kernel Mailing List,
	Thomas Gleixner, stable, H. Peter Anvin

On Fri, Apr 11, 2014 at 2:50 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>
>> I haven't tested it recently but I do know it has worked on 64-bit
>> kernels.  There is no reason for it not to, the only thing not
>> supported in long mode is vm86.  16-bit protected mode is unchanged.
>
> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
> assumed Wine wouldn't do it either on x86-64. Not for any real
> technical reasons, though.
>
> HOWEVER. I'd like to hear something more definitive than "I haven't
> tested recently". The "we don't break user space" is about having
> actual real *users*, not about test programs.
>
> Are there people actually using 16-bit old windows programs under
> wine? That's what matters.
>
>                 Linus

I just verified that the game does still run on a 64-bit kernel
(3.13.8-200.fc20.x86_64).  It needed an older version of Wine, but
that's a Wine regression and not kernel related.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12  4:44         ` Brian Gerst
@ 2014-04-12 17:18           ` H. Peter Anvin
  2014-04-12 19:35             ` Borislav Petkov
  2014-04-12 20:29             ` Brian Gerst
  0 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-12 17:18 UTC (permalink / raw)
  To: Brian Gerst, Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

So Wine regressed and no one noticed? That doesn't sound like an active user base.

On April 11, 2014 9:44:22 PM PDT, Brian Gerst <brgerst@gmail.com> wrote:
>On Fri, Apr 11, 2014 at 2:50 PM, Linus Torvalds
><torvalds@linux-foundation.org> wrote:
>> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <brgerst@gmail.com>
>wrote:
>>>
>>> I haven't tested it recently but I do know it has worked on 64-bit
>>> kernels.  There is no reason for it not to, the only thing not
>>> supported in long mode is vm86.  16-bit protected mode is unchanged.
>>
>> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
>> assumed Wine wouldn't do it either on x86-64. Not for any real
>> technical reasons, though.
>>
>> HOWEVER. I'd like to hear something more definitive than "I haven't
>> tested recently". The "we don't break user space" is about having
>> actual real *users*, not about test programs.
>>
>> Are there people actually using 16-bit old windows programs under
>> wine? That's what matters.
>>
>>                 Linus
>
>I just verified that the game does still run on a 64-bit kernel
>(3.13.8-200.fc20.x86_64).  It needed an older version of Wine, but
>that's a Wine regression and not kernel related.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 17:18           ` H. Peter Anvin
@ 2014-04-12 19:35             ` Borislav Petkov
  2014-04-12 19:44               ` H. Peter Anvin
  2014-04-12 20:29             ` Brian Gerst
  1 sibling, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-12 19:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Brian Gerst, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 10:18:25AM -0700, H. Peter Anvin wrote:
> So Wine regressed and no one noticed? That doesn't sound like an active
> user base.

Btw, wouldn't this obscure use case simply work in a KVM guest with a
kernel <= 3.14?

Because if so, we simply cut it at 3.14: everything newer has the leak
fix, and people who still want to play phone games on an x86 machine can
do so in a guest with an older kernel. Everybody's happy.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 19:35             ` Borislav Petkov
@ 2014-04-12 19:44               ` H. Peter Anvin
  2014-04-12 20:11                 ` Borislav Petkov
  2014-04-12 21:53                 ` Linus Torvalds
  0 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-12 19:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Brian Gerst, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

Run a 32-bit VM.  The 32-bit kernel does this right.

I suspect it would also work fine in a Qemu user mode guest (is this supported by KVM?), in a ReactOS VM, or any number of other combinations.

The real question is how many real users are actually affected.

On April 12, 2014 12:35:41 PM PDT, Borislav Petkov <bp@alien8.de> wrote:
>On Sat, Apr 12, 2014 at 10:18:25AM -0700, H. Peter Anvin wrote:
>> So Wine regressed and no one noticed? That doesn't sound like an
>> active user base.
>
>Btw, wouldn't this obscure use case simply work in a KVM guest with a
>kernel <= 3.14?
>
>Because if so, we simply cut it at 3.14: everything newer has the leak
>fix, and people who still want to play phone games on an x86 machine can
>do so in a guest with an older kernel. Everybody's happy.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 19:44               ` H. Peter Anvin
@ 2014-04-12 20:11                 ` Borislav Petkov
  2014-04-12 20:34                   ` Brian Gerst
  2014-04-12 21:53                 ` Linus Torvalds
  1 sibling, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-12 20:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Brian Gerst, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 12:44:42PM -0700, H. Peter Anvin wrote:
> Run a 32-bit VM.  The 32-bit kernel does this right.

Yes, even better.

> I suspect it would also work fine in a Qemu user mode guest (is
>> this supported by KVM?), in a ReactOS VM, or any number of other
>> combinations.

Right.

So basically, there are a lot of different virt scenarios which can all take
care of those use cases *without* encumbering some insane solutions on
64-bit.

> The real question is how many real users are actually affected.

And if they are, virtualize them, for chrissake. It is time we finally
used virt for maybe one of its major use cases - virtualize old/obscure
hw. It should be pretty reliable by now.

:-P

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 17:18           ` H. Peter Anvin
  2014-04-12 19:35             ` Borislav Petkov
@ 2014-04-12 20:29             ` Brian Gerst
  1 sibling, 0 replies; 136+ messages in thread
From: Brian Gerst @ 2014-04-12 20:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List,
	Thomas Gleixner, stable, H. Peter Anvin

For this particular game, not 16-bit in general.  The installer, also
16-bit, runs perfectly.  Already filed Wine bug 35977.

On Sat, Apr 12, 2014 at 1:18 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> So Wine regressed and no one noticed? That doesn't sound like an active user base.
>
> On April 11, 2014 9:44:22 PM PDT, Brian Gerst <brgerst@gmail.com> wrote:
>>On Fri, Apr 11, 2014 at 2:50 PM, Linus Torvalds
>><torvalds@linux-foundation.org> wrote:
>>> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <brgerst@gmail.com>
>>wrote:
>>>>
>>>> I haven't tested it recently but I do know it has worked on 64-bit
>>>> kernels.  There is no reason for it not to, the only thing not
>>>> supported in long mode is vm86.  16-bit protected mode is unchanged.
>>>
>>> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
>>> assumed Wine wouldn't do it either on x86-64. Not for any real
>>> technical reasons, though.
>>>
>>> HOWEVER. I'd like to hear something more definitive than "I haven't
>>> tested recently". The "we don't break user space" is about having
>>> actual real *users*, not about test programs.
>>>
>>> Are there people actually using 16-bit old windows programs under
>>> wine? That's what matters.
>>>
>>>                 Linus
>>
>>I just verified that the game does still run on a 64-bit kernel
>>(3.13.8-200.fc20.x86_64).  It needed an older version of Wine, but
>>that's a Wine regression and not kernel related.
>
> --
> Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 20:11                 ` Borislav Petkov
@ 2014-04-12 20:34                   ` Brian Gerst
  2014-04-12 20:59                     ` Borislav Petkov
  0 siblings, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-12 20:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 4:11 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Apr 12, 2014 at 12:44:42PM -0700, H. Peter Anvin wrote:
>> Run a 32-bit VM.  The 32-bit kernel does this right.
>
> Yes, even better.
>
>> I suspect it would also work fine in a Qemu user mode guest (is
>> this supported by KVM?), in a ReactOS VM, or any number of other
>> combinations.
>
> Right.
>
> So basically, there are a lot of different virt scenarios which can all take
> care of those use cases *without* encumbering some insane solutions on
> 64-bit.
>
>> The real question is how many real users are actually affected.
>
> And if they are, virtualize them, for chrissake. It is time we finally
> used virt for maybe one of its major use cases - virtualize old/obscure
> hw. It should be pretty reliable by now.
>
> :-P
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

My experience with kvm so far is that it is slow and clunky.  It may be
OK for a server environment, but interactively it's difficult to use.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 20:34                   ` Brian Gerst
@ 2014-04-12 20:59                     ` Borislav Petkov
  2014-04-12 21:13                       ` Brian Gerst
  0 siblings, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-12 20:59 UTC (permalink / raw)
  To: Brian Gerst
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 04:34:14PM -0400, Brian Gerst wrote:
>> My experience with kvm so far is that it is slow and clunky. It may be OK
> for a server environment, but interactively it's difficult to use.

Are you saying, you've run your game in a guest and perf. is sucky?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 20:59                     ` Borislav Petkov
@ 2014-04-12 21:13                       ` Brian Gerst
  2014-04-12 21:40                         ` Borislav Petkov
  0 siblings, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-12 21:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 4:59 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Apr 12, 2014 at 04:34:14PM -0400, Brian Gerst wrote:
>> My experience with kvm so far is that it is slow and clunky. It may be OK
>> for a server environment, but interactively it's difficult to use.
>
> Are you saying, you've run your game in a guest and perf. is sucky?
>

Performance is bad in general, running a 32-bit Fedora 20 guest.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 21:13                       ` Brian Gerst
@ 2014-04-12 21:40                         ` Borislav Petkov
  2014-04-14  7:21                           ` Ingo Molnar
  0 siblings, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-12 21:40 UTC (permalink / raw)
  To: Brian Gerst
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 05:13:40PM -0400, Brian Gerst wrote:
> Performance is bad in general, running a 32-bit Fedora 20 guest.

So this means you haven't tried the game in the guest yet, so we can't
know for sure whether a guest solves your problem or not?

Btw, which game is that and can I get it somewhere to try it here
locally?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 19:44               ` H. Peter Anvin
  2014-04-12 20:11                 ` Borislav Petkov
@ 2014-04-12 21:53                 ` Linus Torvalds
  2014-04-12 22:25                   ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2014-04-12 21:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 12:44 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Run a 32-bit VM.  The 32-bit kernel does this right.

I really don't think that's the answer.

If people really run these 16-bit programs, we need to allow it.
Clearly it used to work.

Just make the unconditional "don't allow 16-bit segments" be a sysctl entry.

          Linus


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 21:53                 ` Linus Torvalds
@ 2014-04-12 22:25                   ` H. Peter Anvin
  2014-04-13  2:56                     ` Andi Kleen
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-12 22:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On 04/12/2014 02:53 PM, Linus Torvalds wrote:
> On Sat, Apr 12, 2014 at 12:44 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> Run a 32-bit VM.  The 32-bit kernel does this right.
> 
> I really don't think that's the answer.
> 
> If people really run these 16-bit programs, we need to allow it.
> Clearly it used to work.
> 
> Just make the unconditional "don't allow 16-bit segments" be a sysctl entry.
> 

Well, is there more than one user, really... that's my question.

But yes, we can make it configurable, but the default should almost
certainly be off.

	-hpa




* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:24       ` H. Peter Anvin
  2014-04-11 21:53         ` Andy Lutomirski
@ 2014-04-12 23:26         ` Alexander van Heukelum
  2014-04-12 23:31           ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Alexander van Heukelum @ 2014-04-12 23:26 UTC (permalink / raw)
  To: H. Peter Anvin, Andy Lutomirski, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

Hi,

> This is a writeup I did to a select audience before this was public:

I'd like to add an option d.2 for your consideration. Can you think of a
fundamental problem with it?

Greetings,
    Alexander

> > Some workarounds I have considered:
> > 
> > a. Using paging in a similar way to the 32-bit segment base workaround
> > 
> > This one requires a very large swath of virtual user space (depending on
> > allocation policy, as much as 4 GiB per CPU.)  The "per CPU" requirement
> > comes in as locking is not feasible -- as we return to user space there
> > is nowhere to release the lock.
> > 
> > b. Return to user space via compatibility mode
> > 
> > As the kernel lives above the 4 GiB virtual mark, a transition through
> > compatibility mode is not practical.  This would require the kernel to
> > reserve virtual address space below the 4 GiB mark, which may interfere
> > with the application, especially an application launched as a 64-bit
> > application.
> > 
> > c. Trampoline in kernel space
> > 
> > A trampoline in kernel space is not feasible since all ring transition
> > instructions capable of returning to 16-bit mode require the use of the
> > stack.

"16 bit mode" -> "a mode with 16-bit stack"

> > d. Trampoline in user space
> > 
> > A return to the vdso with values set up in registers r8-r15 would enable
> > a trampoline in user space.  Unfortunately there is no way
> > to do a far JMP entirely with register state so this would require
> > touching user space memory, possibly in an unsafe manner.

d.2. trampoline in user space via long mode

Return from the kernel to a user space trampoline via long mode.
The kernel changes the stack frame just before executing the iret
instruction. (the CS and RIP slots are set to run the trampoline code,
where CS is a long mode segment.) The trampoline code in userspace
is set up to this single instruction: a far jump to the final CS:EIP
(compatibility mode).

Because the IRET is now returning to long mode, all registers are
restored fully. The stack cannot be used at this point, but the far
jump doesn't need stack and it will/should make the stack valid
immediately after execution. The IRET enables interrupts, so the
far jump is in the interrupt shadow: it won't be seen, unless it causes
an exception.
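For concreteness, here is a pseudocode-style assembly sketch of this d.2 trampoline (illustrative only: the labels, the far-pointer layout, and how the kernel fills it in are my assumptions, not part of the proposal):

```asm
# Kernel side (conceptual): before the final IRET, rewrite the user
# frame so that CS:RIP point at a 64-bit trampoline in user space.

# User-space trampoline, entered in long mode with all registers
# already restored by IRET; it must not touch the 16-bit stack:
trampoline:
        ljmpl   *farptr(%rip)   # indirect far jump through an m16:32
                                # memory operand; no stack access
farptr:
        .long   0               # target EIP (filled in beforehand)
        .word   0               # target CS: a compatibility-mode selector
```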

> > The most likely variant is to use the address of the 16-bit user stack
> > and simply hope that this is a safe thing to do.
> > 
> > This appears to be the most feasible workaround if a workaround is
> > deemed necessary.
> > 
> > e. Transparently run 16-bit code segments inside a lightweight VMM

"16-bit code" -> "code with 16-bit stack"

> > The complexity of this solution versus the realized value is staggering.
> > It also doesn't work on non-virtualization-capable hardware (including
> > running on top of a VMM which doesn't support nested virtualization.)
> > 
> > 	-hpa
> 


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 23:26         ` Alexander van Heukelum
@ 2014-04-12 23:31           ` H. Peter Anvin
  2014-04-12 23:49             ` Alexander van Heukelum
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-12 23:31 UTC (permalink / raw)
  To: Alexander van Heukelum, Andy Lutomirski, Brian Gerst,
	Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, stable

On 04/12/2014 04:26 PM, Alexander van Heukelum wrote:
>>>
>>> c. Trampoline in kernel space
>>>
>>> A trampoline in kernel space is not feasible since all ring transition
>>> instructions capable of returning to 16-bit mode require the use of the
>>> stack.
> 
> "16 bit mode" -> "a mode with 16-bit stack"

Yes... I believe it is the SS.B bit that is relevant, not CS.B (although
I haven't confirmed that experimentally.)  Not that that helps one iota,
as far as I can tell.

>>> d. Trampoline in user space
>>>
>>> A return to the vdso with values set up in registers r8-r15 would enable
>>> a trampoline in user space.  Unfortunately there is no way
>>> to do a far JMP entirely with register state so this would require
>>> touching user space memory, possibly in an unsafe manner.
> 
> d.2. trampoline in user space via long mode
> 
> Return from the kernel to a user space trampoline via long mode.
> The kernel changes the stack frame just before executing the iret
> instruction. (the CS and RIP slots are set to run the trampoline code,
> where CS is a long mode segment.) The trampoline code in userspace
> is set up to this single instruction: a far jump to the final CS:EIP
> (compatibility mode).

This still requires user space memory that the kernel can write to.
Long mode is actually exactly identical to what I was suggesting above,
except that I would avoid using self-modifying code in favor of just
parameterization using the high registers.

	-hpa



* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 23:31           ` H. Peter Anvin
@ 2014-04-12 23:49             ` Alexander van Heukelum
  2014-04-13  0:03               ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Alexander van Heukelum @ 2014-04-12 23:49 UTC (permalink / raw)
  To: H. Peter Anvin, Andy Lutomirski, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

On Sun, Apr 13, 2014, at 1:31, H. Peter Anvin wrote:
> >>> d. Trampoline in user space
> >>>
> >>> A return to the vdso with values set up in registers r8-r15 would enable
> >>> a trampoline in user space.  Unfortunately there is no way
> >>> to do a far JMP entirely with register state so this would require
> >>> touching user space memory, possibly in an unsafe manner.
> > 
> > d.2. trampoline in user space via long mode
> > 
> > Return from the kernel to a user space trampoline via long mode.
> > The kernel changes the stack frame just before executing the iret
> > instruction. (the CS and RIP slots are set to run the trampoline code,
> > where CS is a long mode segment.) The trampoline code in userspace
> > is set up to this single instruction: a far jump to the final CS:EIP
> > (compatibility mode).
> 
> This still requires user space memory that the kernel can write to.
> Long mode is actually exactly identical to what I was suggesting above,
> except that I would avoid using self-modifying code in favor of just
> parameterization using the high registers.

No self modifying code... The far jump must be in the indirect form
anyhow. The CS:EIP must be accessible from user mode, but not
necessarily from compatibility mode. So the trampoline (the jump)
and data (CS:EIP) can live pretty much anywhere in virtual memory.
But indeed, I see what you meant now.

Greetings,
   Alexander

> 
> 	-hpa
> 


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 23:49             ` Alexander van Heukelum
@ 2014-04-13  0:03               ` H. Peter Anvin
  2014-04-13  1:25                 ` Andy Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-13  0:03 UTC (permalink / raw)
  To: Alexander van Heukelum, Andy Lutomirski, Brian Gerst,
	Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds,
	Thomas Gleixner, stable

On 04/12/2014 04:49 PM, Alexander van Heukelum wrote:
> On Sun, Apr 13, 2014, at 1:31, H. Peter Anvin wrote:
>>>>> d. Trampoline in user space
>>>>>
>>>>> A return to the vdso with values set up in registers r8-r15 would enable
>>>>> a trampoline in user space.  Unfortunately there is no way
>>>>> to do a far JMP entirely with register state so this would require
>>>>> touching user space memory, possibly in an unsafe manner.
>>>
>>> d.2. trampoline in user space via long mode
>>>
>>> Return from the kernel to a user space trampoline via long mode.
>>> The kernel changes the stack frame just before executing the iret
>>> instruction. (the CS and RIP slots are set to run the trampoline code,
>>> where CS is a long mode segment.) The trampoline code in userspace
>>> is set up to this single instruction: a far jump to the final CS:EIP
>>> (compatibility mode).
>>
>> This still requires user space memory that the kernel can write to.
>> Long mode is actually exactly identical to what I was suggesting above,
>> except that I would avoid using self-modifying code in favor of just
>> parameterization using the high registers.
> 
> No self modifying code... The far jump must be in the indirect form
> anyhow. The CS:EIP must be accessible from user mode, but not
> necessarily from compatibility mode. So the trampoline (the jump)
> and data (CS:EIP) can live pretty much anywhere in virtual memory.
> But indeed, I see what you meant now.
> 

This is, in fact, exactly what I was suggesting, except that data
is passed directly in memory rather than in a register and letting user
space sort it out (this could be in the vdso, but the vdso may be > 4 GB
so it has to be in 64-bit mode until the last instruction.)  The
difference isn't huge; mostly an implementation detail.

A signal arriving while in the user space trampoline could seriously
complicate life.

	-hpa



* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-13  0:03               ` H. Peter Anvin
@ 2014-04-13  1:25                 ` Andy Lutomirski
  2014-04-13  1:29                   ` Andy Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: Andy Lutomirski @ 2014-04-13  1:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexander van Heukelum, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

On Sat, Apr 12, 2014 at 5:03 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> No self modifying code... The far jump must be in the indirect form
>> anyhow. The CS:EIP must be accessible from user mode, but not
>> necessarily from compatibility mode. So the trampoline (the jump)
>> and data (CS:EIP) can live pretty much anywhere in virtual memory.
>> But indeed, I see what you meant now.
>>
>
> This is, in fact, exactly then what I was suggesting, except that data
> is passed directly in memory rather than in a register and letting user
> space sort it out (this could be in the vdso, but the vdso may be > 4 GB
> so it has to be in 64-bit mode until the last instruction.)  The
> difference isn't huge; mostly an implementation detail.

I'm a bit confused as to exactly what everyone is suggesting.  I don't
think there's any instruction that can do a direct far jump to an
address stored in a register.

ISTM it does matter whether SS or CS is the offending selector.  If
it's SS, then the possible trampoline sequences are:

MOV SS, ??? / POP SS / LSS
JMP/RET

or

IRET (!)


If it's CS, then we just need a far JMP or a RET or an IRET.  The far
JMP is kind of nice since we can at least use RIP-relative addressing.

What are the interrupt shadow rules?  I thought IRET did not block interrupts.

>
> A signal arriving while in the user space trampoline could seriously
> complicate life.

Agreed.

Note that we're not really guaranteed to have a trampoline at all.
The vdso isn't there in CONFIG_COMPAT_VDSO mode, although the number
of users of this "feature" on OpenSUSE 9 is probably zero.

--Andy

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-13  1:25                 ` Andy Lutomirski
@ 2014-04-13  1:29                   ` Andy Lutomirski
  2014-04-13  3:00                     ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andy Lutomirski @ 2014-04-13  1:29 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Alexander van Heukelum, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

On Sat, Apr 12, 2014 at 6:25 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sat, Apr 12, 2014 at 5:03 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> A signal arriving while in the user space trampoline could seriously
>> complicate life.
>
> Agreed.

Maybe I don't agree.  Have signals ever worked sensibly when delivered
to a task running on an unexpected stack or code segment?

--Andy

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:41   ` Linus Torvalds
  2014-04-11 18:45     ` Brian Gerst
  2014-04-11 18:46     ` [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels H. Peter Anvin
@ 2014-04-13  2:54     ` Andi Kleen
  2 siblings, 0 replies; 136+ messages in thread
From: Andi Kleen @ 2014-04-13  2:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Ingo Molnar, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Apr 11, 2014 at 11:27 AM, Brian Gerst <brgerst@gmail.com> wrote:
>> Is this bug really still present in modern CPUs?  This change breaks
>> running 16-bit apps in Wine.  I have a few really old games I like to
>> play on occasion, and I don't have a copy of Win 3.11 to put in a VM.
>
> Ok, so you actually do this on x86-64, and it currently works? For
> some reason I thought that 16-bit windows apps already didn't work.

No, it always worked. I spent some time on this early in the x86-64 port
and it flushed out some bugs in the early segment handling. x86-64
is perfectly compatible with this.

I was also always proud that 64-bit Linux was more compatible with old
Windows binaries than Win64...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 22:25                   ` H. Peter Anvin
@ 2014-04-13  2:56                     ` Andi Kleen
  2014-04-13  3:02                       ` H. Peter Anvin
  2014-04-13  3:13                       ` Linus Torvalds
  0 siblings, 2 replies; 136+ messages in thread
From: Andi Kleen @ 2014-04-13  2:56 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Borislav Petkov, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

"H. Peter Anvin" <hpa@zytor.com> writes:
>
> But yes, we can make it configurable, but the default should almost
> certainly be off.

Why? Either it works or it doesn't.

If it works it doesn't make any sense to have a sysctl.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-13  1:29                   ` Andy Lutomirski
@ 2014-04-13  3:00                     ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-13  3:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexander van Heukelum, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable

I would think any sensible application with 16-bit segments would be using sigaltstack.  Does anyone know what Wine does?

On April 12, 2014 6:29:11 PM PDT, Andy Lutomirski <luto@amacapital.net> wrote:
>On Sat, Apr 12, 2014 at 6:25 PM, Andy Lutomirski <luto@amacapital.net>
>wrote:
>> On Sat, Apr 12, 2014 at 5:03 PM, H. Peter Anvin <hpa@zytor.com>
>wrote:
>>> A signal arriving while in the user space trampoline could seriously
>>> complicate life.
>>
>> Agreed.
>
>Maybe I don't agree.  Have signals ever worked sensibly when delivered
>to a task running on an unexpected stack or code segment?
>
>--Andy

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-13  2:56                     ` Andi Kleen
@ 2014-04-13  3:02                       ` H. Peter Anvin
  2014-04-13  3:13                       ` Linus Torvalds
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-13  3:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Borislav Petkov, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

It leaks security-sensitive information (kernel stack address bits) to userspace and corrupts the upper half of ESP, because it lacks the equivalent of the espfix workaround.
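As a toy model of the misbehavior being described (illustrative addresses only, not kernel code): IRET to a 16-bit stack segment restores only the low 16 bits of ESP, so the upper half keeps whatever the kernel stack pointer held.

```python
def iret_to_16bit_ss(kernel_esp: int, user_sp: int) -> int:
    """Model of the IRET misfeature: returning to a 16-bit stack
    segment restores only ESP[15:0] from the user frame, so
    ESP[31:16] still holds kernel stack address bits -- both an
    info leak and corruption from the application's viewpoint."""
    return (kernel_esp & 0xFFFF0000) | (user_sp & 0xFFFF)

# Hypothetical values purely for illustration:
kernel_esp = 0x88005F24   # low 32 bits of a kernel stack pointer
user_sp = 0x1FF6          # what the 16-bit app had in SP
print(hex(iret_to_16bit_ss(kernel_esp, user_sp)))  # -> 0x88001ff6
```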

On April 12, 2014 7:56:48 PM PDT, Andi Kleen <andi@firstfloor.org> wrote:
>"H. Peter Anvin" <hpa@zytor.com> writes:
>>
>> But yes, we can make it configurable, but the default should almost
>> certainly be off.
>
>Why? Either it works or it doesn't.
>
>If it works it doesn't make any sense to have a sysctl.
>
>-Andi

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-13  2:56                     ` Andi Kleen
  2014-04-13  3:02                       ` H. Peter Anvin
@ 2014-04-13  3:13                       ` Linus Torvalds
  1 sibling, 0 replies; 136+ messages in thread
From: Linus Torvalds @ 2014-04-13  3:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Borislav Petkov, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Sat, Apr 12, 2014 at 7:56 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Why? Either it works or it doesn't.
>
> If it works it doesn't make any sense to have a sysctl.

BS.

It "works" exactly like mmap() at NULL "works".

It is a potential security leak, because x86-64 screwed up the
architecture definition in this area. So it should definitely be
disabled by default, exactly like mmap_min_addr is non-zero by
default.

            Linus

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 21:53         ` Andy Lutomirski
  2014-04-11 21:59           ` H. Peter Anvin
@ 2014-04-13  4:20           ` H. Peter Anvin
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-13  4:20 UTC (permalink / raw)
  To: Andy Lutomirski, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Linus Torvalds, Thomas Gleixner,
	stable, H. Peter Anvin

On 04/11/2014 02:53 PM, Andy Lutomirski wrote:
> 
> If you want a fully correct solution, you can use a fancier allocation
> policy that can fit quite a few cpus per 4G :)
> 

The more I think about this, the more I think it might actually be a
reasonable option, *IF* someone is willing to deal with actually
implementing it.

The difference versus my "a" alternative is that rather than mapping the
existing kernel stack into an alternate part of the address space, we
would have a series of ministacks, each only large enough to handle the
IRET data *and* any exceptions that IRET may throw, until we can switch
back to the real kernel stack.  Tests would have to be added to the
appropriate exception paths, as early as possible.  We would then *copy*
the IRET data to the ministack before returning.  Each ministack would
be mapped 65536 times.

If we can get away with 64 bytes per CPU, then we can get away with 4
GiB of address space per 1024 CPUs, so if MAX_CPUS is 16384 we would
need 64 GiB of address space... which is not unreasonable on 64 bits.
The total memory consumption would be about 81 bytes per CPU for the
ministacks plus page tables (just over 16K per 1K CPUs.)  Again, fairly
reasonable, but a *lot* of complexity.
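The address-space arithmetic above can be checked with a quick sketch (using the numbers stated in this mail, not kernel constants):

```python
# Quick check of the ministack address-space arithmetic:
STACK_SIZE = 64          # bytes per CPU ministack
ALIASES = 65536          # each ministack is mapped 65536 times

va_per_cpu = STACK_SIZE * ALIASES        # 4 MiB of address space per CPU
assert va_per_cpu == 4 << 20

va_per_1k_cpus = 1024 * va_per_cpu       # 4 GiB per 1024 CPUs
assert va_per_1k_cpus == 4 << 30

va_for_max_cpus = 16384 * va_per_cpu     # 64 GiB if MAX_CPUS is 16384
assert va_for_max_cpus == 64 << 30

print(va_for_max_cpus >> 30, "GiB")      # -> 64 GiB
```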

	-hpa


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-12 21:40                         ` Borislav Petkov
@ 2014-04-14  7:21                           ` Ingo Molnar
  2014-04-14  9:44                             ` Borislav Petkov
  0 siblings, 1 reply; 136+ messages in thread
From: Ingo Molnar @ 2014-04-14  7:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Brian Gerst, H. Peter Anvin, Linus Torvalds,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin


* Borislav Petkov <bp@alien8.de> wrote:

> On Sat, Apr 12, 2014 at 05:13:40PM -0400, Brian Gerst wrote:
> > Performance is bad in general, running a 32-bit Fedora 20 guest.
> 
> So this means you haven't tried the game in the guest yet, so that 
> we can know for sure that a guest doesn't solve your problem or 
> what?
> 
> Btw, which game is that and can I get it somewhere to try it here 
> locally?

Apparently the game in question is "Exile: Escape from the pit":

  http://osdir.com/ml/wine-bugs/2014-04/msg01159.html

Thanks,

        Ingo


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:46     ` [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels H. Peter Anvin
@ 2014-04-14  7:27       ` Ingo Molnar
  2014-04-14 15:45         ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Ingo Molnar @ 2014-04-14  7:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Brian Gerst, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, stable


* H. Peter Anvin <hpa@linux.intel.com> wrote:

> On 04/11/2014 11:41 AM, Linus Torvalds wrote:
> > 
> > Ok, so you actually do this on x86-64, and it currently works? For
> > some reason I thought that 16-bit windows apps already didn't work.
> > 
> 
> Some will work, because not all 16-bit software care about the upper
> half of ESP getting randomly corrupted.
> 
> That is the "functionality bit" of the problem.  The other bit, of
> course, is that that random corruption is the address of the kernel stack.
> 
> > Because if we have working users of this, then I don't think we can do
> > the "we don't support 16-bit segments", or at least we need to make it
> > runtime configurable.
> 
> I'll let you pick what the policy should be here.  I personally 
> think that we have to be able to draw a line somewhere sometimes 
> (Microsoft themselves haven't supported running 16-bit binaries for 
> several Windows generations now), but it is your policy, not mine.

I think the mmap_min_addr model works pretty well:

 - it defaults to secure

 - allow a security policy to grant an exception to a known package, 
   built by the distro

 - end user can also grant an exception

This essentially punts any 'makes the system less secure' exceptions 
to the distro and the end user.

Thanks,

	Ingo

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-11 18:50       ` Linus Torvalds
  2014-04-12  4:44         ` Brian Gerst
@ 2014-04-14  7:48         ` Alexandre Julliard
  2014-05-07  9:18           ` Sven Joachim
  1 sibling, 1 reply; 136+ messages in thread
From: Alexandre Julliard @ 2014-04-14  7:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Brian Gerst, Ingo Molnar, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>
>> I haven't tested it recently but I do know it has worked on 64-bit
>> kernels.  There is no reason for it not to, the only thing not
>> supported in long mode is vm86.  16-bit protected mode is unchanged.
>
> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
> assumed Wine wouldn't do it either on x86-64. Not for any real
> technical reasons, though.
>
> HOWEVER. I'd like to hear something more definitive than "I haven't
> tested recently". The "we don't break user space" is about having
> actual real *users*, not about test programs.
>
> Are there people actually using 16-bit old windows programs under
> wine? That's what matters.

Yes, there is still a significant number of users, and we still
regularly get bug reports about specific 16-bit apps. It would be really
nice if we could continue to support them on x86-64, particularly since
Microsoft doesn't ;-)

-- 
Alexandre Julliard
julliard@winehq.org

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-14  7:21                           ` Ingo Molnar
@ 2014-04-14  9:44                             ` Borislav Petkov
  2014-04-14  9:47                               ` Ingo Molnar
  0 siblings, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-14  9:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Brian Gerst, H. Peter Anvin, Linus Torvalds,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On Mon, Apr 14, 2014 at 09:21:13AM +0200, Ingo Molnar wrote:
> Apparently the game in question is "Exile: Escape from the pit":
> 
>   http://osdir.com/ml/wine-bugs/2014-04/msg01159.html

Ah, thanks.

Well, FWIW, you can get the game for free:

http://www.spiderwebsoftware.com/exile/winexile.html

I did run it on an old windoze guest I had lying around. Performance
feels like native considering this game is from the previous century :-)

https://en.wikipedia.org/wiki/Exile_%28video_game_series%29

Wikipedia says you can get it on steam too, which runs linux and stuff.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-14  9:44                             ` Borislav Petkov
@ 2014-04-14  9:47                               ` Ingo Molnar
  0 siblings, 0 replies; 136+ messages in thread
From: Ingo Molnar @ 2014-04-14  9:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Brian Gerst, H. Peter Anvin, Linus Torvalds,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin


* Borislav Petkov <bp@alien8.de> wrote:

> On Mon, Apr 14, 2014 at 09:21:13AM +0200, Ingo Molnar wrote:
> > Apparently the game in question is "Exile: Escape from the pit":
> > 
> >   http://osdir.com/ml/wine-bugs/2014-04/msg01159.html
> 
> Ah, thanks.
> 
> Well, FWIW, you can get the game for free:
> 
> http://www.spiderwebsoftware.com/exile/winexile.html
> 
> I did run it on an old windoze guest I had lying around. Performance
> feels like native considering this game is from the previous century :-)

In fact it's from the previous millennium :-)

Thanks,

	Ingo

* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-14  7:27       ` Ingo Molnar
@ 2014-04-14 15:45         ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-14 15:45 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin
  Cc: Linus Torvalds, Brian Gerst, Linux Kernel Mailing List,
	Thomas Gleixner, stable

For both of these, though, it is really kind of broken that it is a global switch, whereas typically only one application on the whole system needs it, so it would be much better to have application-specific controls.  How to do that is another matter...

On April 14, 2014 12:27:56 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* H. Peter Anvin <hpa@linux.intel.com> wrote:
>
>> On 04/11/2014 11:41 AM, Linus Torvalds wrote:
>> > 
>> > Ok, so you actually do this on x86-64, and it currently works? For
>> > some reason I thought that 16-bit windows apps already didn't work.
>> > 
>> 
>> Some will work, because not all 16-bit software care about the upper
>> half of ESP getting randomly corrupted.
>> 
>> That is the "functionality bit" of the problem.  The other bit, of
>> course, is that that random corruption is the address of the kernel
>stack.
>> 
>> > Because if we have working users of this, then I don't think we can
>do
>> > the "we don't support 16-bit segments", or at least we need to make
>it
>> > runtime configurable.
>> 
>> I'll let you pick what the policy should be here.  I personally 
>> think that we have to be able to draw a line somewhere sometimes 
>> (Microsoft themselves haven't supported running 16-bit binaries for 
>> several Windows generations now), but it is your policy, not mine.
>
>I think the mmap_min_addr model works pretty well:
>
> - it defaults to secure
>
> - allow a security policy to grant an exception to a known package, 
>   built by the distro
>
> - end user can also grant an exception
>
>This essentially punts any 'makes the system less secure' exceptions 
>to the distro and the end user.
>
>Thanks,
>
>	Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

* [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-11 17:36 [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels tip-bot for H. Peter Anvin
  2014-04-11 18:12 ` Andy Lutomirski
  2014-04-11 18:27 ` Brian Gerst
@ 2014-04-21 22:47 ` H. Peter Anvin
  2014-04-21 23:19   ` Andrew Lutomirski
                     ` (4 more replies)
  2 siblings, 5 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-21 22:47 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: H. Peter Anvin, H. Peter Anvin, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Andy Lutomirski, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

This is a prototype of espfix for the 64-bit kernel.  espfix is a
workaround for the architectural definition of IRET, which fails to
restore bits [31:16] of %esp when returning to a 16-bit stack
segment.  We have a workaround for the 32-bit kernel, but that
implementation doesn't work for 64 bits.

The 64-bit implementation works like this:

Set up a ministack for each CPU, which is then mapped 65536 times
using the page tables.  This implementation uses the second-to-last
PGD slot for this; with a 64-byte espfix stack this is sufficient for
2^17 CPUs (currently we support a max of 2^13 CPUs.)
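For illustration, the interleaving that makes the aliasing work can be sketched as follows (a mirror of the espfix_base_addr() logic in the patch below; offsets here are relative to the espfix base):

```python
# Sketch of the ministack address interleaving: the low 16 bits of a
# CPU's slot select a unique 64-byte slot inside a 64 KiB window
# (which the page tables alias 65536 times), and the higher bits step
# in 4 GiB strides.
ESPFIX_STACK_SIZE = 64
PGDIR_SHIFT = 39                       # one PGD slot covers 512 GiB

def espfix_base_addr(cpu: int) -> int:
    addr = cpu * ESPFIX_STACK_SIZE
    return (addr & 0xFFFF) | ((addr & ~0xFFFF) << 16)

assert espfix_base_addr(1) - espfix_base_addr(0) == 64   # adjacent slots
assert espfix_base_addr(1024) == 1 << 32                 # next 4 GiB stride
# Everything fits in one PGD slot up to the (8 << 20)/64 == 2**17 limit
# enforced by the BUG_ON in the patch:
assert espfix_base_addr((1 << 17) - 1) < 1 << PGDIR_SHIFT
```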

64 bytes appear to be sufficient, because NMI and #MC cause a task
switch.

THIS IS A PROTOTYPE AND IS NOT COMPLETE.  We need to make sure all
code paths that can interrupt userspace execute this code.
Fortunately we never need to use the espfix stack for nested faults,
so one per CPU is guaranteed to be safe.

Furthermore, this code adds unnecessary instructions to the common
path.  For example, on exception entry we push %rdi, pop %rdi, and
then save away %rdi.  Ideally we should do this in such a way that we
avoid unnecessary swapgs, especially on the IRET path (the exception
path is going to be very rare, and so is less critical.)

Putting this version out there for people to look at/laugh at/play
with.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Andy Lutomirski <amluto@gmail.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Alexandre Julliard <julliard@winehq.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/setup.h  |   2 +
 arch/x86/kernel/Makefile      |   1 +
 arch/x86/kernel/entry_64.S    |  79 ++++++++++++++++++-
 arch/x86/kernel/espfix_64.c   | 171 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/head64.c      |   1 +
 arch/x86/kernel/ldt.c         |  11 ---
 arch/x86/kernel/smpboot.c     |   5 ++
 arch/x86/mm/dump_pagetables.c |   2 +
 init/main.c                   |   4 +
 9 files changed, 264 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/kernel/espfix_64.c

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 9264f04a4c55..84b882eebdf9 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,8 @@ extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+extern void init_espfix_cpu(void);
+
 #ifndef _SETUP
 
 /*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d96000d33a..1cc3789d99d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_X86_64)	+= sys_x86_64.o x8664_ksyms_64.o
 obj-y			+= syscall_$(BITS).o vsyscall_gtod.o
 obj-$(CONFIG_X86_64)	+= vsyscall_64.o
 obj-$(CONFIG_X86_64)	+= vsyscall_emu_64.o
+obj-$(CONFIG_X86_64)	+= espfix_64.o
 obj-$(CONFIG_SYSFS)	+= ksysfs.o
 obj-y			+= bootflag.o e820.o
 obj-y			+= pci-dma.o quirks.o topology.o kdebugfs.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..7cc01770bf21 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,7 @@
 #include <asm/asm.h>
 #include <asm/context_tracking.h>
 #include <asm/smap.h>
+#include <asm/pgtable_types.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -1040,8 +1041,16 @@ restore_args:
 	RESTORE_ARGS 1,8,1
 
 irq_return:
+	/*
+	 * Are we returning to the LDT?  Note: in 64-bit mode
+	 * SS:RSP on the exception stack is always valid.
+	 */
+	testb $4,(SS-RIP)(%rsp)
+	jnz irq_return_ldt
+
+irq_return_iret:
 	INTERRUPT_RETURN
-	_ASM_EXTABLE(irq_return, bad_iret)
+	_ASM_EXTABLE(irq_return_iret, bad_iret)
 
 #ifdef CONFIG_PARAVIRT
 ENTRY(native_iret)
@@ -1049,6 +1058,34 @@ ENTRY(native_iret)
 	_ASM_EXTABLE(native_iret, bad_iret)
 #endif
 
+irq_return_ldt:
+	pushq_cfi %rcx
+	larl (CS-RIP+8)(%rsp), %ecx
+	jnz 1f		/* Invalid segment - will #GP at IRET time */
+	testl $0x00200000, %ecx
+	jnz 1f		/* Returning to 64-bit mode */
+	larl (SS-RIP+8)(%rsp), %ecx
+	jnz 1f		/* Invalid segment - will #SS at IRET time */
+	testl $0x00400000, %ecx
+	jnz 1f		/* Not a 16-bit stack segment */
+	pushq_cfi %rsi
+	pushq_cfi %rdi
+	SWAPGS
+	movq PER_CPU_VAR(espfix_stack),%rdi
+	movl (RSP-RIP+3*8)(%rsp),%esi
+	xorw %si,%si
+	orq %rsi,%rdi
+	movq %rsp,%rsi
+	movl $8,%ecx
+	rep;movsq
+	leaq -(8*8)(%rdi),%rsp
+	SWAPGS
+	popq_cfi %rdi
+	popq_cfi %rsi
+1:
+	popq_cfi %rcx
+	jmp irq_return_iret
+
 	.section .fixup,"ax"
 bad_iret:
 	/*
@@ -1058,6 +1095,7 @@ bad_iret:
 	 * So pretend we completed the iret and took the #GPF in user mode.
 	 *
 	 * We are now running with the kernel GS after exception recovery.
+	 * Exception entry will have removed us from the espfix stack.
 	 * But error_entry expects us to have user GS to match the user %cs,
 	 * so swap back.
 	 */
@@ -1200,6 +1238,17 @@ apicinterrupt IRQ_WORK_VECTOR \
 	irq_work_interrupt smp_irq_work_interrupt
 #endif
 
+.macro espfix_adjust_stack
+	pushq_cfi %rdi
+	movq %rsp,%rdi
+	sarq $PGDIR_SHIFT,%rdi
+	cmpl $-2,%edi
+	jne 1f
+	call espfix_fix_stack
+1:
+	popq_cfi %rdi		/* Fix so we don't need this again */
+.endm
+
 /*
  * Exception entry points.
  */
@@ -1209,6 +1258,7 @@ ENTRY(\sym)
 	ASM_CLAC
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
+	espfix_adjust_stack
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 	call error_entry
@@ -1227,6 +1277,7 @@ ENTRY(\sym)
 	ASM_CLAC
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
+	espfix_adjust_stack
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 	call save_paranoid
@@ -1265,6 +1316,7 @@ ENTRY(\sym)
 	XCPT_FRAME
 	ASM_CLAC
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	espfix_adjust_stack
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 	call error_entry
@@ -1295,6 +1347,7 @@ ENTRY(\sym)
 	XCPT_FRAME
 	ASM_CLAC
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	espfix_adjust_stack
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 	call save_paranoid
@@ -1323,6 +1376,30 @@ zeroentry coprocessor_error do_coprocessor_error
 errorentry alignment_check do_alignment_check
 zeroentry simd_coprocessor_error do_simd_coprocessor_error
 
+	/*
+	 * Switch from the espfix stack to the proper stack: tricky stuff.
+	 * On the stack right now is 5 words of exception frame,
+	 * error code/oldeax, RDI, and the return value, so no additional
+	 * stack is available.
+	 *
+	 * We will always be using the user space GS on entry.
+	*/
+ENTRY(espfix_fix_stack)
+	SWAPGS
+	cld
+	movq PER_CPU_VAR(kernel_stack),%rdi
+	subq $8*8,%rdi
+	/* Use the real stack to hold these registers for now */
+	movq %rsi,-8(%rdi)
+	movq %rcx,-16(%rdi)
+	movq %rsp,%rsi
+	movl $8,%ecx
+	rep;movsq
+	leaq -(10*8)(%rdi),%rsp
+	popq %rcx
+	popq %rsi
+	SWAPGS
+	retq
 
 	/* Reload gs selector with exception handling */
 	/* edi:  new selector */
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
new file mode 100644
index 000000000000..ff8479628ff2
--- /dev/null
+++ b/arch/x86/kernel/espfix_64.c
@@ -0,0 +1,171 @@
+/* ----------------------------------------------------------------------- *
+ *
+ *   Copyright 2014 Intel Corporation; author: H. Peter Anvin
+ *
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
+ *
+ * ----------------------------------------------------------------------- */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/gfp.h>
+#include <asm/pgtable.h>
+
+#define ESPFIX_STACK_SIZE	64
+#define ESPFIX_BASE_ADDR	(-2ULL << PGDIR_SHIFT)
+
+#if CONFIG_NR_CPUS >= (8 << 20)/ESPFIX_STACK_SIZE
+# error "Need more than one PGD for the ESPFIX hack"
+#endif
+
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define ESPFIX_PGD_FLAGS (__PAGE_KERNEL & ~_PAGE_DIRTY)
+#define ESPFIX_PUD_FLAGS (__PAGE_KERNEL & ~_PAGE_DIRTY)
+#define ESPFIX_PMD_FLAGS (__PAGE_KERNEL & ~_PAGE_DIRTY)
+#define ESPFIX_PTE_FLAGS __PAGE_KERNEL
+
+/* This contains the *bottom* address of the espfix stack */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
+
+/* Initialization mutex - should this be a spinlock? */
+static DEFINE_MUTEX(espfix_init_mutex);
+
+static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
+	__aligned(PAGE_SIZE);
+
+/* This returns the bottom address of the espfix stack for a specific CPU */
+static inline unsigned long espfix_base_addr(int cpu)
+{
+	unsigned long addr = cpu * ESPFIX_STACK_SIZE;
+
+	addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
+	addr += ESPFIX_BASE_ADDR;
+	return addr;
+}
+
+#define PTE_STRIDE        (65536/PAGE_SIZE)
+#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
+#define ESPFIX_PMD_CLONES PTRS_PER_PMD
+#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
+
+/*
+ * Check to see if the espfix stuff is already installed.
+ * We do this once before grabbing the lock and, if we have to,
+ * once after.
+ */
+static bool espfix_already_there(unsigned long addr)
+{
+	const pgd_t *pgd_p;
+	pgd_t pgd;
+	const pud_t *pud_p;
+	pud_t pud;
+	const pmd_t *pmd_p;
+	pmd_t pmd;
+	const pte_t *pte_p;
+	pte_t pte;
+	int n;
+
+	pgd_p = &init_level4_pgt[pgd_index(addr)];
+	pgd = ACCESS_ONCE(*pgd_p);
+	if (!pgd_present(pgd))
+		return false;
+
+	pud_p = &espfix_pud_page[pud_index(addr)];
+	for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
+		pud = ACCESS_ONCE(pud_p[n]);
+		if (!pud_present(pud))
+			return false;
+	}
+
+	pmd_p = pmd_offset(&pud, addr);
+	for (n = 0; n < ESPFIX_PMD_CLONES; n++) {
+		pmd = ACCESS_ONCE(pmd_p[n]);
+		if (!pmd_present(pmd))
+			return false;
+	}
+
+	pte_p = pte_offset_kernel(&pmd, addr);
+	for (n = 0; n < ESPFIX_PTE_CLONES; n++) {
+		pte = ACCESS_ONCE(pte_p[n*PTE_STRIDE]);
+		if (!pte_present(pte))
+			return false;
+	}
+
+	return true;		/* All aliases present and accounted for */
+}
+
+void init_espfix_cpu(void)
+{
+	int cpu = smp_processor_id();
+	unsigned long addr;
+	pgd_t pgd, *pgd_p;
+	pud_t pud, *pud_p;
+	pmd_t pmd, *pmd_p;
+	pte_t pte, *pte_p;
+	int n;
+	void *stack_page;
+
+	cpu = smp_processor_id();
+	BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
+
+	/* We only have to do this once... */
+	if (likely(this_cpu_read(espfix_stack)))
+		return;		/* Already initialized */
+
+	addr = espfix_base_addr(cpu);
+
+	/* Did another CPU already set this up? */
+	if (likely(espfix_already_there(addr)))
+		goto done;
+
+	mutex_lock(&espfix_init_mutex);
+
+	if (unlikely(espfix_already_there(addr)))
+		goto unlock_done;
+
+	pgd_p = &init_level4_pgt[pgd_index(addr)];
+	pgd = *pgd_p;
+	if (!pgd_present(pgd)) {
+		/* This can only happen on the BSP */
+		pgd = __pgd(__pa(espfix_pud_page) |
+			    (ESPFIX_PGD_FLAGS & __supported_pte_mask));
+		set_pgd(pgd_p, pgd);
+	}
+
+	pud_p = &espfix_pud_page[pud_index(addr)];
+	pud = *pud_p;
+	if (!pud_present(pud)) {
+		pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+		pud = __pud(__pa(pmd_p) |
+			    (ESPFIX_PUD_FLAGS & __supported_pte_mask));
+		for (n = 0; n < ESPFIX_PUD_CLONES; n++)
+			set_pud(&pud_p[n], pud);
+	}
+
+	pmd_p = pmd_offset(&pud, addr);
+	pmd = *pmd_p;
+	if (!pmd_present(pmd)) {
+		pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+		pmd = __pmd(__pa(pte_p) |
+			    (ESPFIX_PMD_FLAGS & __supported_pte_mask));
+		for (n = 0; n < ESPFIX_PMD_CLONES; n++)
+			set_pmd(&pmd_p[n], pmd);
+	}
+
+	pte_p = pte_offset_kernel(&pmd, addr);
+	stack_page = (void *)__get_free_page(GFP_KERNEL);
+	pte = __pte(__pa(stack_page) |
+		    (ESPFIX_PTE_FLAGS & __supported_pte_mask));
+	for (n = 0; n < ESPFIX_PTE_CLONES; n++)
+		set_pte(&pte_p[n*PTE_STRIDE], pte);
+
+unlock_done:
+	mutex_unlock(&espfix_init_mutex);
+done:
+	this_cpu_write(espfix_stack, addr);
+	printk(KERN_ERR "espfix: Initializing espfix for cpu %d, stack @ %p\n",
+		 cpu, (const void *)addr);
+}
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 85126ccbdf6b..dc2d8afcafe9 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -32,6 +32,7 @@
  * Manage page tables very early on.
  */
 extern pgd_t early_level4_pgt[PTRS_PER_PGD];
+extern pud_t espfix_pud_page[PTRS_PER_PUD];
 extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
 static unsigned int __initdata next_early_pgt = 2;
 pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index af1d14a9ebda..ebc987398923 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -229,17 +229,6 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 		}
 	}
 
-	/*
-	 * On x86-64 we do not support 16-bit segments due to
-	 * IRET leaking the high bits of the kernel stack address.
-	 */
-#ifdef CONFIG_X86_64
-	if (!ldt_info.seg_32bit) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-#endif
-
 	fill_ldt(&ldt, &ldt_info);
 	if (oldmode)
 		ldt.avl = 0;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 34826934d4a7..ff32efb14e33 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -244,6 +244,11 @@ static void notrace start_secondary(void *unused)
 	check_tsc_sync_target();
 
 	/*
+	 * Enable the espfix hack for this CPU
+	 */
+	init_espfix_cpu();
+
+	/*
 	 * We need to hold vector_lock so that the set of online cpus
 	 * does not change while we are assigning vectors to cpus.  Holding
 	 * this lock ensures we don't half assign or remove an irq from a cpu.
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 20621d753d5f..96bf767a05fc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -327,6 +327,8 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
 	int i;
 	struct pg_state st = {};
 
+	st.to_dmesg = true;
+
 	if (pgd) {
 		start = pgd;
 		st.to_dmesg = true;
diff --git a/init/main.c b/init/main.c
index 9c7fd4c9249f..6cccf5524b3c 100644
--- a/init/main.c
+++ b/init/main.c
@@ -648,6 +648,10 @@ asmlinkage void __init start_kernel(void)
 
 	ftrace_init();
 
+#ifdef CONFIG_X86_64
+	init_espfix_cpu();
+#endif
+
 	/* Do the rest non-__init'ed, we're now alive */
 	rest_init();
 }
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
@ 2014-04-21 23:19   ` Andrew Lutomirski
  2014-04-21 23:29     ` H. Peter Anvin
  2014-04-22 11:25   ` Borislav Petkov
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-21 23:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 21, 2014 at 3:47 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
>
> The 64-bit implementation works like this:
>
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables.  This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
>
> 64 bytes appear to be sufficient, because NMI and #MC cause a task
> switch.
>
> THIS IS A PROTOTYPE AND IS NOT COMPLETE.  We need to make sure all
> code paths that can interrupt userspace execute this code.
> Fortunately we never need to use the espfix stack for nested faults,
> so one per CPU is guaranteed to be safe.
>
> Furthermore, this code adds unnecessary instructions to the common
> path.  For example, on exception entry we push %rdi, pop %rdi, and
> then save away %rdi.  Ideally we should do this in such a way that we
> avoid unnecessary swapgs, especially on the IRET path (the exception
> path is going to be very rare, and so is less critical.)
>
> Putting this version out there for people to look at/laugh at/play
> with.

Hahaha! :)

Some comments:

Does returning to 64-bit CS with 16-bit SS not need espfix?
Conversely, does 16-bit CS and 32-bit SS need espfix?


> @@ -1058,6 +1095,7 @@ bad_iret:
>          * So pretend we completed the iret and took the #GPF in user mode.
>          *
>          * We are now running with the kernel GS after exception recovery.
> +        * Exception entry will have removed us from the espfix stack.
>          * But error_entry expects us to have user GS to match the user %cs,
>          * so swap back.
>          */

What is that referring to?


> +       /*
> +        * Switch from the espfix stack to the proper stack: tricky stuff.
> +        * On the stack right now is 5 words of exception frame,
> +        * error code/oldeax, RDI, and the return value, so no additional
> +        * stack is available.
> +        *
> +        * We will always be using the user space GS on entry.
> +       */
> +ENTRY(espfix_fix_stack)
> +       SWAPGS
> +       cld
> +       movq PER_CPU_VAR(kernel_stack),%rdi
> +       subq $8*8,%rdi
> +       /* Use the real stack to hold these registers for now */
> +       movq %rsi,-8(%rdi)
> +       movq %rcx,-16(%rdi)
> +       movq %rsp,%rsi
> +       movl $8,%ecx
> +       rep;movsq
> +       leaq -(10*8)(%rdi),%rsp
> +       popq %rcx
> +       popq %rsi
> +       SWAPGS
> +       retq
>

Is it guaranteed that the userspace thread that caused this is dead?
If not, do you need to change RIP so that espfix gets invoked again
when you return from the exception?

> +
> +void init_espfix_cpu(void)
> +{
> +       int cpu = smp_processor_id();
> +       unsigned long addr;
> +       pgd_t pgd, *pgd_p;
> +       pud_t pud, *pud_p;
> +       pmd_t pmd, *pmd_p;
> +       pte_t pte, *pte_p;
> +       int n;
> +       void *stack_page;
> +
> +       cpu = smp_processor_id();
> +       BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
> +
> +       /* We only have to do this once... */
> +       if (likely(this_cpu_read(espfix_stack)))
> +               return;         /* Already initialized */
> +
> +       addr = espfix_base_addr(cpu);
> +
> +       /* Did another CPU already set this up? */
> +       if (likely(espfix_already_there(addr)))
> +               goto done;
> +
> +       mutex_lock(&espfix_init_mutex);
> +
> +       if (unlikely(espfix_already_there(addr)))
> +               goto unlock_done;

Wouldn't it be simpler to just have a single static bool to indicate
whether espfix is initialized?

Even better: why not separate the percpu init from the pagetable init
and just do the pagetable init once from main or even modify_ldt?

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 23:19   ` Andrew Lutomirski
@ 2014-04-21 23:29     ` H. Peter Anvin
  2014-04-22  0:37       ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-21 23:29 UTC (permalink / raw)
  To: Andrew Lutomirski, H. Peter Anvin
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
> 
> Hahaha! :)
> 
> Some comments:
> 
> Does returning to 64-bit CS with 16-bit SS not need espfix?

There is no such thing.  With a 64-bit CS, the flags on SS are ignored
(although you still have to have a non-null SS... the conditions are a
bit complex.)

> Conversely, does 16-bit CS and 32-bit SS need espfix?

It does not, at least to the best of my knowledge (it is controlled by
the SS size, not the CS size.)

I'm going to double-check the corner cases just out of healthy paranoia,
but I'm 98% sure this is correct (and if not, the 32-bit code needs to
be fixed, too.)

>> @@ -1058,6 +1095,7 @@ bad_iret:
>>          * So pretend we completed the iret and took the #GPF in user mode.
>>          *
>>          * We are now running with the kernel GS after exception recovery.
>> +        * Exception entry will have removed us from the espfix stack.
>>          * But error_entry expects us to have user GS to match the user %cs,
>>          * so swap back.
>>          */
> 
> What is that referring to?

It means that we have already switched back from the espfix stack to the
real stack.

>> +       /*
>> +        * Switch from the espfix stack to the proper stack: tricky stuff.
>> +        * On the stack right now is 5 words of exception frame,
>> +        * error code/oldeax, RDI, and the return value, so no additional
>> +        * stack is available.
>> +        *
>> +        * We will always be using the user space GS on entry.
>> +       */
>> +ENTRY(espfix_fix_stack)
>> +       SWAPGS
>> +       cld
>> +       movq PER_CPU_VAR(kernel_stack),%rdi
>> +       subq $8*8,%rdi
>> +       /* Use the real stack to hold these registers for now */
>> +       movq %rsi,-8(%rdi)
>> +       movq %rcx,-16(%rdi)
>> +       movq %rsp,%rsi
>> +       movl $8,%ecx
>> +       rep;movsq
>> +       leaq -(10*8)(%rdi),%rsp
>> +       popq %rcx
>> +       popq %rsi
>> +       SWAPGS
>> +       retq
>>
> 
> Is it guaranteed that the userspace thread that caused this is dead?
> If not, do you need to change RIP so that espfix gets invoked again
> when you return from the exception?

It is not guaranteed to be dead at all.  Why would you need to change
RIP, though?

>> +
>> +void init_espfix_cpu(void)
>> +{
>> +       int cpu = smp_processor_id();
>> +       unsigned long addr;
>> +       pgd_t pgd, *pgd_p;
>> +       pud_t pud, *pud_p;
>> +       pmd_t pmd, *pmd_p;
>> +       pte_t pte, *pte_p;
>> +       int n;
>> +       void *stack_page;
>> +
>> +       cpu = smp_processor_id();
>> +       BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>> +
>> +       /* We only have to do this once... */
>> +       if (likely(this_cpu_read(espfix_stack)))
>> +               return;         /* Already initialized */
>> +
>> +       addr = espfix_base_addr(cpu);
>> +
>> +       /* Did another CPU already set this up? */
>> +       if (likely(espfix_already_there(addr)))
>> +               goto done;
>> +
>> +       mutex_lock(&espfix_init_mutex);
>> +
>> +       if (unlikely(espfix_already_there(addr)))
>> +               goto unlock_done;
> 
> Wouldn't it be simpler to just have a single static bool to indicate
> whether espfix is initialized?

No, you would have to allocate memory for every possible CPU, which I
wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we have
already done that for percpu, but we *should* if we haven't yet.)

> Even better: why not separate the percpu init from the pagetable init
> and just do the pagetable init once from main or even modify_ldt?

It needs to be done once per CPU.  I wanted to do it late enough that
the page allocator is fully functional, so we don't have to do the ugly
hacks to call one allocator or another as the percpu initialization code
does (otherwise it would have made a lot of sense to co-locate with percpu.)

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 23:29     ` H. Peter Anvin
@ 2014-04-22  0:37       ` Andrew Lutomirski
  2014-04-22  0:53         ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22  0:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>>
>> Hahaha! :)
>>
>> Some comments:
>>
>> Does returning to 64-bit CS with 16-bit SS not need espfix?
>
> There is no such thing.  With a 64-bit CS, the flags on SS are ignored
> (although you still have to have a non-null SS... the conditions are a
> bit complex.)
>
>> Conversely, does 16-bit CS and 32-bit SS need espfix?
>
> It does not, at least to the best of my knowledge (it is controlled by
> the SS size, not the CS size.)
>
> I'm going to double-check the corner cases just out of healthy paranoia,
> but I'm 98% sure this is correct (and if not, the 32-bit code needs to
> be fixed, too.)
>
>>> @@ -1058,6 +1095,7 @@ bad_iret:
>>>          * So pretend we completed the iret and took the #GPF in user mode.
>>>          *
>>>          * We are now running with the kernel GS after exception recovery.
>>> +        * Exception entry will have removed us from the espfix stack.
>>>          * But error_entry expects us to have user GS to match the user %cs,
>>>          * so swap back.
>>>          */
>>
>> What is that referring to?
>
> It means that we have already switched back from the espfix stack to the
> real stack.
>
>>> +       /*
>>> +        * Switch from the espfix stack to the proper stack: tricky stuff.
>>> +        * On the stack right now is 5 words of exception frame,
>>> +        * error code/oldeax, RDI, and the return value, so no additional
>>> +        * stack is available.
>>> +        *
>>> +        * We will always be using the user space GS on entry.
>>> +       */
>>> +ENTRY(espfix_fix_stack)
>>> +       SWAPGS
>>> +       cld
>>> +       movq PER_CPU_VAR(kernel_stack),%rdi
>>> +       subq $8*8,%rdi
>>> +       /* Use the real stack to hold these registers for now */
>>> +       movq %rsi,-8(%rdi)
>>> +       movq %rcx,-16(%rdi)
>>> +       movq %rsp,%rsi
>>> +       movl $8,%ecx
>>> +       rep;movsq
>>> +       leaq -(10*8)(%rdi),%rsp
>>> +       popq %rcx
>>> +       popq %rsi
>>> +       SWAPGS
>>> +       retq
>>>
>>
>> Is it guaranteed that the userspace thread that caused this is dead?
>> If not, do you need to change RIP so that espfix gets invoked again
>> when you return from the exception?
>
> It is not guaranteed to be dead at all.  Why would you need to change
> RIP, though?

Oh.  You're not changing the RSP that you return to.  So this should be okay.

>
>>> +
>>> +void init_espfix_cpu(void)
>>> +{
>>> +       int cpu = smp_processor_id();
>>> +       unsigned long addr;
>>> +       pgd_t pgd, *pgd_p;
>>> +       pud_t pud, *pud_p;
>>> +       pmd_t pmd, *pmd_p;
>>> +       pte_t pte, *pte_p;
>>> +       int n;
>>> +       void *stack_page;
>>> +
>>> +       cpu = smp_processor_id();
>>> +       BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>>> +
>>> +       /* We only have to do this once... */
>>> +       if (likely(this_cpu_read(espfix_stack)))
>>> +               return;         /* Already initialized */
>>> +
>>> +       addr = espfix_base_addr(cpu);
>>> +
>>> +       /* Did another CPU already set this up? */
>>> +       if (likely(espfix_already_there(addr)))
>>> +               goto done;
>>> +
>>> +       mutex_lock(&espfix_init_mutex);
>>> +
>>> +       if (unlikely(espfix_already_there(addr)))
>>> +               goto unlock_done;
>>
>> Wouldn't it be simpler to just have a single static bool to indicate
>> whether espfix is initialized?
>
> No, you would have to allocate memory for every possible CPU, which I
> wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we have
> already done that for percpu, but we *should* if we haven't yet.)
>
>> Even better: why not separate the percpu init from the pagetable init
>> and just do the pagetable init once from main or even modify_ldt?
>
> It needs to be done once per CPU.  I wanted to do it late enough that
> the page allocator is fully functional, so we don't have to do the ugly
> hacks to call one allocator or another as the percpu initialization code
> does (otherwise it would have made a lot of sense to co-locate with percpu.)

Hmm.  I guess espfix_already_there isn't so bad.  Given that, in the
worst case, I think there are 16 pages allocated, it might make sense
to just track which of those 16 pages have been allocated in some
array.  That whole array would probably be shorter than the test of
espfix_already_there.  Or am I still failing to understand how this
works?

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  0:37       ` Andrew Lutomirski
@ 2014-04-22  0:53         ` H. Peter Anvin
  2014-04-22  1:06           ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22  0:53 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

Well, if 2^17 CPUs are allocated we might have 2K pages allocated.  We could easily do a bitmap here, of course.  NR_CPUS/64 is a small number, and would reduce the code complexity.

On April 21, 2014 5:37:05 PM PDT, Andrew Lutomirski <amluto@gmail.com> wrote:
>On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>>>
>>> Hahaha! :)
>>>
>>> Some comments:
>>>
>>> Does returning to 64-bit CS with 16-bit SS not need espfix?
>>
>> There is no such thing.  With a 64-bit CS, the flags on SS are
>ignored
>> (although you still have to have a non-null SS... the conditions are
>a
>> bit complex.)
>>
>>> Conversely, does 16-bit CS and 32-bit SS need espfix?
>>
>> It does not, at least to the best of my knowledge (it is controlled
>by
>> the SS size, not the CS size.)
>>
>> I'm going to double-check the corner cases just out of healthy
>paranoia,
>> but I'm 98% sure this is correct (and if not, the 32-bit code needs
>to
>> be fixed, too.)
>>
>>>> @@ -1058,6 +1095,7 @@ bad_iret:
>>>>          * So pretend we completed the iret and took the #GPF in
>user mode.
>>>>          *
>>>>          * We are now running with the kernel GS after exception
>recovery.
>>>> +        * Exception entry will have removed us from the espfix
>stack.
>>>>          * But error_entry expects us to have user GS to match the
>user %cs,
>>>>          * so swap back.
>>>>          */
>>>
>>> What is that referring to?
>>
>> It means that we have already switched back from the espfix stack to
>the
>> real stack.
>>
>>>> +       /*
>>>> +        * Switch from the espfix stack to the proper stack: tricky
>stuff.
>>>> +        * On the stack right now is 5 words of exception frame,
>>>> +        * error code/oldeax, RDI, and the return value, so no
>additional
>>>> +        * stack is available.
>>>> +        *
>>>> +        * We will always be using the user space GS on entry.
>>>> +       */
>>>> +ENTRY(espfix_fix_stack)
>>>> +       SWAPGS
>>>> +       cld
>>>> +       movq PER_CPU_VAR(kernel_stack),%rdi
>>>> +       subq $8*8,%rdi
>>>> +       /* Use the real stack to hold these registers for now */
>>>> +       movq %rsi,-8(%rdi)
>>>> +       movq %rcx,-16(%rdi)
>>>> +       movq %rsp,%rsi
>>>> +       movl $8,%ecx
>>>> +       rep;movsq
>>>> +       leaq -(10*8)(%rdi),%rsp
>>>> +       popq %rcx
>>>> +       popq %rsi
>>>> +       SWAPGS
>>>> +       retq
>>>>
>>>
>>> Is it guaranteed that the userspace thread that caused this is dead?
>>> If not, do you need to change RIP so that espfix gets invoked again
>>> when you return from the exception?
>>
>> It is not guaranteed to be dead at all.  Why would you need to change
>> RIP, though?
>
>Oh.  You're not changing the RSP that you return to.  So this should be
>okay.
>
>>
>>>> +
>>>> +void init_espfix_cpu(void)
>>>> +{
>>>> +       int cpu = smp_processor_id();
>>>> +       unsigned long addr;
>>>> +       pgd_t pgd, *pgd_p;
>>>> +       pud_t pud, *pud_p;
>>>> +       pmd_t pmd, *pmd_p;
>>>> +       pte_t pte, *pte_p;
>>>> +       int n;
>>>> +       void *stack_page;
>>>> +
>>>> +       cpu = smp_processor_id();
>>>> +       BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>>>> +
>>>> +       /* We only have to do this once... */
>>>> +       if (likely(this_cpu_read(espfix_stack)))
>>>> +               return;         /* Already initialized */
>>>> +
>>>> +       addr = espfix_base_addr(cpu);
>>>> +
>>>> +       /* Did another CPU already set this up? */
>>>> +       if (likely(espfix_already_there(addr)))
>>>> +               goto done;
>>>> +
>>>> +       mutex_lock(&espfix_init_mutex);
>>>> +
>>>> +       if (unlikely(espfix_already_there(addr)))
>>>> +               goto unlock_done;
>>>
>>> Wouldn't it be simpler to just have a single static bool to indicate
>>> whether espfix is initialized?
>>
>> No, you would have to allocate memory for every possible CPU, which I
>> wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we
>have
>> already done that for percpu, but we *should* if we haven't yet.)
>>
>>> Even better: why not separate the percpu init from the pagetable
>init
>>> and just do the pagetable init once from main or even modify_ldt?
>>
>> It needs to be done once per CPU.  I wanted to do it late enough that
>> the page allocator is fully functional, so we don't have to do the
>ugly
>> hacks to call one allocator or another as the percpu initialization
>code
>> does (otherwise it would have made a lot of sense to co-locate with
>percpu.)
>
>Hmm.  I guess espfix_already_there isn't so bad.  Given that, in the
>worst case, I think there are 16 pages allocated, it might make sense
>to just track which of those 16 pages have been allocated in some
>array.  That whole array would probably be shorter than the test of
>espfix_already_there.  Or am I still failing to understand how this
>works?
>
>--Andy

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  0:53         ` H. Peter Anvin
@ 2014-04-22  1:06           ` Andrew Lutomirski
  2014-04-22  1:14             ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22  1:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 21, 2014 at 5:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Well, if 2^17 CPUs are allocated we might have 2K pages allocated.  We could easily do a bitmap here, of course.  NR_CPUS/64 is a small number, and would reduce the code complexity.
>

Even simpler: just get rid of the check entirely.  That is, break out
of the higher level loops once one of them is set (this should be a
big speedup regardless) and don't allocate the page if the first PTE
is already pointing at something.

After all, espfix_already_there is mostly a duplicate of init_espfix_cpu.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  1:06           ` Andrew Lutomirski
@ 2014-04-22  1:14             ` H. Peter Anvin
  2014-04-22  1:28               ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22  1:14 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

I wanted to avoid the "another cpu made this allocation, now I have to free" crap, but I also didn't want to grab the lock if there was no work needed.

On April 21, 2014 6:06:19 PM PDT, Andrew Lutomirski <amluto@gmail.com> wrote:
>On Mon, Apr 21, 2014 at 5:53 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> Well, if 2^17 CPUs are allocated we might have 2K pages allocated.  We
>could easily do a bitmap here, of course.  NR_CPUS/64 is a small
>number, and would reduce the code complexity.
>>
>
>Even simpler: just get rid of the check entirely.  That is, break out
>of the higher level loops once one of them is set (this should be a
>big speedup regardless) and don't allocate the page if the first PTE
>is already pointing at something.
>
>After all, espfix_already_there is mostly a duplicate of
>init_espfix_cpu.
>
>--Andy

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  1:14             ` H. Peter Anvin
@ 2014-04-22  1:28               ` Andrew Lutomirski
  2014-04-22  1:47                 ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22  1:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 21, 2014 at 6:14 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> I wanted to avoid the "another cpu made this allocation, now I have to free" crap, but I also didn't want to grab the lock if there was no work needed.

I guess you also want to avoid bouncing all these cachelines around on
boot on big multicore machines.

I'd advocate using the bitmap approach or simplifying the existing
code.  For example:

+       for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
+               pud = ACCESS_ONCE(pud_p[n]);
+               if (!pud_present(pud))
+                       return false;
+       }

I don't see why that needs to be a loop.  How can one clone exist but
not the others?

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  1:28               ` Andrew Lutomirski
@ 2014-04-22  1:47                 ` H. Peter Anvin
  2014-04-22  1:53                   ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22  1:47 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

Race condition (although with x86 being globally ordered, it probably can't actually happen.) The bitmask is probably the way to go.

On April 21, 2014 6:28:12 PM PDT, Andrew Lutomirski <amluto@gmail.com> wrote:
>On Mon, Apr 21, 2014 at 6:14 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> I wanted to avoid the "another cpu made this allocation, now I have
>to free" crap, but I also didn't want to grab the lock if there was no
>work needed.
>
>I guess you also want to avoid bouncing all these cachelines around on
>boot on bit multicore machines.
>
>I'd advocate using the bitmap approach or simplifying the existing
>code.  For example:
>
>+       for (n = 0; n < ESPFIX_PUD_CLONES; n++) {
>+               pud = ACCESS_ONCE(pud_p[n]);
>+               if (!pud_present(pud))
>+                       return false;
>+       }
>
>I don't see why that needs to be a loop.  How can one clone exist but
>not the others?
>
>--Andy

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  1:47                 ` H. Peter Anvin
@ 2014-04-22  1:53                   ` Andrew Lutomirski
  2014-04-22 11:23                     ` Borislav Petkov
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22  1:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Race condition (although with x86 being globally ordered, it probably can't actually happen.) The bitmask is probably the way to go.

Does the race matter?  In the worst case you take the lock
unnecessarily.  But yes, the bitmask is easy.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22  1:53                   ` Andrew Lutomirski
@ 2014-04-22 11:23                     ` Borislav Petkov
  2014-04-22 14:46                       ` Borislav Petkov
  0 siblings, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-22 11:23 UTC (permalink / raw)
  To: Andrew Lutomirski, H. Peter Anvin
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 21, 2014 at 06:53:36PM -0700, Andrew Lutomirski wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> > Race condition (although with x86 being globally ordered, it probably can't actually happen.) The bitmask is probably the way to go.
> 
> Does the race matter?  In the worst case you take the lock
> unnecessarily.  But yes, the bitmask is easy.

I wonder if it would be workable to use a bit in the espfix PGD to
denote that it has been initialized already... I hear, near NX there's
some room :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
  2014-04-21 23:19   ` Andrew Lutomirski
@ 2014-04-22 11:25   ` Borislav Petkov
  2014-04-23  1:17   ` H. Peter Anvin
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 136+ messages in thread
From: Borislav Petkov @ 2014-04-22 11:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Andy Lutomirski,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

Just nitpicks below:

On Mon, Apr 21, 2014 at 03:47:52PM -0700, H. Peter Anvin wrote:
> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
> 
> The 64-bit implementation works like this:
> 
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables.  This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)

I wish we'd put this description in the code instead of in a commit
message as those can get lost in git history over time.

> 64 bytes appear to be sufficient, because NMI and #MC cause a task
> switch.
> 
> THIS IS A PROTOTYPE AND IS NOT COMPLETE.  We need to make sure all
> code paths that can interrupt userspace execute this code.
> Fortunately we never need to use the espfix stack for nested faults,
> so one per CPU is guaranteed to be safe.
> 
> Furthermore, this code adds unnecessary instructions to the common
> path.  For example, on exception entry we push %rdi, pop %rdi, and
> then save away %rdi.  Ideally we should do this in such a way that we
> avoid unnecessary swapgs, especially on the IRET path (the exception
> path is going to be very rare, and so is less critical.)
> 
> Putting this version out there for people to look at/laugh at/play
> with.
> 
> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
> Link: http://lkml.kernel.org/r/tip-kicdm89kzw9lldryb1br9od0@git.kernel.org
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Alexander van Heukelum <heukelum@fastmail.fm>
> Cc: Andy Lutomirski <amluto@gmail.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Arjan van de Ven <arjan.van.de.ven@intel.com>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Alexandre Julliard <julliard@winehq.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>

...

> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1e96c3628bf2..7cc01770bf21 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -58,6 +58,7 @@
>  #include <asm/asm.h>
>  #include <asm/context_tracking.h>
>  #include <asm/smap.h>
> +#include <asm/pgtable_types.h>
>  #include <linux/err.h>
>  
>  /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
> @@ -1040,8 +1041,16 @@ restore_args:
>  	RESTORE_ARGS 1,8,1
>  
>  irq_return:
> +	/*
> +	 * Are we returning to the LDT?  Note: in 64-bit mode
> +	 * SS:RSP on the exception stack is always valid.
> +	 */
> +	testb $4,(SS-RIP)(%rsp)
> +	jnz irq_return_ldt
> +
> +irq_return_iret:
>  	INTERRUPT_RETURN
> -	_ASM_EXTABLE(irq_return, bad_iret)
> +	_ASM_EXTABLE(irq_return_iret, bad_iret)
>  
>  #ifdef CONFIG_PARAVIRT
>  ENTRY(native_iret)
> @@ -1049,6 +1058,34 @@ ENTRY(native_iret)
>  	_ASM_EXTABLE(native_iret, bad_iret)
>  #endif
>  
> +irq_return_ldt:
> +	pushq_cfi %rcx
> +	larl (CS-RIP+8)(%rsp), %ecx
> +	jnz 1f		/* Invalid segment - will #GP at IRET time */
> +	testl $0x00200000, %ecx
> +	jnz 1f		/* Returning to 64-bit mode */
> +	larl (SS-RIP+8)(%rsp), %ecx
> +	jnz 1f		/* Invalid segment - will #SS at IRET time */

You mean " ... will #GP at IRET time"? But you're right, you're looking
at SS :-)

> +	testl $0x00400000, %ecx
> +	jnz 1f		/* Not a 16-bit stack segment */
> +	pushq_cfi %rsi
> +	pushq_cfi %rdi
> +	SWAPGS
> +	movq PER_CPU_VAR(espfix_stack),%rdi
> +	movl (RSP-RIP+3*8)(%rsp),%esi
> +	xorw %si,%si
> +	orq %rsi,%rdi
> +	movq %rsp,%rsi
> +	movl $8,%ecx
> +	rep;movsq
> +	leaq -(8*8)(%rdi),%rsp
> +	SWAPGS
> +	popq_cfi %rdi
> +	popq_cfi %rsi
> +1:
> +	popq_cfi %rcx
> +	jmp irq_return_iret
> +
>  	.section .fixup,"ax"
>  bad_iret:
>  	/*

...

> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 85126ccbdf6b..dc2d8afcafe9 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -32,6 +32,7 @@
>   * Manage page tables very early on.
>   */
>  extern pgd_t early_level4_pgt[PTRS_PER_PGD];
> +extern pud_t espfix_pud_page[PTRS_PER_PUD];

I guess you don't need the "extern" here.

>  extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
>  static unsigned int __initdata next_early_pgt = 2;
>  pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index af1d14a9ebda..ebc987398923 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -229,17 +229,6 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>  		}
>  	}
>  
> -	/*
> -	 * On x86-64 we do not support 16-bit segments due to
> -	 * IRET leaking the high bits of the kernel stack address.
> -	 */
> -#ifdef CONFIG_X86_64
> -	if (!ldt_info.seg_32bit) {
> -		error = -EINVAL;
> -		goto out_unlock;
> -	}
> -#endif
> -
>  	fill_ldt(&ldt, &ldt_info);
>  	if (oldmode)
>  		ldt.avl = 0;
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 34826934d4a7..ff32efb14e33 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -244,6 +244,11 @@ static void notrace start_secondary(void *unused)
>  	check_tsc_sync_target();
>  
>  	/*
> +	 * Enable the espfix hack for this CPU
> +	 */
> +	init_espfix_cpu();
> +
> +	/*
>  	 * We need to hold vector_lock so there the set of online cpus
>  	 * does not change while we are assigning vectors to cpus.  Holding
>  	 * this lock ensures we don't half assign or remove an irq from a cpu.
> diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
> index 20621d753d5f..96bf767a05fc 100644
> --- a/arch/x86/mm/dump_pagetables.c
> +++ b/arch/x86/mm/dump_pagetables.c
> @@ -327,6 +327,8 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
>  	int i;
>  	struct pg_state st = {};
>  
> +	st.to_dmesg = true;

Right, remove before applying :)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 11:23                     ` Borislav Petkov
@ 2014-04-22 14:46                       ` Borislav Petkov
  2014-04-22 16:03                         ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-22 14:46 UTC (permalink / raw)
  To: Andrew Lutomirski, H. Peter Anvin
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
> I wonder if it would be workable to use a bit in the espfix PGD to
> denote that it has been initialized already... I hear, near NX there's
> some room :-)

Ok, I realized this won't work when I hit send... Oh well.

Anyway, another dumb idea: have we considered making this lazy? I.e.,
preallocate pages to fit the stacks of NR_CPUS after SMP init is done, but
not set up the percpu espfix stack. Only do that in espfix_fix_stack the
first time we land there and haven't been set up yet on this CPU.

This should cover the 1% out there who still use 16-bit segments, while the
rest, who simply don't use them, get to save themselves the PT-walk in
start_secondary().

Hmmm...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 14:46                       ` Borislav Petkov
@ 2014-04-22 16:03                         ` Andrew Lutomirski
  2014-04-22 16:10                           ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 16:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 7:46 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
>> I wonder if it would be workable to use a bit in the espfix PGD to
>> denote that it has been initialized already... I hear, near NX there's
>> some room :-)
>
> Ok, I realized this won't work when I hit send... Oh well.
>
> Anyway, another dumb idea: have we considered making this lazy? I.e.,
> preallocate pages to fit the stacks of NR_CPUS after SMP init is done, but
> not set up the percpu espfix stack. Only do that in espfix_fix_stack the
> first time we land there and haven't been set up yet on this CPU.
>
> This should cover the 1% out there who still use 16-bit segments, while the
> rest, who simply don't use them, get to save themselves the PT-walk in
> start_secondary().
>
> Hmmm...

I'm going to try to do the math to see what's actually going on.

Each 4G slice contains 64kB of ministacks, which corresponds to 1024
ministacks.  Virtual addresses are divided up as:

12 bits (0..11): address within page.
9 bits (12..20): identifies the PTE within the level 1 directory
9 bits (21..29): identifies the level 1 directory (pmd?) within the
level 2 directory
9 bits (30..38): identifies the level 2 directory (pud) within the
level 3 directory

Critically, each 1024 CPUs can share the same level 1 directory --
there are just a bunch of copies of the same thing in there.
Similarly, they can share the same level 2 directory, and each slot in
that directory will point to the same level 1 directory.

For the level 3 directory, there is only one globally.  It needs 8
entries per 1024 CPUs.

I imagine there's a scalability problem here, too: it's okay if each
of a very large number of CPUs waits while shared structures are
allocated, but owners of big systems won't like it if they all
serialize on the way out.

So maybe it would make sense to refactor this into two separate
functions.  First, before we start the first non-boot CPU:

static pte_t *slice_pte_tables[NR_CPUS / 1024];
Allocate and initialize them all;

It might even make sense to do this at build time instead of run time.
 I can't imagine that parallelizing this would provide any benefit
unless it were done *very* carefully and there were hundreds of
thousands of CPUs.  At worst, we're wasting 4 bytes per CPU not
present.

Then, for the per-CPU part, have one init-once structure (please tell
me the kernel has one of these) per 64 possible CPUs.  Each CPU will
make sure that its group of 64 cpus is initialized, using the init
once mechanism, and then it will set its percpu variable accordingly.

There are only 64 CPUs per page, so mutexes may not be so bad here.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 16:03                         ` Andrew Lutomirski
@ 2014-04-22 16:10                           ` H. Peter Anvin
  2014-04-22 16:33                             ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 16:10 UTC (permalink / raw)
  To: Andrew Lutomirski, Borislav Petkov
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

Honestly, guys... you're painting the bikeshed at the moment.

Initialization is the easiest bit of all this code.  The tricky part is
*the rest of the code*, i.e. the stuff in entry_64.S.

Also, the code is butt-ugly at the moment.  Aesthetics have not been
dealt with.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 16:10                           ` H. Peter Anvin
@ 2014-04-22 16:33                             ` Andrew Lutomirski
  2014-04-22 16:43                               ` Linus Torvalds
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 16:33 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 9:10 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> Honestly, guys... you're painting the bikeshed at the moment.
>
> Initialization is the easiest bit of all this code.  The tricky part is
> *the rest of the code*, i.e. the stuff in entry_64.S.

That's because the initialization code is much simpler, so it's easy
to pick on :)  Sorry.

For the espfix_adjust_stack thing, when can it actually need to do
anything?  irqs should be off, I think, and MCE, NMI, and debug
exceptions use ist, so that leaves just #SS and #GP, I think.  How can
those actually occur?  Is there a way to trigger them deliberately
from userspace?  Why do you have three espfix_adjust_stack invocations?

What happens on the IST entries?  If I've read your patch right,
you're still switching back to the normal stack, which looks
questionable.

Also, if you want to save some register abuse on each exception entry,
could you check the saved RIP instead of the current RSP?  I.e. use
the test instruction with offset(%rsp)?  Maybe there are multiple
possible values, though, and just testing some bits doesn't help.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 16:33                             ` Andrew Lutomirski
@ 2014-04-22 16:43                               ` Linus Torvalds
  2014-04-22 17:00                                 ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2014-04-22 16:43 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>
> For the espfix_adjust_stack thing, when can it actually need to do
> anything?  irqs should be off, I think, and MCE, NMI, and debug
> exceptions use ist, so that leaves just #SS and #GP, I think.  How can
> those actually occur?  Is there a way to trigger them deliberately
> from userspace?  Why do you have three espfix_adjust_stack invocations?

Yes, you can very much trigger GP deliberately.

The way to do it is to just make an invalid segment descriptor on the
iret stack. Or make it a valid 16-bit one, but make it a code segment
for the stack pointer, or read-only, or whatever. All of which is
trivial to do with a sigreturn system call. But you can do it other
ways too - enter with a SS that is valid, but do a load_ldt() system
call that makes it invalid, so that by the time you exit it is no
longer valid etc.

There's a reason we mark that "iretq" as taking faults with that

        _ASM_EXTABLE(native_iret, bad_iret)

and that "bad_iret" creates a GP fault.

And that's a lot of kernel stack. The whole initial GP fault path,
which goes to the C code that finds the exception table etc. See
do_general_protection_fault() and fixup_exception().

                Linus


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 16:43                               ` Linus Torvalds
@ 2014-04-22 17:00                                 ` Andrew Lutomirski
  2014-04-22 17:04                                   ` Linus Torvalds
  2014-04-22 17:09                                   ` H. Peter Anvin
  0 siblings, 2 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 17:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 9:43 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>>
>> For the espfix_adjust_stack thing, when can it actually need to do
>> anything?  irqs should be off, I think, and MCE, NMI, and debug
>> exceptions use ist, so that leaves just #SS and #GP, I think.  How can
>> those actually occur?  Is there a way to trigger them deliberately
>> from userspace?  Why do you have three espfix_adjust_stack invocations?
>
> Yes, you can very much trigger GP deliberately.
>
> The way to do it is to just make an invalid segment descriptor on the
> iret stack. Or make it a valid 16-bit one, but make it a code segment
> for the stack pointer, or read-only, or whatever. All of which is
> trivial to do with a sigreturn system call. But you can do it other
> ways too - enter with a SS that is valid, but do a load_ldt() system
> call that makes it invalid, so that by the time you exit it is no
> longer valid etc.
>
> There's a reason we mark that "iretq" as taking faults with that
>
>         _ASM_EXTABLE(native_iret, bad_iret)
>
> and that "bad_iret" creates a GP fault.
>
> And that's a lot of kernel stack. The whole initial GP fault path,
> which goes to the C code that finds the exception table etc. See
> do_general_protection_fault() and fixup_exception().

My point is that it may be safe to remove the special espfix fixup
from #PF, which is probably the most performance-critical piece here,
aside from iret itself.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:00                                 ` Andrew Lutomirski
@ 2014-04-22 17:04                                   ` Linus Torvalds
  2014-04-22 17:11                                     ` Andrew Lutomirski
                                                       ` (2 more replies)
  2014-04-22 17:09                                   ` H. Peter Anvin
  1 sibling, 3 replies; 136+ messages in thread
From: Linus Torvalds @ 2014-04-22 17:04 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>
> My point is that it may be safe to remove the special espfix fixup
> from #PF, which is probably the most performance-critical piece here,
> aside from iret itself.

Actually, even that is unsafe.

Why?

The segment table is shared for a process. So you can have one thread
doing a load_ldt() that invalidates a segment, while another thread is
busy taking a page fault. The segment was valid at page fault time and
is saved on the kernel stack, but by the time the page fault returns,
it is no longer valid and the iretq will fault.

Anyway, if done correctly, this whole espfix should be totally free
for normal processes, since it should only trigger if SS is an LDT
entry (bit #2 set in the segment selector). So the normal fast-path
should just have a simple test for that.

And if you have a SS that is a descriptor in the LDT, nobody cares
about performance any more.

             Linus


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:00                                 ` Andrew Lutomirski
  2014-04-22 17:04                                   ` Linus Torvalds
@ 2014-04-22 17:09                                   ` H. Peter Anvin
  2014-04-22 17:20                                     ` Andrew Lutomirski
  1 sibling, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 17:09 UTC (permalink / raw)
  To: Andrew Lutomirski, Linus Torvalds
  Cc: Borislav Petkov, H. Peter Anvin, Linux Kernel Mailing List,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:00 AM, Andrew Lutomirski wrote:
>>
>> Yes, you can very much trigger GP deliberately.
>>
>> The way to do it is to just make an invalid segment descriptor on the
>> iret stack. Or make it a valid 16-bit one, but make it a code segment
>> for the stack pointer, or read-only, or whatever. All of which is
>> trivial to do with a sigreturn system call. But you can do it other
>> ways too - enter with a SS that is valid, but do a load_ldt() system
>> call that makes it invalid, so that by the time you exit it is no
>> longer valid etc.
>>
>> There's a reason we mark that "iretq" as taking faults with that
>>
>>         _ASM_EXTABLE(native_iret, bad_iret)
>>
>> and that "bad_iret" creates a GP fault.
>>
>> And that's a lot of kernel stack. The whole initial GP fault path,
>> which goes to the C code that finds the exception table etc. See
>> do_general_protection_fault() and fixup_exception().
> 
> My point is that it may be safe to remove the special espfix fixup
> from #PF, which is probably the most performance-critical piece here,
> aside from iret itself.
> 

It *might* even be plausible to do full manual sanitization, so that the
IRET cannot fault, but I have to admit to that being somewhat daunting,
especially given the thread/process distinction.  I wasn't actually sure
about the status of the LDT on the thread vs process scale (the GDT is
per-CPU, but has some entries that are context-switched per *thread*;
I hadn't looked at the LDT recently.)

As for Andy's questions:

> What happens on the IST entries?  If I've read your patch right,
> you're still switching back to the normal stack, which looks
> questionable.

No, in that case %rsp won't point into the espfix region, and the switch
will be bypassed.  We will resume back into the espfix region on IRET,
which is actually required e.g. if we take an NMI in the middle of the
espfix setup.

> Also, if you want to save some register abuse on each exception entry,
> could you check the saved RIP instead of the current RSP?  I.e. use
> the test instruction with offset(%rsp)?  Maybe there are multiple
> possible values, though, and just testing some bits doesn't help.

I don't see how that would work.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:04                                   ` Linus Torvalds
@ 2014-04-22 17:11                                     ` Andrew Lutomirski
  2014-04-22 17:15                                       ` H. Peter Anvin
  2014-04-22 17:19                                       ` Linus Torvalds
  2014-04-22 17:11                                     ` H. Peter Anvin
  2014-04-23  6:24                                     ` H. Peter Anvin
  2 siblings, 2 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 17:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:04 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>>
>> My point is that it may be safe to remove the special espfix fixup
>> from #PF, which is probably the most performance-critical piece here,
>> aside from iret itself.
>
> Actually, even that is unsafe.
>
> Why?
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.

Let me try that again: I think it should be safe to remove the check
for "did we fault from the espfix stack" from the #PF entry.  You can
certainly have all kinds of weird things happen on return from #PF,
but the overhead that I'm talking about is a test on exception *entry*
to see whether the fault happened on the espfix stack so that we can
switch back to running on a real stack.

If the espfix code and the iret at the end can't cause #PF, then the
check in #PF entry can be removed, I think.

>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is an LDT
> entry (bit #2 set in the segment selector). So the normal fast-path
> should just have a simple test for that.

How?  Doesn't something still need to check whether SS is funny before
doing iret?

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:04                                   ` Linus Torvalds
  2014-04-22 17:11                                     ` Andrew Lutomirski
@ 2014-04-22 17:11                                     ` H. Peter Anvin
  2014-04-22 17:26                                       ` Borislav Petkov
  2014-04-23  6:24                                     ` H. Peter Anvin
  2 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 17:11 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Lutomirski
  Cc: Borislav Petkov, H. Peter Anvin, Linux Kernel Mailing List,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:04 AM, Linus Torvalds wrote:
> 
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
> 
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is an LDT
> entry (bit #2 set in the segment selector). So the normal fast-path
> should just have a simple test for that.
> 
> And if you have a SS that is a descriptor in the LDT, nobody cares
> about performance any more.
> 

The fastpath interference is:

1. Testing for an LDT SS selector before IRET.  This is actually easier
than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.

2. Testing for an RSP inside the espfix region on exception entry, so we
can switch back the stack.  This has to be done very early in the
exception entry since the espfix stack is so small.  If NMI and #MC
didn't use IST it wouldn't work at all (but neither would SYSCALL).

#2 currently saves/restores %rdi, which is also saved further down.
This is obviously wasteful.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:11                                     ` Andrew Lutomirski
@ 2014-04-22 17:15                                       ` H. Peter Anvin
  2014-04-23  9:54                                         ` One Thousand Gnomes
  2014-04-22 17:19                                       ` Linus Torvalds
  1 sibling, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 17:15 UTC (permalink / raw)
  To: Andrew Lutomirski, Linus Torvalds
  Cc: Borislav Petkov, H. Peter Anvin, Linux Kernel Mailing List,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:11 AM, Andrew Lutomirski wrote:
>>
>> Anyway, if done correctly, this whole espfix should be totally free
>> for normal processes, since it should only trigger if SS is an LDT
>> entry (bit #2 set in the segment selector). So the normal fast-path
>> should just have a simple test for that.
> 
> How?  Doesn't something still need to check whether SS is funny before
> doing iret?
> 

Ideally the tests should be doable such that on a normal machine they
can be overlapped with the other things we have to do on that
path.  The exit branch will be strongly predicted in the negative
direction, so it shouldn't be a significant problem.

Again, this is not the case in the current prototype.
	
	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:11                                     ` Andrew Lutomirski
  2014-04-22 17:15                                       ` H. Peter Anvin
@ 2014-04-22 17:19                                       ` Linus Torvalds
  2014-04-22 17:29                                         ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Linus Torvalds @ 2014-04-22 17:19 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>
>>
>> Anyway, if done correctly, this whole espfix should be totally free
>> for normal processes, since it should only trigger if SS is an LDT
>> entry (bit #2 set in the segment selector). So the normal fast-path
>> should just have a simple test for that.
>
> How?  Doesn't something still need to check whether SS is funny before
> doing iret?

Just test bit #2. Don't do anything else if it's clear, because you
should be done. You don't need to do anything special if it's clear,
because I don't *think* we have any 16-bit data segments in the GDT on
x86-64.

              Linus


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:09                                   ` H. Peter Anvin
@ 2014-04-22 17:20                                     ` Andrew Lutomirski
  2014-04-22 17:24                                       ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 17:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:09 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> As for Andy's questions:
>
>> What happens on the IST entries?  If I've read your patch right,
>> you're still switching back to the normal stack, which looks
>> questionable.
>
> No, in that case %rsp won't point into the espfix region, and the switch
> will be bypassed.  We will resume back into the espfix region on IRET,
> which is actually required e.g. if we take an NMI in the middle of the
> espfix setup.

Aha.  I misread that.  Would it be worth adding a comment along the lines of

/*
 * Check whether we are running on the espfix stack.  This is different
 * from checking whether we faulted from the espfix stack, since an IST
 * exception will have switched us off of the espfix stack.
 */

>
>> Also, if you want to save some register abuse on each exception entry,
>> could you check the saved RIP instead of the current RSP?  I.e. use
>> the test instruction with offset(%rsp)?  Maybe there are multiple
>> possible values, though, and just testing some bits doesn't help.
>
> I don't see how that would work.

It won't, given the above.  I misunderstood what you were checking.

It still seems to me that only #GP needs this special handling.  The
IST entries should never run on the espfix stack, and #MC, #DB, #NM,
and #SS (I missed that one earlier) use IST.

Would it ever make sense to make #GP use IST as well?  That might
allow espfix_adjust_stack to be removed entirely.  I don't know how
much other fiddling would be needed to make that work.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:20                                     ` Andrew Lutomirski
@ 2014-04-22 17:24                                       ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 17:24 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:20 AM, Andrew Lutomirski wrote:
> 
> It won't, given the above.  I misunderstood what you were checking.
> 
> It still seems to me that only #GP needs this special handling.  The
> IST entries should never run on the espfix stack, and #MC, #DB, NMI,
> and #SS (I missed that one earlier) use IST.
> 
> Would it ever make sense to make #GP use IST as well?  That might
> allow espfix_adjust_stack to be removed entirely.  I don't know how
> much other fiddling would be needed to make that work.
> 

Interesting thought.  It might even be able to share an IST entry with #SS.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:11                                     ` H. Peter Anvin
@ 2014-04-22 17:26                                       ` Borislav Petkov
  2014-04-22 17:29                                         ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: Borislav Petkov @ 2014-04-22 17:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andrew Lutomirski, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
> The fastpath interference is:
> 
> 1. Testing for an LDT SS selector before IRET.  This is actually easier
> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
> 
> 2. Testing for an RSP inside the espfix region on exception entry, so we
> can switch back the stack.  This has to be done very early in the
> exception entry since the espfix stack is so small.  If NMI and #MC
> didn't use IST it wouldn't work at all (but neither would SYSCALL).
> 
> #2 currently saves/restores %rdi, which is also saved further down.
> This is obviously wasteful.

Btw, can we runtime-patch the fastpath interference chunk the moment we
see a 16-bit segment?  I.e., hook it into the write_ldt() path, in the
hunk you've removed in there, and enable the espfix checks the moment we
load a 16-bit segment.

I know, I know, this is not so important right now but let me put it out
there just the same.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:26                                       ` Borislav Petkov
@ 2014-04-22 17:29                                         ` Andrew Lutomirski
  2014-04-22 19:27                                           ` Borislav Petkov
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 17:29 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Linus Torvalds, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:26 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
>> The fastpath interference is:
>>
>> 1. Testing for an LDT SS selector before IRET.  This is actually easier
>> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
>>
>> 2. Testing for an RSP inside the espfix region on exception entry, so we
>> can switch back the stack.  This has to be done very early in the
>> exception entry since the espfix stack is so small.  If NMI and #MC
>> didn't use IST it wouldn't work at all (but neither would SYSCALL).
>>
>> #2 currently saves/restores %rdi, which is also saved further down.
>> This is obviously wasteful.
>
> Btw, can we runtime-patch the fastpath interference chunk the moment we
> see a 16-bit segment?  I.e., hook it into the write_ldt() path, in the
> hunk you've removed in there, and enable the espfix checks the moment we
> load a 16-bit segment.
>
> I know, I know, this is not so important right now but let me put it out
> there just the same.

Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
16-bit LDT entry.

But I think it makes sense to nail down everything else first.  I
suspect that a single test-and-branch in the iret path will be lost in
the noise from iret itself.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:19                                       ` Linus Torvalds
@ 2014-04-22 17:29                                         ` H. Peter Anvin
  2014-04-22 17:46                                           ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 17:29 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Lutomirski
  Cc: Borislav Petkov, H. Peter Anvin, Linux Kernel Mailing List,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:19 AM, Linus Torvalds wrote:
> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>>
>>>
>>> Anyway, if done correctly, this whole espfix should be totally free
>>> for normal processes, since it should only trigger if SS is a LDT
>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>> should just have a simple test for that.
>>
>> How?  Doesn't something still need to check whether SS is funny before
>> doing iret?
> 
> Just test bit #2. Don't do anything else if it's clear, because you
> should be done. You don't need to do anything special if it's clear,
> because I don't *think* we have any 16-bit data segments in the GDT on
> x86-64.
> 

And we don't (neither do we on i386, and we depend on that invariance.)

Hence:

 irq_return:
+	/*
+	 * Are we returning to the LDT?  Note: in 64-bit mode
+	 * SS:RSP on the exception stack is always valid.
+	 */
+	testb $4,(SS-RIP)(%rsp)
+	jnz irq_return_ldt
+
+irq_return_iret:
 	INTERRUPT_RETURN
-	_ASM_EXTABLE(irq_return, bad_iret)
+	_ASM_EXTABLE(irq_return_iret, bad_iret)


That is the whole impact of the IRET path.

If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
make sure there is absolutely no way we could end up nested) then the
rest of the fixup code can go away and we kill the common path
exception-handling overhead; the only new overhead is the IST
indirection for #GP, which isn't a performance-critical fault (good
thing, because untangling #GP faults is a major effort.)

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:29                                         ` H. Peter Anvin
@ 2014-04-22 17:46                                           ` Andrew Lutomirski
  2014-04-22 17:59                                             ` H. Peter Anvin
  2014-04-22 18:03                                             ` Brian Gerst
  0 siblings, 2 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-22 17:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>>>
>>>>
>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>> for normal processes, since it should only trigger if SS is a LDT
>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>> should just have a simple test for that.
>>>
>>> How?  Doesn't something still need to check whether SS is funny before
>>> doing iret?
>>
>> Just test bit #2. Don't do anything else if it's clear, because you
>> should be done. You don't need to do anything special if it's clear,
>> because I don't *think* we have any 16-bit data segments in the GDT on
>> x86-64.
>>
>
> And we don't (neither do we on i386, and we depend on that invariance.)
>
> Hence:
>
>  irq_return:
> +       /*
> +        * Are we returning to the LDT?  Note: in 64-bit mode
> +        * SS:RSP on the exception stack is always valid.
> +        */
> +       testb $4,(SS-RIP)(%rsp)
> +       jnz irq_return_ldt
> +
> +irq_return_iret:
>         INTERRUPT_RETURN
> -       _ASM_EXTABLE(irq_return, bad_iret)
> +       _ASM_EXTABLE(irq_return_iret, bad_iret)
>
>
> That is the whole impact of the IRET path.
>
> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
> make sure there is absolutely no way we could end up nested) then the
> rest of the fixup code can go away and we kill the common path
> exception-handling overhead; the only new overhead is the IST
> indirection for #GP, which isn't a performance-critical fault (good
> thing, because untangling #GP faults is a major effort.)

I'd be a bit nervous about read_msr_safe and friends.  Also, what
happens if userspace triggers a #GP and the signal stack setup causes
a page fault?

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:46                                           ` Andrew Lutomirski
@ 2014-04-22 17:59                                             ` H. Peter Anvin
  2014-04-22 18:03                                             ` Brian Gerst
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 17:59 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Linus Torvalds, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:46 AM, Andrew Lutomirski wrote:
>>
>> That is the whole impact of the IRET path.
>>
>> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
>> make sure there is absolutely no way we could end up nested) then the
>> rest of the fixup code can go away and we kill the common path
>> exception-handling overhead; the only new overhead is the IST
>> indirection for #GP, which isn't a performance-critical fault (good
>> thing, because untangling #GP faults is a major effort.)
> 
> I'd be a bit nervous about read_msr_safe and friends.  Also, what
> happens if userspace triggers a #GP and the signal stack setup causes
> a page fault?
> 

Yes, #GPs inside the kernel could be a real problem.  MSRs generally
don't trigger #SS.  The second scenario shouldn't be a problem; the #PF
will be delivered on the currently active stack.

On the other hand, doing the espfix fixup only for #GP might be an
alternative, as long as we can convince ourselves that it really is the
only fault that could possibly be delivered on the espfix path.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:46                                           ` Andrew Lutomirski
  2014-04-22 17:59                                             ` H. Peter Anvin
@ 2014-04-22 18:03                                             ` Brian Gerst
  2014-04-22 18:06                                               ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-22 18:03 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Linus Torvalds, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 1:46 PM, Andrew Lutomirski <amluto@gmail.com> wrote:
> On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski <amluto@gmail.com> wrote:
>>>>
>>>>>
>>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>>> for normal processes, since it should only trigger if SS is a LDT
>>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>>> should just have a simple test for that.
>>>>
>>>> How?  Doesn't something still need to check whether SS is funny before
>>>> doing iret?
>>>
>>> Just test bit #2. Don't do anything else if it's clear, because you
>>> should be done. You don't need to do anything special if it's clear,
>>> because I don't *think* we have any 16-bit data segments in the GDT on
>>> x86-64.
>>>
>>
>> And we don't (neither do we on i386, and we depend on that invariance.)
>>
>> Hence:
>>
>>  irq_return:
>> +       /*
>> +        * Are we returning to the LDT?  Note: in 64-bit mode
>> +        * SS:RSP on the exception stack is always valid.
>> +        */
>> +       testb $4,(SS-RIP)(%rsp)
>> +       jnz irq_return_ldt
>> +
>> +irq_return_iret:
>>         INTERRUPT_RETURN
>> -       _ASM_EXTABLE(irq_return, bad_iret)
>> +       _ASM_EXTABLE(irq_return_iret, bad_iret)
>>
>>
>> That is the whole impact of the IRET path.
>>
>> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
>> make sure there is absolutely no way we could end up nested) then the
>> rest of the fixup code can go away and we kill the common path
>> exception-handling overhead; the only new overhead is the IST
>> indirection for #GP, which isn't a performance-critical fault (good
>> thing, because untangling #GP faults is a major effort.)
>
> I'd be a bit nervous about read_msr_safe and friends.  Also, what
> happens if userspace triggers a #GP and the signal stack setup causes
> a page fault?
>
> --Andy

Maybe make the #GP handler check what the previous stack was at the start:
1) If we came from userspace, switch to the top of the process stack.
2) If the previous stack was not the espfix stack, switch back to that stack.
3) Switch to the top of the process stack (espfix case)

This leaves the IST available for any recursive faults.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 18:03                                             ` Brian Gerst
@ 2014-04-22 18:06                                               ` H. Peter Anvin
  2014-04-22 18:17                                                 ` Brian Gerst
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 18:06 UTC (permalink / raw)
  To: Brian Gerst, Andrew Lutomirski
  Cc: Linus Torvalds, Borislav Petkov, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 11:03 AM, Brian Gerst wrote:
> 
> Maybe make the #GP handler check what the previous stack was at the start:
> 1) If we came from userspace, switch to the top of the process stack.
> 2) If the previous stack was not the espfix stack, switch back to that stack.
> 3) Switch to the top of the process stack (espfix case)
> 
> This leaves the IST available for any recursive faults.
> 

Do you actually know what the IST is?  If so, you should realize the
above is nonsense.

The *hardware* switches stack on an exception; if the vector is set up
as an IST, then we *always* switch to the IST stack, unconditionally.
If the vector is not, then we switch to the process stack if we came
from userspace.

That is the entry condition that we have to deal with.  The fact that
the switch to the IST is unconditional is what makes ISTs hard to deal with.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 18:06                                               ` H. Peter Anvin
@ 2014-04-22 18:17                                                 ` Brian Gerst
  2014-04-22 18:51                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-22 18:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Tue, Apr 22, 2014 at 2:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/22/2014 11:03 AM, Brian Gerst wrote:
>>
>> Maybe make the #GP handler check what the previous stack was at the start:
>> 1) If we came from userspace, switch to the top of the process stack.
>> 2) If the previous stack was not the espfix stack, switch back to that stack.
>> 3) Switch to the top of the process stack (espfix case)
>>
>> This leaves the IST available for any recursive faults.
>>
>
> Do you actually know what the IST is?  If so, you should realize the
> above is nonsense.
>
> The *hardware* switches stack on an exception; if the vector is set up
> as an IST, then we *always* switch to the IST stack, unconditionally.
> If the vector is not, then we switch to the process stack if we came
> from userspace.
>
> That is the entry condition that we have to deal with.  The fact that
> the switch to the IST is unconditional is what makes ISTs hard to deal with.

Right, that is why you switch away from the IST as soon as possible,
copying the data that is already pushed there to another stack so it
won't be overwritten by a recursive fault.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 18:17                                                 ` Brian Gerst
@ 2014-04-22 18:51                                                   ` H. Peter Anvin
  2014-04-22 19:55                                                     ` Brian Gerst
  2014-04-22 23:39                                                     ` Andi Kleen
  0 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 18:51 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>
>> That is the entry condition that we have to deal with.  The fact that
>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
> 
> Right, that is why you switch away from the IST as soon as possible,
> copying the data that is already pushed there to another stack so it
> won't be overwritten by a recursive fault.
> 

That simply will not work if you can take a #GP due to the "safe" MSR
functions from NMI and #MC context, which would be my main concern.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:29                                         ` Andrew Lutomirski
@ 2014-04-22 19:27                                           ` Borislav Petkov
  0 siblings, 0 replies; 136+ messages in thread
From: Borislav Petkov @ 2014-04-22 19:27 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, Linus Torvalds, H. Peter Anvin,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 10:29:45AM -0700, Andrew Lutomirski wrote:
> Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
> 16-bit LDT entry.

Or something like that, yep.

> But I think it makes sense to nail down everything else first. I
> suspect that a single test-and-branch in the iret path will be lost in
> the noise from iret itself.

The cumulative effects of such additions here and there are nasty
though. If we can make the general path free relatively painlessly, we
should do it, IMO.

But yeah, later.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 18:51                                                   ` H. Peter Anvin
@ 2014-04-22 19:55                                                     ` Brian Gerst
  2014-04-22 20:17                                                       ` H. Peter Anvin
  2014-04-22 23:39                                                     ` Andi Kleen
  1 sibling, 1 reply; 136+ messages in thread
From: Brian Gerst @ 2014-04-22 19:55 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>
>>> That is the entry condition that we have to deal with.  The fact that
>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>
>> Right, that is why you switch away from the IST as soon as possible,
>> copying the data that is already pushed there to another stack so it
>> won't be overwritten by a recursive fault.
>>
>
> That simply will not work if you can take a #GP due to the "safe" MSR
> functions from NMI and #MC context, which would be my main concern.

In that case (#2 above), you would switch to the previous %rsp (in the
NMI/MC stack), copy the exception frame from the IST, and continue
with the #GP handler.  That effectively is the same as it is today,
where no stack switch occurs on the #GP fault.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 19:55                                                     ` Brian Gerst
@ 2014-04-22 20:17                                                       ` H. Peter Anvin
  2014-04-22 23:08                                                         ` Brian Gerst
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 20:17 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/22/2014 12:55 PM, Brian Gerst wrote:
> On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>>
>>>> That is the entry condition that we have to deal with.  The fact that
>>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>>
>>> Right, that is why you switch away from the IST as soon as possible,
>>> copying the data that is already pushed there to another stack so it
>>> won't be overwritten by a recursive fault.
>>>
>>
>> That simply will not work if you can take a #GP due to the "safe" MSR
>> functions from NMI and #MC context, which would be my main concern.
> 
> In that case (#2 above), you would switch to the previous %rsp (in the
> NMI/MC stack), copy the exception frame from the IST, and continue
> with the #GP handler.  That effectively is the same as it is today,
> where no stack switch occurs on the #GP fault.
> 

1. You take #GP.  This causes an IST stack switch.
2. You immediately thereafter take an NMI.  This switches stacks again.
3. Now you take another #GP.  This causes another IST stack, and now you
have clobbered your return information, and cannot resume.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 20:17                                                       ` H. Peter Anvin
@ 2014-04-22 23:08                                                         ` Brian Gerst
  0 siblings, 0 replies; 136+ messages in thread
From: Brian Gerst @ 2014-04-22 23:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Tue, Apr 22, 2014 at 4:17 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/22/2014 12:55 PM, Brian Gerst wrote:
>> On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>>>
>>>>> That is the entry condition that we have to deal with.  The fact that
>>>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>>>
>>>> Right, that is why you switch away from the IST as soon as possible,
>>>> copying the data that is already pushed there to another stack so it
>>>> won't be overwritten by a recursive fault.
>>>>
>>>
>>> That simply will not work if you can take a #GP due to the "safe" MSR
>>> functions from NMI and #MC context, which would be my main concern.
>>
>> In that case (#2 above), you would switch to the previous %rsp (in the
>> NMI/MC stack), copy the exception frame from the IST, and continue
>> with the #GP handler.  That effectively is the same as it is today,
>> where no stack switch occurs on the #GP fault.
>>
>
> 1. You take #GP.  This causes an IST stack switch.
> 2. You immediately thereafter take an NMI.  This switches stacks again.
> 3. Now you take another #GP.  This causes another IST stack, and now you
> have clobbered your return information, and cannot resume.

You are right.  That will not work.


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 18:51                                                   ` H. Peter Anvin
  2014-04-22 19:55                                                     ` Brian Gerst
@ 2014-04-22 23:39                                                     ` Andi Kleen
  2014-04-22 23:40                                                       ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Andi Kleen @ 2014-04-22 23:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Brian Gerst, Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

> That simply will not work if you can take a #GP due to the "safe" MSR
> functions from NMI and #MC context, which would be my main concern.

At some point the IST entry functions subtracted 1K while the
handler ran to handle simple nesting cases.

Not sure that code is still there.

-Andi


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 23:39                                                     ` Andi Kleen
@ 2014-04-22 23:40                                                       ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-22 23:40 UTC (permalink / raw)
  To: Andi Kleen, H. Peter Anvin
  Cc: Brian Gerst, Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	Linux Kernel Mailing List, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Arjan van de Ven,
	Alexandre Julliard, Thomas Gleixner

On 04/22/2014 04:39 PM, Andi Kleen wrote:
>> That simply will not work if you can take a #GP due to the "safe" MSR
>> functions from NMI and #MC context, which would be my main concern.
> 
> At some point the IST entry functions subtracted 1K while the
> handler ran to handle simple nesting cases.
> 
> Not sure that code is still there.

Doesn't help if you take an NMI on the first instruction of the #GP handler.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
  2014-04-21 23:19   ` Andrew Lutomirski
  2014-04-22 11:25   ` Borislav Petkov
@ 2014-04-23  1:17   ` H. Peter Anvin
  2014-04-23  1:23     ` Andrew Lutomirski
  2014-04-25 21:02     ` Konrad Rzeszutek Wilk
  2014-04-24  4:13   ` comex
  2014-04-25 12:02   ` Pavel Machek
  4 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23  1:17 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Andy Lutomirski, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 456 bytes --]

Another spin of the prototype.  This one avoids the espfix for anything
but #GP, and avoids extra saving/restoring of registers... one can wonder,
though, how much that actually matters in practice.

It still does redundant SWAPGS on the slow path.  I'm not sure I
personally care enough to optimize that, as it means some fairly
significant restructuring of some of the code paths.  Some of that
restructuring might actually be beneficial, but still...

	-hpa


[-- Attachment #2: diff.txt --]
[-- Type: text/plain, Size: 11778 bytes --]

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 9264f04a4c55..cea5b9b517f2 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,8 @@ extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+extern void init_espfix_this_cpu(void);
+
 #ifndef _SETUP
 
 /*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d96000d33a..1cc3789d99d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_X86_64)	+= sys_x86_64.o x8664_ksyms_64.o
 obj-y			+= syscall_$(BITS).o vsyscall_gtod.o
 obj-$(CONFIG_X86_64)	+= vsyscall_64.o
 obj-$(CONFIG_X86_64)	+= vsyscall_emu_64.o
+obj-$(CONFIG_X86_64)	+= espfix_64.o
 obj-$(CONFIG_SYSFS)	+= ksysfs.o
 obj-y			+= bootflag.o e820.o
 obj-y			+= pci-dma.o quirks.o topology.o kdebugfs.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..7f71c97f59c0 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,7 @@
 #include <asm/asm.h>
 #include <asm/context_tracking.h>
 #include <asm/smap.h>
+#include <asm/pgtable_types.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -1040,8 +1041,16 @@ restore_args:
 	RESTORE_ARGS 1,8,1
 
 irq_return:
+	/*
+	 * Are we returning to the LDT?  Note: in 64-bit mode
+	 * SS:RSP on the exception stack is always valid.
+	 */
+	testb $4,(SS-RIP)(%rsp)
+	jnz irq_return_ldt
+
+irq_return_iret:
 	INTERRUPT_RETURN
-	_ASM_EXTABLE(irq_return, bad_iret)
+	_ASM_EXTABLE(irq_return_iret, bad_iret)
 
 #ifdef CONFIG_PARAVIRT
 ENTRY(native_iret)
@@ -1049,6 +1058,34 @@ ENTRY(native_iret)
 	_ASM_EXTABLE(native_iret, bad_iret)
 #endif
 
+irq_return_ldt:
+	pushq_cfi %rcx
+	larl (CS-RIP+8)(%rsp), %ecx
+	jnz 1f		/* Invalid segment - will #GP at IRET time */
+	testl $0x00200000, %ecx
+	jnz 1f		/* Returning to 64-bit mode */
+	larl (SS-RIP+8)(%rsp), %ecx
+	jnz 1f		/* Invalid segment - will #SS at IRET time */
+	testl $0x00400000, %ecx
+	jnz 1f		/* Not a 16-bit stack segment */
+	pushq_cfi %rsi
+	pushq_cfi %rdi
+	SWAPGS
+	movq PER_CPU_VAR(espfix_stack),%rdi
+	movl (RSP-RIP+3*8)(%rsp),%esi
+	xorw %si,%si
+	orq %rsi,%rdi
+	movq %rsp,%rsi
+	movl $8,%ecx
+	rep;movsq
+	leaq -(8*8)(%rdi),%rsp
+	SWAPGS
+	popq_cfi %rdi
+	popq_cfi %rsi
+1:
+	popq_cfi %rcx
+	jmp irq_return_iret
+
 	.section .fixup,"ax"
 bad_iret:
 	/*
@@ -1058,6 +1095,7 @@ bad_iret:
 	 * So pretend we completed the iret and took the #GPF in user mode.
 	 *
 	 * We are now running with the kernel GS after exception recovery.
+	 * Exception entry will have removed us from the espfix stack.
 	 * But error_entry expects us to have user GS to match the user %cs,
 	 * so swap back.
 	 */
@@ -1278,6 +1316,62 @@ ENTRY(\sym)
 END(\sym)
 .endm
 
+/*
+ * Same as errorentry, except use for #GP in case we take the exception
+ * while on the espfix stack.  All other exceptions that are possible while
+ * on the espfix stack use IST, but that is not really practical for #GP
+ * for nesting reasons.
+ */
+.macro errorentry_espfix sym do_sym
+ENTRY(\sym)
+	XCPT_FRAME
+	ASM_CLAC
+	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	/* Check if we are on the espfix stack */
+	pushq_cfi %rdi
+	pushq_cfi %rsi
+	movq %rsp,%rdi
+	sarq $PGDIR_SHIFT,%rdi
+	cmpl $-2,%edi			/* Are we on the espfix stack? */
+	CFI_REMEMBER_STATE
+	je 1f
+2:
+	subq $RSI-R15, %rsp
+	CFI_ADJUST_CFA_OFFSET RSI-R15
+	call error_entry_rdi_rsi_saved
+	DEFAULT_FRAME 0
+	movq %rsp,%rdi			/* pt_regs pointer */
+	movq ORIG_RAX(%rsp),%rsi	/* get error code */
+	movq $-1,ORIG_RAX(%rsp)		/* no syscall to restart */
+	call \do_sym
+	jmp error_exit			/* %ebx: no swapgs flag */
+1:
+	CFI_RESTORE_STATE
+	SWAPGS
+	movq PER_CPU_VAR(kernel_stack),%rdi
+	SWAPGS
+	/* Copy data from the espfix stack to the real stack */
+	movq %rsi,-64(%rdi)		/* Saved value of %rsi already */
+	movq 8(%rsp),%rsi
+	movq %rsi,-56(%rdi)
+	movq 16(%rsp),%rsi
+	movq %rsi,-48(%rdi)
+	movq 24(%rsp),%rsi
+	movq %rsi,-40(%rdi)
+	movq 32(%rsp),%rsi
+	movq %rsi,-32(%rdi)
+	movq 40(%rsp),%rsi
+	movq %rsi,-24(%rdi)
+	movq 48(%rsp),%rsi
+	movq %rsi,-16(%rdi)
+	movq 56(%rsp),%rsi
+	movq %rsi,-8(%rdi)
+	leaq -64(%rdi),%rsp
+	jmp 2b
+	CFI_ENDPROC
+END(\sym)
+.endm
+
 #ifdef CONFIG_TRACING
 .macro trace_errorentry sym do_sym
 errorentry trace(\sym) trace(\do_sym)
@@ -1323,7 +1417,6 @@ zeroentry coprocessor_error do_coprocessor_error
 errorentry alignment_check do_alignment_check
 zeroentry simd_coprocessor_error do_simd_coprocessor_error
 
-
 	/* Reload gs selector with exception handling */
 	/* edi:  new selector */
 ENTRY(native_load_gs_index)
@@ -1490,7 +1583,7 @@ zeroentry xen_debug do_debug
 zeroentry xen_int3 do_int3
 errorentry xen_stack_segment do_stack_segment
 #endif
-errorentry general_protection do_general_protection
+errorentry_espfix general_protection do_general_protection
 trace_errorentry page_fault do_page_fault
 #ifdef CONFIG_KVM_GUEST
 errorentry async_page_fault do_async_page_fault
@@ -1567,9 +1660,10 @@ ENTRY(error_entry)
 	XCPT_FRAME
 	CFI_ADJUST_CFA_OFFSET 15*8
 	/* oldrax contains error code */
-	cld
 	movq_cfi rdi, RDI+8
 	movq_cfi rsi, RSI+8
+error_entry_rdi_rsi_saved:
+	cld
 	movq_cfi rdx, RDX+8
 	movq_cfi rcx, RCX+8
 	movq_cfi rax, RAX+8
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
new file mode 100644
index 000000000000..05567d706f92
--- /dev/null
+++ b/arch/x86/kernel/espfix_64.c
@@ -0,0 +1,136 @@
+/* ----------------------------------------------------------------------- *
+ *
+ *   Copyright 2014 Intel Corporation; author: H. Peter Anvin
+ *
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
+ *
+ * ----------------------------------------------------------------------- */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/gfp.h>
+#include <asm/pgtable.h>
+
+#define ESPFIX_STACK_SIZE	64UL
+#define ESPFIX_STACKS_PER_PAGE	(PAGE_SIZE/ESPFIX_STACK_SIZE)
+
+#define ESPFIX_MAX_CPUS (ESPFIX_STACKS_PER_PAGE << (PGDIR_SHIFT-PAGE_SHIFT-16))
+#if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
+# error "Need more than one PGD for the ESPFIX hack"
+#endif
+
+#define ESPFIX_BASE_ADDR	(-2UL << PGDIR_SHIFT)
+
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+
+/* This contains the *bottom* address of the espfix stack */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
+
+/* Initialization mutex - should this be a spinlock? */
+static DEFINE_MUTEX(espfix_init_mutex);
+
+/* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
+#define ESPFIX_MAX_PAGES  DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
+#define ESPFIX_MAP_SIZE   DIV_ROUND_UP(ESPFIX_MAX_PAGES, BITS_PER_LONG)
+static unsigned long espfix_page_alloc_map[ESPFIX_MAP_SIZE];
+
+static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
+	__aligned(PAGE_SIZE);
+
+/*
+ * This returns the bottom address of the espfix stack for a specific CPU.
+ * The math allows for a non-power-of-two ESPFIX_STACK_SIZE, in which case
+ * we have to account for some amount of padding at the end of each page.
+ */
+static inline unsigned long espfix_base_addr(unsigned int cpu)
+{
+	unsigned long page, addr;
+
+	page = (cpu / ESPFIX_STACKS_PER_PAGE) << PAGE_SHIFT;
+	addr = page + (cpu % ESPFIX_STACKS_PER_PAGE) * ESPFIX_STACK_SIZE;
+	addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
+	addr += ESPFIX_BASE_ADDR;
+	return addr;
+}
+
+#define PTE_STRIDE        (65536/PAGE_SIZE)
+#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
+#define ESPFIX_PMD_CLONES PTRS_PER_PMD
+#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
+
+void init_espfix_this_cpu(void)
+{
+	unsigned int cpu, page;
+	unsigned long addr;
+	pgd_t pgd, *pgd_p;
+	pud_t pud, *pud_p;
+	pmd_t pmd, *pmd_p;
+	pte_t pte, *pte_p;
+	int n;
+	void *stack_page;
+	pteval_t ptemask;
+
+	/* We only have to do this once... */
+	if (likely(this_cpu_read(espfix_stack)))
+		return;		/* Already initialized */
+
+	cpu = smp_processor_id();
+	addr = espfix_base_addr(cpu);
+	page = cpu/ESPFIX_STACKS_PER_PAGE;
+
+	/* Did another CPU already set this up? */
+	if (likely(test_bit(page, espfix_page_alloc_map)))
+		goto done;
+
+	mutex_lock(&espfix_init_mutex);
+
+	/* Did we race on the lock? */
+	if (unlikely(test_bit(page, espfix_page_alloc_map)))
+		goto unlock_done;
+
+	ptemask = __supported_pte_mask;
+
+	pgd_p = &init_level4_pgt[pgd_index(addr)];
+	pgd = *pgd_p;
+	if (!pgd_present(pgd)) {
+		/* This can only happen on the BSP */
+		pgd = __pgd(__pa_symbol(espfix_pud_page) |
+			    (_KERNPG_TABLE & ptemask));
+		set_pgd(pgd_p, pgd);
+	}
+
+	pud_p = &espfix_pud_page[pud_index(addr)];
+	pud = *pud_p;
+	if (!pud_present(pud)) {
+		pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+		pud = __pud(__pa(pmd_p) | (_KERNPG_TABLE & ptemask));
+		for (n = 0; n < ESPFIX_PUD_CLONES; n++)
+			set_pud(&pud_p[n], pud);
+	}
+
+	pmd_p = pmd_offset(&pud, addr);
+	pmd = *pmd_p;
+	if (!pmd_present(pmd)) {
+		pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+		pmd = __pmd(__pa(pte_p) | (_KERNPG_TABLE & ptemask));
+		for (n = 0; n < ESPFIX_PMD_CLONES; n++)
+			set_pmd(&pmd_p[n], pmd);
+	}
+
+	pte_p = pte_offset_kernel(&pmd, addr);
+	stack_page = (void *)__get_free_page(GFP_KERNEL);
+	pte = __pte(__pa(stack_page) | (__PAGE_KERNEL & ptemask));
+	for (n = 0; n < ESPFIX_PTE_CLONES; n++)
+		set_pte(&pte_p[n*PTE_STRIDE], pte);
+
+	/* Job is done for this CPU and any CPU which shares this page */
+	set_bit(page, espfix_page_alloc_map);
+
+unlock_done:
+	mutex_unlock(&espfix_init_mutex);
+done:
+	this_cpu_write(espfix_stack, addr);
+}
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index af1d14a9ebda..ebc987398923 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -229,17 +229,6 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 		}
 	}
 
-	/*
-	 * On x86-64 we do not support 16-bit segments due to
-	 * IRET leaking the high bits of the kernel stack address.
-	 */
-#ifdef CONFIG_X86_64
-	if (!ldt_info.seg_32bit) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-#endif
-
 	fill_ldt(&ldt, &ldt_info);
 	if (oldmode)
 		ldt.avl = 0;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 34826934d4a7..7956aad1a710 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -244,6 +244,11 @@ static void notrace start_secondary(void *unused)
 	check_tsc_sync_target();
 
 	/*
+	 * Enable the espfix hack for this CPU
+	 */
+	init_espfix_this_cpu();
+
+	/*
 	 * We need to hold vector_lock so there the set of online cpus
 	 * does not change while we are assigning vectors to cpus.  Holding
 	 * this lock ensures we don't half assign or remove an irq from a cpu.
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 20621d753d5f..96bf767a05fc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -327,6 +327,8 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
 	int i;
 	struct pg_state st = {};
 
+	st.to_dmesg = true;
+
 	if (pgd) {
 		start = pgd;
 		st.to_dmesg = true;
diff --git a/init/main.c b/init/main.c
index 9c7fd4c9249f..6230d4b7ce1b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -617,6 +617,10 @@ asmlinkage void __init start_kernel(void)
 	if (efi_enabled(EFI_RUNTIME_SERVICES))
 		efi_enter_virtual_mode();
 #endif
+#ifdef CONFIG_X86_64
+	/* Should be run before the first non-init thread is created */
+	init_espfix_this_cpu();
+#endif
 	thread_info_cache_init();
 	cred_init();
 	fork_init(totalram_pages);

^ permalink raw reply related	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23  1:17   ` H. Peter Anvin
@ 2014-04-23  1:23     ` Andrew Lutomirski
  2014-04-23  1:42       ` H. Peter Anvin
  2014-04-25 21:02     ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-23  1:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 6:17 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
> Another spin of the prototype.  This one avoids the espfix for anything
> but #GP, and avoids saving/restoring registers... one can wonder,
> though, how much that actually matters in practice.
>
> It still does redundant SWAPGS on the slow path.  I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths.  Some of that
> restructuring might actually be beneficial, but still...
>

What's the to_dmesg thing for?

It looks sane, although I haven't checked the detailed register manipulation.

Users of big systems may complain when every single CPU lines up for
that mutex.  Maybe no one cares.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23  1:23     ` Andrew Lutomirski
@ 2014-04-23  1:42       ` H. Peter Anvin
  2014-04-23 14:24         ` Boris Ostrovsky
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23  1:42 UTC (permalink / raw)
  To: Andrew Lutomirski, H. Peter Anvin
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:
> 
> What's the to_dmesg thing for?
> 

It's for debugging... the espfix page tables generate so many duplicate
entries that trying to output it via a seqfile runs out of memory.  I
suspect we need to do something like skip the espfix range or some other
hack.

> It looks sane, although I haven't checked the detailed register manipulation.
> 
> Users of big systems may complain when every single CPU lines up for
> that mutex.  Maybe no one cares.

Right now the whole smpboot sequence is fully serialized... that needs
to be fixed.

Konrad - I really could use some help figuring out what needs to be done
for this not to break Xen.

	-hpa


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:04                                   ` Linus Torvalds
  2014-04-22 17:11                                     ` Andrew Lutomirski
  2014-04-22 17:11                                     ` H. Peter Anvin
@ 2014-04-23  6:24                                     ` H. Peter Anvin
  2014-04-23  8:57                                       ` Alexandre Julliard
  2 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23  6:24 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Lutomirski
  Cc: Borislav Petkov, H. Peter Anvin, Linux Kernel Mailing List,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 10:04 AM, Linus Torvalds wrote:
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.
>
> And if you have a SS that is a descriptor in the LDT, nobody cares
> about performance any more.
>

I just realized that with the LDT being a process-level object (unlike 
the GDT), we need to remove the filtering on the espfix hack, both for 
32-bit and 64-bit kernels.  Otherwise there is a race condition between 
executing the LAR instruction in the filter and the IRET, which could 
allow the leak to become manifest.

The "good" part is that I think the espfix hack is harmless even with a 
32/64-bit stack segment, although it has a substantial performance penalty.

Does anyone have any idea if there is a real use case for non-16-bit LDT 
segments used as the stack segment?  Does Wine use anything like that?

Very old NPTL Linux binaries use LDT segments, but only for data segments.

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23  6:24                                     ` H. Peter Anvin
@ 2014-04-23  8:57                                       ` Alexandre Julliard
  0 siblings, 0 replies; 136+ messages in thread
From: Alexandre Julliard @ 2014-04-23  8:57 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Andrew Lutomirski, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Andi Kleen, Thomas Gleixner

"H. Peter Anvin" <hpa@zytor.com> writes:

> Does anyone have any idea if there is a real use case for non-16-bit
> LDT segments used as the stack segment?  Does Wine use anything like
> that?

Wine uses them for DPMI support, though that would only get used when
vm86 mode is available.

-- 
Alexandre Julliard
julliard@winehq.org

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-22 17:15                                       ` H. Peter Anvin
@ 2014-04-23  9:54                                         ` One Thousand Gnomes
  2014-04-23 15:53                                           ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: One Thousand Gnomes @ 2014-04-23  9:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

> Ideally the tests should be doable such that on a normal machine the
> tests can be overlapped with the other things we have to do on that
> path.  The exit branch will be strongly predicted in the negative
> direction, so it shouldn't be a significant problem.
> 
> Again, this is not the case in the current prototype.

Or you make sure that you switch to those code paths only after software
has executed syscalls that make it possible that it will use a 16-bit SS.

The other question I have is - is there any reason we can't fix up the
IRET to do a 32bit return into a vsyscall type userspace page which then
does a long jump or retf to the right place ?

Alan

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23  1:42       ` H. Peter Anvin
@ 2014-04-23 14:24         ` Boris Ostrovsky
  2014-04-23 16:56           ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Boris Ostrovsky @ 2014-04-23 14:24 UTC (permalink / raw)
  To: H. Peter Anvin, Konrad Rzeszutek Wilk
  Cc: Andrew Lutomirski, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Alexander van Heukelum,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/22/2014 09:42 PM, H. Peter Anvin wrote:
> On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:
>> What's the to_dmesg thing for?
>>
> It's for debugging... the espfix page tables generate so many duplicate
> entries that trying to output it via a seqfile runs out of memory.  I
> suspect we need to do something like skip the espfix range or some other
> hack.
>
>> It looks sane, although I haven't checked the detailed register manipulation.
>>
>> Users of big systems may complain when every single CPU lines up for
>> that mutex.  Maybe no one cares.
> Right now the whole smpboot sequence is fully serialized... that needs
> to be fixed.
>
> Konrad - I really could use some help figuring out what needs to be done
> for this not to break Xen.

This does break Xen PV:

[    3.683735] ------------[ cut here ]------------
[    3.683807] WARNING: CPU: 0 PID: 0 at arch/x86/xen/multicalls.c:129 
xen_mc_flush+0x1c8/0x1d0()
[    3.683903] Modules linked in:
[    3.684006] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.15.0-rc2 #2

[    3.684176]  0000000000000009 ffffffff81c01de0 ffffffff816cfb15 
0000000000000000
[    3.684416]  ffffffff81c01e18 ffffffff81084abd 0000000000000000 
0000000000000001
[    3.684654]  0000000000000000 ffff88023da0b180 0000000000000010 
ffffffff81c01e28
[    3.684893] Call Trace:
[    3.684962]  [<ffffffff816cfb15>] dump_stack+0x45/0x56
[    3.685032]  [<ffffffff81084abd>] warn_slowpath_common+0x7d/0xa0
[    3.685102]  [<ffffffff81084b9a>] warn_slowpath_null+0x1a/0x20
[    3.685171]  [<ffffffff810050a8>] xen_mc_flush+0x1c8/0x1d0
[    3.685240]  [<ffffffff81008155>] xen_set_pgd+0x1f5/0x220
[    3.685310]  [<ffffffff8101975a>] init_espfix_this_cpu+0x36a/0x380
[    3.685379]  [<ffffffff813cb559>] ? acpi_tb_initialize_facs+0x31/0x33
[    3.685450]  [<ffffffff81d27ec6>] start_kernel+0x37f/0x411
[    3.685517]  [<ffffffff81d27950>] ? repair_env_string+0x5c/0x5c
[    3.685586]  [<ffffffff81d27606>] x86_64_start_reservations+0x2a/0x2c
[    3.685654]  [<ffffffff81d2a6df>] xen_start_kernel+0x594/0x5a0
[    3.685728] ---[ end trace a2cf2d7b2ecab826 ]---

But then I think we may want to rearrange preempt_enable/disable in 
xen_set_pgd().

-boris

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23  9:54                                         ` One Thousand Gnomes
@ 2014-04-23 15:53                                           ` H. Peter Anvin
  2014-04-23 17:08                                             ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23 15:53 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Andrew Lutomirski, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>> Ideally the tests should be doable such that on a normal machine the
>> tests can be overlapped with the other things we have to do on that
>> path.  The exit branch will be strongly predicted in the negative
>> direction, so it shouldn't be a significant problem.
>>
>> Again, this is not the case in the current prototype.
> 
> Or you make sure that you switch to those code paths only after software
> has executed syscalls that make it possible it will use a 16bit ss. 
> 

Which, again, would introduce a race, I believe, at least if we have an
LDT at all (and since we only enter these code paths for LDT descriptors
in the first place, it is equivalent to the current code minus the filters.)

> The other question I have is - is there any reason we can't fix up the
> IRET to do a 32bit return into a vsyscall type userspace page which then
> does a long jump or retf to the right place ?

I did a writeup on this a while ago.  It does have the problem that you
need additional memory in userspace, which is per-thread and in the
right region of userspace; this pretty much means you have to muck about
with the user space stack when user space is running in weird modes.
This gets complex very quickly and does have some "footprint".
Furthermore, on some CPUs (not including any recent Intel CPUs) there is
still a way to leak bits [63:32].  I believe the in-kernel solution is
actually simpler.

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 14:24         ` Boris Ostrovsky
@ 2014-04-23 16:56           ` H. Peter Anvin
  2014-04-28 13:04             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23 16:56 UTC (permalink / raw)
  To: Boris Ostrovsky, Konrad Rzeszutek Wilk
  Cc: Andrew Lutomirski, H. Peter Anvin, Linux Kernel Mailing List,
	Linus Torvalds, Ingo Molnar, Alexander van Heukelum,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
>>
>> Konrad - I really could use some help figuring out what needs to be done
>> for this not to break Xen.
> 
> This does break Xen PV:
> 

I know it does.  This is why I asked for help.

This is fundamentally the problem with PV and *especially* the way Xen
PV was integrated into Linux: it acts as a development brake for native
hardware.  Fortunately, Konrad has been quite responsive to this kind of
problem, which hasn't always been true of the Xen community in the past.

	-hpa


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 15:53                                           ` H. Peter Anvin
@ 2014-04-23 17:08                                             ` Andrew Lutomirski
  2014-04-23 17:16                                               ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-23 17:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: One Thousand Gnomes, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Wed, Apr 23, 2014 at 8:53 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>>> Ideally the tests should be doable such that on a normal machine the
>>> tests can be overlapped with the other things we have to do on that
>>> path.  The exit branch will be strongly predicted in the negative
>>> direction, so it shouldn't be a significant problem.
>>>
>>> Again, this is not the case in the current prototype.
>>
>> Or you make sure that you switch to those code paths only after software
>> has executed syscalls that make it possible it will use a 16bit ss.
>>
>
> Which, again, would introduce a race, I believe, at least if we have an
> LDT at all (and since we only enter these code paths for LDT descriptors
> in the first place, it is equivalent to the current code minus the filters.)

The only way I can see to trigger the race is with sigreturn, but it's
still there.  Sigh.

Here are two semi-related things:

1. The Intel manual's description of iretq seems to have forgotten
to mention that iret restores the stack pointer in anything except
vm86 mode.  Fortunately, the AMD manual seems to think that, when
returning *from* 64-bit mode, RSP is always restored, which I think is
necessary for this patch to work correctly.

2. I've often pondered changing the way we return *to* CPL 0 to bypass
iret entirely.  It could be something like:

SS
RSP
EFLAGS
CS
RIP

push 16($rsp)
popfq [does this need to force rex.w somehow?]
ret $64

This may break backtraces if cfi isn't being used and we get an NMI
just before the popfq.  I'm not quite sure how that works.

I haven't benchmarked this at all, but the only slow part should be
the popfq, and I doubt it's anywhere near as slow as iret.
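
Spelled out against the five-word frame listed above (RIP at the lowest
address, SS at the highest), the idea would look roughly like the following
untested pseudo-assembly. Note the `ret` immediate shown here is 32 (the four
remaining 8-byte words); the $64 in the sketch above may assume additional
frame contents, so treat the exact immediate as an open detail.

```
	/* Frame at %rsp, low to high: RIP, CS, EFLAGS, RSP, SS */
	pushq	16(%rsp)	/* copy of the saved EFLAGS now on top */
	popfq			/* restore flags without an IRET */
	retq	$32		/* pop RIP, discard CS/EFLAGS/RSP/SS; for a
				 * same-stack return %rsp then ends up just
				 * past the frame, matching the saved RSP */
```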

>
>> The other question I have is - is there any reason we can't fix up the
>> IRET to do a 32bit return into a vsyscall type userspace page which then
>> does a long jump or retf to the right place ?
>
> I did a writeup on this a while ago.  It does have the problem that you
> need additional memory in userspace, which is per-thread and in the
> right region of userspace; this pretty much means you have to muck about
> with the user space stack when user space is running in weird modes.
> This gets complex very quickly and does have some "footprint".
> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
> still a way to leak bits [63:32].  I believe the in-kernel solution is
> actually simpler.
>

There's also no real guarantee that user code won't unmap the vdso.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 17:08                                             ` Andrew Lutomirski
@ 2014-04-23 17:16                                               ` H. Peter Anvin
  2014-04-23 17:25                                                 ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23 17:16 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: One Thousand Gnomes, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
> 
> The only way I can see to trigger the race is with sigreturn, but it's
> still there.  Sigh.
> 

I don't see why sigreturn needs to be involved... all you need is
modify_ldt() on one CPU while the other is in the middle of an IRET
return.  Small window, so hard to hit, but still.

> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
> iret entirely.  It could be something like:
> 
> SS
> RSP
> EFLAGS
> CS
> RIP
> 
> push 16($rsp)
> popfq [does this need to force rex.w somehow?]
> ret $64

When you say return to CPL 0 you mean intra-kernel return?  That isn't
really the problem here, though.  I think this will also break the
kernel debugger since it will have the wrong behavior for TF and RF.

>>> The other question I have is - is there any reason we can't fix up the
>>> IRET to do a 32bit return into a vsyscall type userspace page which then
>>> does a long jump or retf to the right place ?
>>
>> I did a writeup on this a while ago.  It does have the problem that you
>> need additional memory in userspace, which is per-thread and in the
>> right region of userspace; this pretty much means you have to muck about
>> with the user space stack when user space is running in weird modes.
>> This gets complex very quickly and does have some "footprint".
>> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
>> still a way to leak bits [63:32].  I believe the in-kernel solution is
>> actually simpler.
>>
> 
> There's also no real guarantee that user code won't unmap the vdso.

There is, but there is also at some point a "don't do that, then" aspect
to it all.

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 17:16                                               ` H. Peter Anvin
@ 2014-04-23 17:25                                                 ` Andrew Lutomirski
  2014-04-23 17:28                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-23 17:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: One Thousand Gnomes, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>
>> The only way I can see to trigger the race is with sigreturn, but it's
>> still there.  Sigh.
>>
>
> I don't see why sigreturn needs to be involved... all you need is
> modify_ldt() on one CPU while the other is in the middle of an IRET
> return.  Small window, so hard to hit, but still.
>

If you set the flag as soon as anyone calls modify_ldt, before any
descriptor is installed, then I don't think this can happen.  But
there's still sigreturn, and I don't think this is worth all the
complexity to save a single branch on #GP.

>> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
>> iret entirely.  It could be something like:
>>
>> SS
>> RSP
>> EFLAGS
>> CS
>> RIP
>>
>> push 16($rsp)
>> popfq [does this need to force rex.w somehow?]
>> ret $64
>
> When you say return to CPL 0 you mean intra-kernel return?  That isn't
> really the problem here, though.  I think this will also break the
> kernel debugger since it will have the wrong behavior for TF and RF.

I do mean intra-kernel.  And yes, this has nothing to do with espfix,
but it would make write_msr_safe fail more quickly :)

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 17:25                                                 ` Andrew Lutomirski
@ 2014-04-23 17:28                                                   ` H. Peter Anvin
  2014-04-23 17:45                                                     ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-23 17:28 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: One Thousand Gnomes, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>>
>>> The only way I can see to trigger the race is with sigreturn, but it's
>>> still there.  Sigh.
>>
>> I don't see why sigreturn needs to be involved... all you need is
>> modify_ldt() on one CPU while the other is in the middle of an IRET
>> return.  Small window, so hard to hit, but still.
> 
> If you set the flag as soon as anyone calls modify_ldt, before any
> descriptor is installed, then I don't think this can happen.  But
> there's still sigreturn, and I don't think this is worth all the
> complexity to save a single branch on #GP.
> 

Who cares?  Since we only need to enter the fixup path for LDT
selectors, anything that is dependent on having called modify_ldt() is
already redundant.

In some ways that is the saving grace.  SS being an LDT selector is
fortunately a rare case.

> I do mean intra-kernel.  And yes, this has nothing to do with espfix,
> but it would make write_msr_safe fail more quickly :)

And, pray tell, how important is that?

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 17:28                                                   ` H. Peter Anvin
@ 2014-04-23 17:45                                                     ` Andrew Lutomirski
  0 siblings, 0 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-23 17:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: One Thousand Gnomes, Linus Torvalds, Borislav Petkov,
	H. Peter Anvin, Linux Kernel Mailing List, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Wed, Apr 23, 2014 at 10:28 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
>> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>>>
>>>> The only way I can see to trigger the race is with sigreturn, but it's
>>>> still there.  Sigh.
>>>
>>> I don't see why sigreturn needs to be involved... all you need is
>>> modify_ldt() on one CPU while the other is in the middle of an IRET
>>> return.  Small window, so hard to hit, but still.
>>
>> If you set the flag as soon as anyone calls modify_ldt, before any
>> descriptor is installed, then I don't think this can happen.  But
>> there's still sigreturn, and I don't think this is worth all the
>> complexity to save a single branch on #GP.
>>
>
> Who cares?  Since we only need to enter the fixup path for LDT
> selectors, anything that is dependent on having called modify_ldt() is
> already redundant.

But you still have to test this, and folding it into the existing
check for thread flags would eliminate that.  Still, I think this
would not be worth it, even if it were correct.

>
> In some ways that is the saving grace.  SS being an LDT selector is
> fortunately a rare case.
>
>> I do mean intra-kernel.  And yes, this has nothing to do with espfix,
>> but it would make write_msr_safe fail more quickly :)
>
> And, pray tell, how important is that?

Not very.

Page faults may be a different story for some workloads, particularly
if they are IO-heavy.  Returning to preempted kernel threads may also
matter.

For my particular workload, returns from rescheduling interrupts
delivered to idle cpus probably also matters, but the fact that those
interrupts are happening at all is a bug that tglx is working on.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
                     ` (2 preceding siblings ...)
  2014-04-23  1:17   ` H. Peter Anvin
@ 2014-04-24  4:13   ` comex
  2014-04-24  4:53     ` Andrew Lutomirski
  2014-04-25 12:02   ` Pavel Machek
  4 siblings, 1 reply; 136+ messages in thread
From: comex @ 2014-04-24  4:13 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Andy Lutomirski,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Borislav Petkov,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.

Hi,

A comment: The main purpose of espfix is to prevent attackers from
learning sensitive addresses, right?  But as far as I can tell, this
mini-stack becomes itself somewhat sensitive:

- The user can put arbitrary data in registers before returning to the
LDT in order to get it saved at a known address accessible from the
kernel.  With SMAP and KASLR this might otherwise be difficult.
- If the iret faults, kernel addresses will get stored there (and not
cleared).  If a vulnerability could return data from an arbitrary
specified address to the user, this would be harmful.

I guess with the current KASLR implementation you could get the same
effects via brute force anyway, by filling up and browsing memory,
respectively, but ideally there wouldn't be any virtual addresses
guaranteed not to fault.

- If a vulnerability allowed overwriting data at an arbitrary
specified address, the exception frame could get overwritten at
exactly the right moment between the copy and iret (or right after the
iret to mess up fixup_exception)?  You probably know better than I
whether or not caches prevent this from actually being possible.

Just raising the issue.

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-24  4:13   ` comex
@ 2014-04-24  4:53     ` Andrew Lutomirski
  2014-04-24 22:24       ` H. Peter Anvin
  2014-04-28 23:05       ` H. Peter Anvin
  0 siblings, 2 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-24  4:53 UTC (permalink / raw)
  To: comex
  Cc: H. Peter Anvin, Linux Kernel Mailing List, H. Peter Anvin,
	Linus Torvalds, Ingo Molnar, Alexander van Heukelum,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Borislav Petkov,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On Wed, Apr 23, 2014 at 9:13 PM, comex <comexk@gmail.com> wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
>> This is a prototype of espfix for the 64-bit kernel.  espfix is a
>> workaround for the architectural definition of IRET, which fails to
>> restore bits [31:16] of %esp when returning to a 16-bit stack
>> segment.  We have a workaround for the 32-bit kernel, but that
>> implementation doesn't work for 64 bits.
>
> Hi,
>
> A comment: The main purpose of espfix is to prevent attackers from
> learning sensitive addresses, right?  But as far as I can tell, this
> mini-stack becomes itself somewhat sensitive:
>
> - The user can put arbitrary data in registers before returning to the
> LDT in order to get it saved at a known address accessible from the
> kernel.  With SMAP and KASLR this might otherwise be difficult.

For one thing, this only matters on Haswell.  Otherwise the user can
put arbitrary data in userspace.

On Haswell, the HPET fixmap is currently a much simpler vector that
can do much the same thing, as long as you're willing to wait for the
HPET counter to contain some particular value.  I have patches that
will fix that as a side effect.

Would it pay to randomize the location of the espfix area?  Another
somewhat silly idea is to add some random offset to the CPU number mod
NR_CPUS so that an attacker won't know which ministack is which.

> - If the iret faults, kernel addresses will get stored there (and not
> cleared).  If a vulnerability could return data from an arbitrary
> specified address to the user, this would be harmful.

Can this be fixed by clearing the ministack in bad_iret?  There will
still be a window in which the kernel address is in there, but it'll
be short.

>
> I guess with the current KASLR implementation you could get the same
> effects via brute force anyway, by filling up and browsing memory,
> respectively, but ideally there wouldn't be any virtual addresses
> guaranteed not to fault.
>
> - If a vulnerability allowed overwriting data at an arbitrary
> specified address, the exception frame could get overwritten at
> exactly the right moment between the copy and iret (or right after the
> iret to mess up fixup_exception)?  You probably know better than I
> whether or not caches prevent this from actually being possible.

To attack this, you'd change the saved CS value.  I don't think caches
would make a difference.

This particular vector hurts: you can safely keep trying until it works.

This just gave me an evil idea: what if we make the whole espfix area
read-only?  This has some weird effects.  To switch to the espfix
stack, you have to write to an alias.  That's a little strange but
harmless and barely complicates the implementation.  If the iret
faults, though, I think the result will be a #DF.  This may actually
be a good thing: if the #DF handler detects that the cause was a bad
espfix iret, it could just return directly to bad_iret or send the
signal itself the same way that do_stack_segment does.  This could
even be written in C :)

Peter, is this idea completely nuts?  The only exceptions that can
happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
so they won't double-fault.

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-24  4:53     ` Andrew Lutomirski
@ 2014-04-24 22:24       ` H. Peter Anvin
  2014-04-24 22:31         ` Andrew Lutomirski
  2014-04-28 23:05       ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-24 22:24 UTC (permalink / raw)
  To: Andrew Lutomirski, comex
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>
>> - The user can put arbitrary data in registers before returning to the
>> LDT in order to get it saved at a known address accessible from the
>> kernel.  With SMAP and KASLR this might otherwise be difficult.
> 
> For one thing, this only matters on Haswell.  Otherwise the user can
> put arbitrary data in userspace.
> 
> On Haswell, the HPET fixmap is currently a much simpler vector that
> can do much the same thing, as long as you're willing to wait for the
> HPET counter to contain some particular value.  I have patches that
> will fix that as a side effect.
> 
> Would it pay to randomize the location of the espfix area?  Another
> somewhat silly idea is to add some random offset to the CPU number mod
> NR_CPUS so that an attacker won't know which ministack is which.

Since we store the espfix stack location explicitly, as long as the
scrambling happens in the initialization code that's fine.  However, we
don't want to reduce locality lest we massively blow up the memory
requirements.

We could XOR with a random constant with no penalty at all.  Only
problem is that this happens early, so the entropy system is not yet
available.  Fine if we have RDRAND, but...

>> - If the iret faults, kernel addresses will get stored there (and not
>> cleared).  If a vulnerability could return data from an arbitrary
>> specified address to the user, this would be harmful.
> 
> Can this be fixed by clearing the ministack in bad_iret?  There will
> still be a window in which the kernel address is in there, but it'll
> be short.

We could, if anyone thinks this is actually beneficial.

I'm trying to dig into some of the deeper semantics of IRET to figure
out another issue (a much bigger potential problem), this would affect
that as well.  My current belief is that we don't actually have a
problem here.

>> - If a vulnerability allowed overwriting data at an arbitrary
>> specified address, the exception frame could get overwritten at
>> exactly the right moment between the copy and iret (or right after the
>> iret to mess up fixup_exception)?  You probably know better than I
>> whether or not caches prevent this from actually being possible.
> 
> To attack this, you'd change the saved CS value.  I don't think caches
> would make a difference.
> 
> This particular vector hurts: you can safely keep trying until it works.
> 
> This just gave me an evil idea: what if we make the whole espfix area
> read-only?  This has some weird effects.  To switch to the espfix
> stack, you have to write to an alias.  That's a little strange but
> harmless and barely complicates the implementation.  If the iret
> faults, though, I think the result will be a #DF.  This may actually
> be a good thing: if the #DF handler detects that the cause was a bad
> espfix iret, it could just return directly to bad_iret or send the
> signal itself the same way that do_stack_segment does.  This could
> even be written in C :)
>
> Peter, is this idea completely nuts?  The only exceptions that can
> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
> so they won't double-fault.

It is completely nuts, but sometimes completely nuts is actually useful.
 It is more complexity, to be sure, but it doesn't seem completely out
of the realm of reason, and avoids having to unwind the ministack except
in the normally-fatal #DF handler.  #DFs are documented as not
recoverable, but we might be able to do something here.

The only real disadvantage I see is the need for more bookkeeping
metadata.  Basically the bitmask in espfix_64.c now needs to turn into
an array, plus we need a second percpu variable.  Given that if
CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

	-hpa


^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-24 22:24       ` H. Peter Anvin
@ 2014-04-24 22:31         ` Andrew Lutomirski
  2014-04-24 22:37           ` H. Peter Anvin
  0 siblings, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-24 22:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: comex, Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Thu, Apr 24, 2014 at 3:24 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
> On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>>
>>> - The user can put arbitrary data in registers before returning to the
>>> LDT in order to get it saved at a known address accessible from the
>>> kernel.  With SMAP and KASLR this might otherwise be difficult.
>>
>> For one thing, this only matters on Haswell.  Otherwise the user can
>> put arbitrary data in userspace.
>>
>> On Haswell, the HPET fixmap is currently a much simpler vector that
>> can do much the same thing, as long as you're willing to wait for the
>> HPET counter to contain some particular value.  I have patches that
>> will fix that as a side effect.
>>
>> Would it pay to randomize the location of the espfix area?  Another
>> somewhat silly idea is to add some random offset to the CPU number mod
>> NR_CPUS so that an attacker won't know which ministack is which.
>
> Since we store the espfix stack location explicitly, as long as the
> scrambling happens in the initialization code that's fine.  However, we
> don't want to reduce locality lest we massively blow up the memory
> requirements.

I was imagining just randomizing a couple of high bits so the whole
espfix area moves as a unit.

>
> We could XOR with a random constant with no penalty at all.  Only
> problem is that this happens early, so the entropy system is not yet
> available.  Fine if we have RDRAND, but...

How many people have SMAP and not RDRAND?  I think this is a complete
nonissue for non-SMAP systems.

>> Peter, is this idea completely nuts?  The only exceptions that can
>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>> so they won't double-fault.
>
> It is completely nuts, but sometimes completely nuts is actually useful.
>  It is more complexity, to be sure, but it doesn't seem completely out
> of the realm of reason, and avoids having to unwind the ministack except
> in the normally-fatal #DF handler.  #DFs are documented as not
> recoverable, but we might be able to do something here.
>
> The only real disadvantage I see is the need for more bookkeeping
> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
> an array, plus we need a second percpu variable.  Given that if
> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

Doing something in #DF needs percpu data?  What am I missing?

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-24 22:31         ` Andrew Lutomirski
@ 2014-04-24 22:37           ` H. Peter Anvin
  2014-04-24 22:43             ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-24 22:37 UTC (permalink / raw)
  To: Andrew Lutomirski, H. Peter Anvin
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
> 
> I was imagining just randomizing a couple of high bits so the whole
> espfix area moves as a unit.
> 
>> We could XOR with a random constant with no penalty at all.  Only
>> problem is that this happens early, so the entropy system is not yet
>> available.  Fine if we have RDRAND, but...
> 
> How many people have SMAP and not RDRAND?  I think this is a complete
> nonissue for non-SMAP systems.
> 

Most likely none, unless some "clever" virtualizer turns off RDRAND out
of spite.

>>> Peter, is this idea completely nuts?  The only exceptions that can
>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>> so they won't double-fault.
>>
>> It is completely nuts, but sometimes completely nuts is actually useful.
>>  It is more complexity, to be sure, but it doesn't seem completely out
>> of the realm of reason, and avoids having to unwind the ministack except
>> in the normally-fatal #DF handler.  #DFs are documented as not
>> recoverable, but we might be able to do something here.
>>
>> The only real disadvantage I see is the need for more bookkeeping
>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>> an array, plus we need a second percpu variable.  Given that if
>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
> 
> Doing something in #DF needs percpu data?  What am I missing?

You need the second percpu variable in the espfix setup code so you have
both the write address and the target rsp (read address).

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-24 22:37           ` H. Peter Anvin
@ 2014-04-24 22:43             ` Andrew Lutomirski
  0 siblings, 0 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-24 22:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, comex, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Thu, Apr 24, 2014 at 3:37 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
>>
>> I was imagining just randomizing a couple of high bits so the whole
>> espfix area moves as a unit.
>>
>>> We could XOR with a random constant with no penalty at all.  Only
>>> problem is that this happens early, so the entropy system is not yet
>>> available.  Fine if we have RDRAND, but...
>>
>> How many people have SMAP and not RDRAND?  I think this is a complete
>> nonissue for non-SMAP systems.
>>
>
> Most likely none, unless some "clever" virtualizer turns off RDRAND out
> of spite.
>
>>>> Peter, is this idea completely nuts?  The only exceptions that can
>>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>>> so they won't double-fault.
>>>
>>> It is completely nuts, but sometimes completely nuts is actually useful.
>>>  It is more complexity, to be sure, but it doesn't seem completely out
>>> of the realm of reason, and avoids having to unwind the ministack except
>>> in the normally-fatal #DF handler.  #DFs are documented as not
>>> recoverable, but we might be able to do something here.
>>>
>>> The only real disadvantage I see is the need for more bookkeeping
>>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>>> an array, plus we need a second percpu variable.  Given that if
>>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
>>
>> Doing something in #DF needs percpu data?  What am I missing?
>
> You need the second percpu variable in the espfix setup code so you have
> both the write address and the target rsp (read address).
>

Duh. :)

--Andy

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-21 22:47 ` [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE* H. Peter Anvin
                     ` (3 preceding siblings ...)
  2014-04-24  4:13   ` comex
@ 2014-04-25 12:02   ` Pavel Machek
  2014-04-25 21:20     ` H. Peter Anvin
  4 siblings, 1 reply; 136+ messages in thread
From: Pavel Machek @ 2014-04-25 12:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Andy Lutomirski,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Borislav Petkov,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

Hi!

> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.

Just to understand the consequences -- we leak 16 bit of kernel data
to the userspace, right? Because it is %esp, we know that we leak
stack address, which is not too sensitive, but will make kernel
address randomization less useful...?

> The 64-bit implementation works like this:
> 
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables.  This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)

16-bit stack segments on 64-bit machine. Who still uses it? Dosemu?
Wine? Would the solution be to disallow that?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23  1:17   ` H. Peter Anvin
  2014-04-23  1:23     ` Andrew Lutomirski
@ 2014-04-25 21:02     ` Konrad Rzeszutek Wilk
  2014-04-25 21:16       ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-25 21:02 UTC (permalink / raw)
  To: H. Peter Anvin, boris.ostrovsky
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Andy Lutomirski,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Tue, Apr 22, 2014 at 06:17:21PM -0700, H. Peter Anvin wrote:
> Another spin of the prototype.  This one avoids the espfix for anything
> but #GP, and avoids save/restore/saving registers... one can wonder,
> though, how much that actually matters in practice.
> 
> It still does redundant SWAPGS on the slow path.  I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths.  Some of that
> restructuring might actually be beneficial, but still...

Sorry about being late to the party.


 .. snip..
> diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
> new file mode 100644
> index 000000000000..05567d706f92
> --- /dev/null
> +++ b/arch/x86/kernel/espfix_64.c
> @@ -0,0 +1,136 @@
> +/* ----------------------------------------------------------------------- *
> + *
> + *   Copyright 2014 Intel Corporation; author: H. Peter Anvin
> + *
> + *   This file is part of the Linux kernel, and is made available under
> + *   the terms of the GNU General Public License version 2 or (at your
> + *   option) any later version; incorporated herein by reference.
> + *
> + * ----------------------------------------------------------------------- */
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/percpu.h>
> +#include <linux/gfp.h>
> +#include <asm/pgtable.h>
> +
> +#define ESPFIX_STACK_SIZE	64UL
> +#define ESPFIX_STACKS_PER_PAGE	(PAGE_SIZE/ESPFIX_STACK_SIZE)
> +
> +#define ESPFIX_MAX_CPUS (ESPFIX_STACKS_PER_PAGE << (PGDIR_SHIFT-PAGE_SHIFT-16))
> +#if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
> +# error "Need more than one PGD for the ESPFIX hack"
> +#endif
> +
> +#define ESPFIX_BASE_ADDR	(-2UL << PGDIR_SHIFT)
> +
> +#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
> +
> +/* This contains the *bottom* address of the espfix stack */
> +DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
> +
> +/* Initialization mutex - should this be a spinlock? */
> +static DEFINE_MUTEX(espfix_init_mutex);
> +
> +/* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
> +#define ESPFIX_MAX_PAGES  DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
> +#define ESPFIX_MAP_SIZE   DIV_ROUND_UP(ESPFIX_MAX_PAGES, BITS_PER_LONG)
> +static unsigned long espfix_page_alloc_map[ESPFIX_MAP_SIZE];
> +
> +static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
> +	__aligned(PAGE_SIZE);
> +
> +/*
> + * This returns the bottom address of the espfix stack for a specific CPU.
> + * The math allows for a non-power-of-two ESPFIX_STACK_SIZE, in which case
> + * we have to account for some amount of padding at the end of each page.
> + */
> +static inline unsigned long espfix_base_addr(unsigned int cpu)
> +{
> +	unsigned long page, addr;
> +
> +	page = (cpu / ESPFIX_STACKS_PER_PAGE) << PAGE_SHIFT;
> +	addr = page + (cpu % ESPFIX_STACKS_PER_PAGE) * ESPFIX_STACK_SIZE;
> +	addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
> +	addr += ESPFIX_BASE_ADDR;
> +	return addr;
> +}
> +
> +#define PTE_STRIDE        (65536/PAGE_SIZE)
> +#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
> +#define ESPFIX_PMD_CLONES PTRS_PER_PMD
> +#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
> +
> +void init_espfix_this_cpu(void)
> +{
> +	unsigned int cpu, page;
> +	unsigned long addr;
> +	pgd_t pgd, *pgd_p;
> +	pud_t pud, *pud_p;
> +	pmd_t pmd, *pmd_p;
> +	pte_t pte, *pte_p;
> +	int n;
> +	void *stack_page;
> +	pteval_t ptemask;
> +
> +	/* We only have to do this once... */
> +	if (likely(this_cpu_read(espfix_stack)))
> +		return;		/* Already initialized */
> +
> +	cpu = smp_processor_id();
> +	addr = espfix_base_addr(cpu);
> +	page = cpu/ESPFIX_STACKS_PER_PAGE;
> +
> +	/* Did another CPU already set this up? */
> +	if (likely(test_bit(page, espfix_page_alloc_map)))
> +		goto done;
> +
> +	mutex_lock(&espfix_init_mutex);
> +
> +	/* Did we race on the lock? */
> +	if (unlikely(test_bit(page, espfix_page_alloc_map)))
> +		goto unlock_done;
> +
> +	ptemask = __supported_pte_mask;
> +
> +	pgd_p = &init_level4_pgt[pgd_index(addr)];
> +	pgd = *pgd_p;
> +	if (!pgd_present(pgd)) {
> +		/* This can only happen on the BSP */
> +		pgd = __pgd(__pa_symbol(espfix_pud_page) |

Any particular reason you are using __pgd

> +			    (_KERNPG_TABLE & ptemask));
> +		set_pgd(pgd_p, pgd);
> +	}
> +
> +	pud_p = &espfix_pud_page[pud_index(addr)];
> +	pud = *pud_p;
> +	if (!pud_present(pud)) {
> +		pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
> +		pud = __pud(__pa(pmd_p) | (_KERNPG_TABLE & ptemask));

__pud
> +		for (n = 0; n < ESPFIX_PUD_CLONES; n++)
> +			set_pud(&pud_p[n], pud);
> +	}
> +
> +	pmd_p = pmd_offset(&pud, addr);
> +	pmd = *pmd_p;
> +	if (!pmd_present(pmd)) {
> +		pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
> +		pmd = __pmd(__pa(pte_p) | (_KERNPG_TABLE & ptemask));

and __pmd?
> +		for (n = 0; n < ESPFIX_PMD_CLONES; n++)
> +			set_pmd(&pmd_p[n], pmd);
> +	}
> +
> +	pte_p = pte_offset_kernel(&pmd, addr);
> +	stack_page = (void *)__get_free_page(GFP_KERNEL);
> +	pte = __pte(__pa(stack_page) | (__PAGE_KERNEL & ptemask));

and __pte instead of the 'pgd', 'pud', 'pmd' and 'pte' macros?

> +	for (n = 0; n < ESPFIX_PTE_CLONES; n++)
> +		set_pte(&pte_p[n*PTE_STRIDE], pte);
> +
> +	/* Job is done for this CPU and any CPU which shares this page */
> +	set_bit(page, espfix_page_alloc_map);
> +
> +unlock_done:
> +	mutex_unlock(&espfix_init_mutex);
> +done:
> +	this_cpu_write(espfix_stack, addr);
> +}

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-25 21:02     ` Konrad Rzeszutek Wilk
@ 2014-04-25 21:16       ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-25 21:16 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, H. Peter Anvin, boris.ostrovsky
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Andy Lutomirski, Borislav Petkov,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/25/2014 02:02 PM, Konrad Rzeszutek Wilk wrote:
> 
> Any particular reason you are using __pgd
> 
> __pud
> 
> and __pmd?
> 
> and __pte instead of the 'pgd', 'pud', 'pmd' and 'pte' macros?
> 

Not that I know of other than that the semantics of the various macros
are not described anywhere to the best of my knowledge.

	-hpa



^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-25 12:02   ` Pavel Machek
@ 2014-04-25 21:20     ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-25 21:20 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linux Kernel Mailing List, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Andy Lutomirski,
	Konrad Rzeszutek Wilk, Boris Ostrovsky, Borislav Petkov,
	Arjan van de Ven, Brian Gerst, Alexandre Julliard, Andi Kleen,
	Thomas Gleixner

On 04/25/2014 05:02 AM, Pavel Machek wrote:
> 
> Just to understand the consequences -- we leak 16 bit of kernel data
> to the userspace, right? Because it is %esp, we know that we leak
> stack address, which is not too sensitive, but will make kernel
> address randomization less useful...?
> 

It is rather sensitive, in fact.

>> The 64-bit implementation works like this:
>>
>> Set up a ministack for each CPU, which is then mapped 65536 times
>> using the page tables.  This implementation uses the second-to-last
>> PGD slot for this; with a 64-byte espfix stack this is sufficient for
>> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
> 
> 16-bit stack segments on 64-bit machine. Who still uses it? Dosemu?
> Wine? Would the solution be to disallow that?

Welcome to the show.  We do, in fact, disallow it now in the 3.15-rc
series.  The Wine guys are complaining.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-23 16:56           ` H. Peter Anvin
@ 2014-04-28 13:04             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 136+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-04-28 13:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Boris Ostrovsky, Andrew Lutomirski, H. Peter Anvin,
	Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Borislav Petkov, Arjan van de Ven,
	Brian Gerst, Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Wed, Apr 23, 2014 at 09:56:00AM -0700, H. Peter Anvin wrote:
> On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
> >>
> >> Konrad - I really could use some help figuring out what needs to be done
> >> for this not to break Xen.
> > 
> > This does break Xen PV:
> > 
> 
> I know it does.  This is why I asked for help.

This week is chaotic for me, but I am taking a stab at it.  I should have
something by the end of the week on top of your patch.

> 
> This is fundamentally the problem with PV and *especially* the way Xen
> PV was integrated into Linux: it acts as a development brake for native
> hardware.  Fortunately, Konrad has been quite responsive to that kind of
> problem, which hasn't always been true of the Xen community in the past.

Thank you for such kind words!

I hope that in Chicago you will have a chance to meet other folks who
are involved in Xen and form a similar opinion of them.

Cheers!

> 
> 	-hpa
> 

^ permalink raw reply	[flat|nested] 136+ messages in thread

* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-24  4:53     ` Andrew Lutomirski
  2014-04-24 22:24       ` H. Peter Anvin
@ 2014-04-28 23:05       ` H. Peter Anvin
  2014-04-28 23:08         ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-28 23:05 UTC (permalink / raw)
  To: Andrew Lutomirski, comex
  Cc: H. Peter Anvin, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
> 
> This particular vector hurts: you can safely keep trying until it works.
> 
> This just gave me an evil idea: what if we make the whole espfix area
> read-only?  This has some weird effects.  To switch to the espfix
> stack, you have to write to an alias.  That's a little strange but
> harmless and barely complicates the implementation.  If the iret
> faults, though, I think the result will be a #DF.  This may actually
> be a good thing: if the #DF handler detects that the cause was a bad
> espfix iret, it could just return directly to bad_iret or send the
> signal itself the same way that do_stack_segment does.  This could
> even be written in C :)
> 
> Peter, is this idea completely nuts?  The only exceptions that can
> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
> so they won't double-fault.
> 

So I tried writing this bit up, but it fails in some rather spectacular
ways.  Furthermore, I have been unable to debug it under Qemu, because
breakpoints don't work right (common Qemu problem, sadly.)

The kernel code is at:

https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

There are two tests:

git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
./run16 test/hello.elf
http://www.zytor.com/~hpa/ldttest.c

The former will exercise the irq_return_ldt path, but not the fault
path; the latter will exercise the fault path, but doesn't actually use
a 16-bit segment.

Under the 3.14 stock kernel, the former should die with SIGBUS and the
latter should pass.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-28 23:05       ` H. Peter Anvin
@ 2014-04-28 23:08         ` H. Peter Anvin
  2014-04-29  0:02           ` Andrew Lutomirski
  0 siblings, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-28 23:08 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Lutomirski, comex
  Cc: Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
> 
> So I tried writing this bit up, but it fails in some rather spectacular
> ways.  Furthermore, I have been unable to debug it under Qemu, because
> breakpoints don't work right (common Qemu problem, sadly.)
> 
> The kernel code is at:
> 
> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
> 
> There are two tests:
> 
> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
> ./run16 test/hello.elf
> http://www.zytor.com/~hpa/ldttest.c
> 
> The former will exercise the irq_return_ldt path, but not the fault
> path; the latter will exercise the fault path, but doesn't actually use
> a 16-bit segment.
> 
> Under the 3.14 stock kernel, the former should die with SIGBUS and the
> latter should pass.
> 

Current status of the above code: if I remove the randomization in
espfix_64.c then the first test passes; the second generally crashes the
machine.  With the randomization there, both generally crash the machine.

All my testing so far has been under KVM or Qemu, so there is always the
possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
something simpler than that.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-28 23:08         ` H. Peter Anvin
@ 2014-04-29  0:02           ` Andrew Lutomirski
  2014-04-29  0:15             ` H. Peter Anvin
  2014-04-29  0:20             ` Andrew Lutomirski
  0 siblings, 2 replies; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-29  0:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, comex, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>
>> So I tried writing this bit up, but it fails in some rather spectacular
>> ways.  Furthermore, I have been unable to debug it under Qemu, because
>> breakpoints don't work right (common Qemu problem, sadly.)
>>
>> The kernel code is at:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>
>> There are two tests:
>>
>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>> ./run16 test/hello.elf
>> http://www.zytor.com/~hpa/ldttest.c
>>
>> The former will exercise the irq_return_ldt path, but not the fault
>> path; the latter will exercise the fault path, but doesn't actually use
>> a 16-bit segment.
>>
>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>> latter should pass.
>>
>
> Current status of the above code: if I remove the randomization in
> espfix_64.c then the first test passes; the second generally crashes the
> machine.  With the randomization there, both generally crash the machine.
>
> All my testing so far has been under KVM or Qemu, so there is always the
> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
> something simpler than that.

I'm compiling your branch.  In the mean time, two possibly stupid questions:

What's the assembly code in the double-fault entry for?

Have you tried hbreak in qemu?  I've had better luck with hbreak than
regular break in the past.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  0:02           ` Andrew Lutomirski
@ 2014-04-29  0:15             ` H. Peter Anvin
  2014-04-29  0:20             ` Andrew Lutomirski
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  0:15 UTC (permalink / raw)
  To: Andrew Lutomirski, H. Peter Anvin
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/28/2014 05:02 PM, Andrew Lutomirski wrote:
> 
> I'm compiling your branch.  In the mean time, two possibly stupid questions:
> 
> What's the assembly code in the double-fault entry for?
> 

It was easier for me to add it there than to add all the glue
(prototypes and so on) needed to put it into C code... I can convert it
to C when it works.

> Have you tried hbreak in qemu?  I've had better luck with hbreak than
> regular break in the past.

Yes, no real change.

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  0:02           ` Andrew Lutomirski
  2014-04-29  0:15             ` H. Peter Anvin
@ 2014-04-29  0:20             ` Andrew Lutomirski
  2014-04-29  2:38               ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Andrew Lutomirski @ 2014-04-29  0:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: H. Peter Anvin, comex, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On Mon, Apr 28, 2014 at 5:02 PM, Andrew Lutomirski <amluto@gmail.com> wrote:
> On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin <hpa@linux.intel.com> wrote:
>> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>>
>>> So I tried writing this bit up, but it fails in some rather spectacular
>>> ways.  Furthermore, I have been unable to debug it under Qemu, because
>>> breakpoints don't work right (common Qemu problem, sadly.)
>>>
>>> The kernel code is at:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>>
>>> There are two tests:
>>>
>>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>>> ./run16 test/hello.elf
>>> http://www.zytor.com/~hpa/ldttest.c
>>>
>>> The former will exercise the irq_return_ldt path, but not the fault
>>> path; the latter will exercise the fault path, but doesn't actually use
>>> a 16-bit segment.
>>>
>>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>>> latter should pass.
>>>
>>
>> Current status of the above code: if I remove the randomization in
>> espfix_64.c then the first test passes; the second generally crashes the
>> machine.  With the randomization there, both generally crash the machine.
>>
>> All my testing so far has been under KVM or Qemu, so there is always the
>> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
>> something simpler than that.
>
> I'm compiling your branch.  In the mean time, two possibly stupid questions:
>
> What's the assembly code in the double-fault entry for?
>
> Have you tried hbreak in qemu?  I've had better luck with hbreak than
> regular break in the past.
>

ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
on your branch.  It even said this:

qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
found during reset

I have no idea what an uncluncked fd is :)

hello.elf fails to SIGBUS.  Weird.  gdb says:

1: x/i $pc
=> 0xffffffff8170559c <irq_return_ldt+90>:
    jmp    0xffffffff81705537 <irq_return_iret>
(gdb) si
<signal handler called>
1: x/i $pc
=> 0xffffffff81705537 <irq_return_iret>:    iretq
(gdb) si
Cannot access memory at address 0xf0000000f
(gdb) info registers
rax            0xffe4000f00001000    -7881234923384832
rbx            0x1000000010    68719476752
rcx            0xffe4f5580000f000    -7611541041909760
rdx            0x805d000    134598656
rsi            0x102170000ffe3    283772784279523
rdi            0xf00000007    64424509447
rbp            0xf0000000f    0xf0000000f
rsp            0xf0000000f    0xf0000000f
r8             0x0    0
r9             0x0    0
r10            0x0    0
r11            0x0    0
r12            0x0    0
r13            0x0    0
r14            0x0    0
r15            0x0    0
rip            0x0    0x0 <irq_stack_union>
eflags         0x0    [ ]
cs             0x0    0
ss             0x37f    895
ds             0x0    0
es             0x0    0
fs             0x0    0
---Type <return> to continue, or q <return> to quit---
gs             0x0    0

I got this with 'hbreak irq_return_ldt' using 'target remote :1234'
and virtme-run --console --kimg
~/apps/linux-devel/arch/x86/boot/bzImage --qemu-opts -s

This set of registers looks thoroughly bogus.  I don't trust it.  I'm
now stuck -- single-stepping stays exactly where it started.
Something is rather screwed up here.  Telling gdb to continue causes
gdb to explode and 'Hello, Afterworld!' to be displayed.

I was not able to get a breakpoint on __do_double_fault to hit.

FWIW, I think that gdb is known to have issues debugging a guest that
switches bitness.

--Andy


* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  0:20             ` Andrew Lutomirski
@ 2014-04-29  2:38               ` H. Peter Anvin
  2014-04-29  2:44                 ` H. Peter Anvin
  2014-04-29  3:45                 ` H. Peter Anvin
  0 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  2:38 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: H. Peter Anvin, comex, Linux Kernel Mailing List, Linus Torvalds,
	Ingo Molnar, Alexander van Heukelum, Konrad Rzeszutek Wilk,
	Boris Ostrovsky, Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
> 
> ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
> on your branch.  It even said this:
> 
> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
> found during reset
> 
> I have no idea what an uncluncked fd is :)
> 

It means 9p wasn't properly shut down.

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  2:38               ` H. Peter Anvin
@ 2014-04-29  2:44                 ` H. Peter Anvin
  2014-04-29  3:45                 ` H. Peter Anvin
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  2:44 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Lutomirski
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
> On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>>
>> ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
>> on your branch.  It even said this:
>>
>> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
>> found during reset
>>
>> I have no idea what an uncluncked fd is :)
>>
> 
> It means 9p wasn't properly shut down.
> 

(A "fid" is like the 9p version of a file descriptor.  Sort of.)

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  2:38               ` H. Peter Anvin
  2014-04-29  2:44                 ` H. Peter Anvin
@ 2014-04-29  3:45                 ` H. Peter Anvin
  2014-04-29  3:47                   ` H. Peter Anvin
  2014-04-29  4:36                   ` H. Peter Anvin
  1 sibling, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  3:45 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Lutomirski
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
> On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>>
>> ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
>> on your branch.  It even said this:
>>
>> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
>> found during reset
>>
>> I have no idea what an uncluncked fd is :)
>>
> 
> It means 9p wasn't properly shut down.
> 

OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
it never sets SS to an LDT segment.  This means that it should really
have zero footprint versus the espfix code, and implies that we instead
have another bug involved.  Why the espfix code should have any effect
whatsoever is a mystery, however... if it indeed does?

I have uploaded a fixed ldttest.c, but it seems we might be chasing more
than that...

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  3:45                 ` H. Peter Anvin
@ 2014-04-29  3:47                   ` H. Peter Anvin
  2014-04-29  4:36                   ` H. Peter Anvin
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  3:47 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Lutomirski
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner

On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
> 
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment.  This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved.  Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
> 
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
> 

In particular, I was already wondering how we avoid an "upside down
swapgs" with a #GP on IRET.  The answer might be that we don't...

	-hpa




* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  3:45                 ` H. Peter Anvin
  2014-04-29  3:47                   ` H. Peter Anvin
@ 2014-04-29  4:36                   ` H. Peter Anvin
  2014-04-29  7:14                     ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  4:36 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Lutomirski
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner, Steven Rostedt

On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
> 
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment.  This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved.  Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
> 
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
> 

With the test fixed, the bug was easy to find: we can't compare against
__KERNEL_DS in the doublefault handler, because both SS and the image on
the stack have the stack segment set to zero (NULL).

With that both ldttest and run16 pass with the doublefault code, even
with randomization turned back on.

I have pushed out the fix.

There are still things that need fixing: we need to go through the
espfix path even when returning from NMI/MC (which fortunately can't
nest with taking an NMI/MC on the espfix path itself, since in that case
we will have been interrupted while running in the kernel with a kernel
stack.)

(Cc: Rostedt because of the NMI issue.)

	-hpa



* Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*
  2014-04-29  4:36                   ` H. Peter Anvin
@ 2014-04-29  7:14                     ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-04-29  7:14 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Lutomirski
  Cc: comex, Linux Kernel Mailing List, Linus Torvalds, Ingo Molnar,
	Alexander van Heukelum, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	Borislav Petkov, Arjan van de Ven, Brian Gerst,
	Alexandre Julliard, Andi Kleen, Thomas Gleixner, Steven Rostedt

On 04/28/2014 09:36 PM, H. Peter Anvin wrote:
> 
> There are still things that need fixing: we need to go through the
> espfix path even when returning from NMI/MC (which fortunately can't
> nest with taking an NMI/MC on the espfix path itself, since in that case
> we will have been interrupted while running in the kernel with a kernel
> stack.)
> 
> (Cc: Rostedt because of the NMI issue.)
> 

NMI is fine: we go through irq_return except for nested NMI.  There are
only three IRETs in the kernel (irq_return, nested_nmi_out, and the
early trap handler) and all of them are good.

I think we just need to clean up the PV aspects of this and then we
should be in good shape.  I have done a bunch of cleanups to the
development git tree.

I'm considering making 16-bit segment support an EXPERT config option for
both 32- and 64-bit kernels, as it seems like a bit of a waste for
embedded systems which don't need this kind of backward compatibility.
Maybe that is something that can be left for someone else to implement
if they feel like it.

	-hpa



* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-04-14  7:48         ` Alexandre Julliard
@ 2014-05-07  9:18           ` Sven Joachim
  2014-05-07 10:18             ` Borislav Petkov
  2014-05-07 16:57             ` Linus Torvalds
  0 siblings, 2 replies; 136+ messages in thread
From: Sven Joachim @ 2014-05-07  9:18 UTC (permalink / raw)
  To: Alexandre Julliard
  Cc: Linus Torvalds, Brian Gerst, Ingo Molnar, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On 2014-04-14 09:48 +0200, Alexandre Julliard wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> On Fri, Apr 11, 2014 at 11:45 AM, Brian Gerst <brgerst@gmail.com> wrote:
>>>
>>> I haven't tested it recently but I do know it has worked on 64-bit
>>> kernels.  There is no reason for it not to, the only thing not
>>> supported in long mode is vm86.  16-bit protected mode is unchanged.
>>
>> Afaik 64-bit windows doesn't support 16-bit binaries, so I just
>> assumed Wine wouldn't do it either on x86-64. Not for any real
>> technical reasons, though.
>>
>> HOWEVER. I'd like to hear something more definitive than "I haven't
>> tested recently". The "we don't break user space" is about having
>> actual real *users*, not about test programs.
>>
>> Are there people actually using 16-bit old windows programs under
>> wine? That's what matters.

It seems that at least some 32-bit programs are also broken, since after
upgrading the kernel to 3.14.3 I can no longer start my old chess
database program:

,----
| % file CB70.exe 
| CB70.exe: PE32 executable (GUI) Intel 80386, for MS Windows
| % LANG=C wine CB70.exe
| modify_ldt: Invalid argument
| modify_ldt: Invalid argument
| modify_ldt: Invalid argument
| modify_ldt: Invalid argument
| modify_ldt: Invalid argument
`----

And here it just hangs, with wineboot.exe taking 100% CPU.  I had to
first kill wineboot.exe and then CB70.exe. :-(

> Yes, there is still a significant number of users, and we still
> regularly get bug reports about specific 16-bit apps. It would be really
> nice if we could continue to support them on x86-64, particularly since
> Microsoft doesn't ;-)

I would rather not set up a virtual machine just for wine (I don't have
Windows anymore).

Cheers,
       Sven


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-07  9:18           ` Sven Joachim
@ 2014-05-07 10:18             ` Borislav Petkov
  2014-05-07 16:57             ` Linus Torvalds
  1 sibling, 0 replies; 136+ messages in thread
From: Borislav Petkov @ 2014-05-07 10:18 UTC (permalink / raw)
  To: Sven Joachim
  Cc: Alexandre Julliard, Linus Torvalds, Brian Gerst, Ingo Molnar,
	H. Peter Anvin, Linux Kernel Mailing List, Thomas Gleixner,
	stable, H. Peter Anvin

On Wed, May 07, 2014 at 11:18:49AM +0200, Sven Joachim wrote:
> I would rather not set up a virtual machine just for wine (I don't
> have Windows anymore).

What about ReactOS? (I'm not saying this shouldn't be addressed,
regardless...)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-07  9:18           ` Sven Joachim
  2014-05-07 10:18             ` Borislav Petkov
@ 2014-05-07 16:57             ` Linus Torvalds
  2014-05-07 17:09               ` H. Peter Anvin
                                 ` (2 more replies)
  1 sibling, 3 replies; 136+ messages in thread
From: Linus Torvalds @ 2014-05-07 16:57 UTC (permalink / raw)
  To: Sven Joachim
  Cc: Alexandre Julliard, Brian Gerst, Ingo Molnar, H. Peter Anvin,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin


On Wed, May 7, 2014 at 2:18 AM, Sven Joachim <svenjoac@gmx.de> wrote:
>
> It seems that at least some 32-bit programs are also broken, since after
> upgrading the kernel to 3.14.3 I can no longer start my old chess
> database program:

So for backporting (and for 3.15) maybe this (TOTALLY UNTESTED) patch
would be acceptable.

It adds a "/proc/sys/abi/ldt16" sysctl that defaults to zero (off). If
you hit this issue and care about your old Windows program more than
you care about a kernel stack address information leak, you can do

   echo 1 > /proc/sys/abi/ldt16

as root (add it to your startup scripts), and you should be ok.

Afaik, 16-bit programs under wine already need

  echo 0 > /proc/sys/vm/mmap_min_addr

because they want to map things at address 0, so this isn't a new concept.

I would like to repeat that this is totally untested.  And the sysctl
table is only added if you have COMPAT support enabled on x86-64, but
I assume anybody who runs old Windows binaries very much does that ;)

                        Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1389 bytes --]

 arch/x86/kernel/ldt.c        | 4 +++-
 arch/x86/vdso/vdso32-setup.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index af1d14a9ebda..dcbbaa165bde 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -20,6 +20,8 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+int sysctl_ldt16 = 0;
+
 #ifdef CONFIG_SMP
 static void flush_ldt(void *current_mm)
 {
@@ -234,7 +236,7 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 	 * IRET leaking the high bits of the kernel stack address.
 	 */
 #ifdef CONFIG_X86_64
-	if (!ldt_info.seg_32bit) {
+	if (!ldt_info.seg_32bit && !sysctl_ldt16) {
 		error = -EINVAL;
 		goto out_unlock;
 	}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 00348980a3a6..e1f220e3ca68 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -39,6 +39,7 @@
 #ifdef CONFIG_X86_64
 #define vdso_enabled			sysctl_vsyscall32
 #define arch_setup_additional_pages	syscall32_setup_pages
+extern int sysctl_ldt16;
 #endif
 
 /*
@@ -249,6 +250,13 @@ static struct ctl_table abi_table2[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "ldt16",
+		.data		= &sysctl_ldt16,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{}
 };
 


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-07 16:57             ` Linus Torvalds
@ 2014-05-07 17:09               ` H. Peter Anvin
  2014-05-07 17:50                 ` Alexandre Julliard
  2014-05-08  6:43                 ` Sven Joachim
  2014-05-12 13:16               ` Josh Boyer
  2014-05-14 23:43               ` [tip:x86/urgent] x86-64, modify_ldt: Make support for 16-bit segments a runtime option tip-bot for Linus Torvalds
  2 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-05-07 17:09 UTC (permalink / raw)
  To: Linus Torvalds, Sven Joachim
  Cc: Alexandre Julliard, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On 05/07/2014 09:57 AM, Linus Torvalds wrote:
> 
> Afaik, 16-bit programs under wine already need
> 
>   echo 0 > /proc/sys/vm/mmap_min_addr
> 
> because they want to map things at address 0, so this isn't a new concept.
> 

I think that applies to DOSEMU, but not to Wine.

Sven: if you have the ability to build your own kernel, could you also
try the "x86/espfix" branch of the git tree:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/

(clone URLs:)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

... to make sure the proper solution works for you?

I'm somewhat curious if this program you have is actually a 32-bit
program or if it is really a 16-bit program wrapped in a 32-bit
installer of some kind.  Hard to know without seeing the program in
question.

	-hpa



* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-07 17:09               ` H. Peter Anvin
@ 2014-05-07 17:50                 ` Alexandre Julliard
  2014-05-08  6:43                 ` Sven Joachim
  1 sibling, 0 replies; 136+ messages in thread
From: Alexandre Julliard @ 2014-05-07 17:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Sven Joachim, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 05/07/2014 09:57 AM, Linus Torvalds wrote:
>> 
>> Afaik, 16-bit programs under wine already need
>> 
>>   echo 0 > /proc/sys/vm/mmap_min_addr
>> 
>> because they want to map things at address 0, so this isn't a new concept.
>> 
>
> I think that applies to DOSEMU, but not to Wine.

Yes, there are a few exceptions, but most Win16 apps run fine without
mapping address 0.

> I'm somewhat curious if this program you have is actually a 32-bit
> program or if it is really a 16-bit program wrapped in a 32-bit
> installer of some kind.  Hard to know without seeing the program in
> question.

It could be a mix of both; there are various thunking mechanisms that
allow 32-bit binaries to use 16-bit components.  This was pretty common
in the Win95 days.

-- 
Alexandre Julliard
julliard@winehq.org


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-07 17:09               ` H. Peter Anvin
  2014-05-07 17:50                 ` Alexandre Julliard
@ 2014-05-08  6:43                 ` Sven Joachim
  2014-05-08 13:50                   ` H. Peter Anvin
  1 sibling, 1 reply; 136+ messages in thread
From: Sven Joachim @ 2014-05-08  6:43 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Alexandre Julliard, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On 2014-05-07 19:09 +0200, H. Peter Anvin wrote:

> On 05/07/2014 09:57 AM, Linus Torvalds wrote:
>> 
>> Afaik, 16-bit programs under wine already need
>> 
>>   echo 0 > /proc/sys/vm/mmap_min_addr
>> 
>> because they want to map things at address 0, so this isn't a new concept.
>> 
>
> I think that applies to DOSEMU, but not to Wine.

DOSEMU no longer needs it either.  If vm.mmap_min_addr is > 0, it
turns on CPU emulation, which it has to use anyway due to the lack of
vm86 mode.
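The threshold DOSEMU keys off of can be inspected directly; a minimal
sketch in Python, assuming a Linux /proc filesystem (the exact value
varies by distribution):

```python
# Read vm.mmap_min_addr, the lowest address user space may mmap.
# A value of 0 lets a program map page zero (what DOS-style setups
# historically wanted); any higher value makes DOSEMU fall back to
# CPU emulation.
with open("/proc/sys/vm/mmap_min_addr") as f:
    mmap_min_addr = int(f.read().strip())

print(mmap_min_addr)  # commonly 4096 or 65536 on modern distributions
print("CPU emulation" if mmap_min_addr > 0 else "low mappings allowed")
```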

> Sven: if you have the ability to build your own kernel, could you also
> try the "x86/espfix" branch of the git tree:
>
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/
>
> (clone URLs:)
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
> ... to make sure the proper solution works for you?

Works fine here, thanks.

> I'm somewhat curious if this program you have is actually a 32-bit
> program or if it is really a 16-bit program wrapped in a 32-bit
> installer of some kind.  Hard to know without seeing the program in
> question.

The main application (a chess database program) is 32-bit, but it comes
with several 16-bit analysis engines that are loaded on startup (I see
them in lsof output), so that's the situation described by Alexandre.

Cheers,
       Sven


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-08  6:43                 ` Sven Joachim
@ 2014-05-08 13:50                   ` H. Peter Anvin
  2014-05-08 20:13                     ` H. Peter Anvin
  2014-05-08 20:40                     ` H. Peter Anvin
  0 siblings, 2 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-05-08 13:50 UTC (permalink / raw)
  To: Sven Joachim
  Cc: Linus Torvalds, Alexandre Julliard, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

Actually it could use KVM instead of CPU emulation on nearly all modern processors...

On May 7, 2014 11:43:59 PM PDT, Sven Joachim <svenjoac@gmx.de> wrote:
>On 2014-05-07 19:09 +0200, H. Peter Anvin wrote:
>
>> On 05/07/2014 09:57 AM, Linus Torvalds wrote:
>>> 
>>> Afaik, 16-bit programs under wine already need
>>> 
>>>   echo 0 > /proc/sys/vm/mmap_min_addr
>>> 
>>> because they want to map things at address 0, so this isn't a new
>concept.
>>> 
>>
>> I think that applies to DOSEMU, but not to Wine.
>
>DOSEMU no longer needs it either.  If vm.mmap_min_addr is > 0, it
>turns on CPU emulation, which it has to use anyway due to the lack of
>vm86 mode.
>
>> Sven: if you have the ability to build your own kernel, could you
>also
>> try the "x86/espfix" branch of the git tree:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/
>>
>> (clone URLs:)
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>>
>> ... to make sure the proper solution works for you?
>
>Works fine here, thanks.
>
>> I'm somewhat curious if this program you have is actually a 32-bit
>> program or if it is really a 16-bit program wrapped in a 32-bit
>> installer of some kind.  Hard to know without seeing the program in
>> question.
>
>The main application (a chess database program) is 32-bit, but it comes
>with several 16-bit analysis engines that are loaded on startup (I see
>them in lsof output), so that's the situation described by Alexandre.
>
>Cheers,
>       Sven

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-08 13:50                   ` H. Peter Anvin
@ 2014-05-08 20:13                     ` H. Peter Anvin
  2014-05-08 20:40                     ` H. Peter Anvin
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-05-08 20:13 UTC (permalink / raw)
  To: H. Peter Anvin, Sven Joachim
  Cc: Linus Torvalds, Alexandre Julliard, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable

On 05/08/2014 06:50 AM, H. Peter Anvin wrote:
> Actually it could use KVM instead of CPU emulation on nearly all modern processors...

Of course, at that point you might just run qemu-kvm instead of DOSEMU,
since I seem to recall that DOSEMU still runs a real version of DOS.  I
have to admit to mostly using DOSBox for games and KVM for anything else
that needs DOS these days... just what happens to work best for me.

	-hpa




* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-08 13:50                   ` H. Peter Anvin
  2014-05-08 20:13                     ` H. Peter Anvin
@ 2014-05-08 20:40                     ` H. Peter Anvin
  1 sibling, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-05-08 20:40 UTC (permalink / raw)
  To: Sven Joachim
  Cc: Linus Torvalds, Alexandre Julliard, Brian Gerst, Ingo Molnar,
	Linux Kernel Mailing List, Thomas Gleixner, stable,
	H. Peter Anvin

On 05/08/2014 06:50 AM, H. Peter Anvin wrote:
> Actually it could use KVM instead of CPU emulation on nearly all modern processors...

That being said, it would be cool if someone would either port the
lredir backend (MFS) into Qemu, or finish the 9P front end I started
writing at one point, but probably will never have time to finish.
	
	-hpa




* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-07 16:57             ` Linus Torvalds
  2014-05-07 17:09               ` H. Peter Anvin
@ 2014-05-12 13:16               ` Josh Boyer
  2014-05-12 16:52                 ` H. Peter Anvin
  2014-05-14 23:43               ` [tip:x86/urgent] x86-64, modify_ldt: Make support for 16-bit segments a runtime option tip-bot for Linus Torvalds
  2 siblings, 1 reply; 136+ messages in thread
From: Josh Boyer @ 2014-05-12 13:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sven Joachim, Alexandre Julliard, Brian Gerst, Ingo Molnar,
	H. Peter Anvin, Linux Kernel Mailing List, Thomas Gleixner,
	stable, H. Peter Anvin, info

On Wed, May 7, 2014 at 12:57 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, May 7, 2014 at 2:18 AM, Sven Joachim <svenjoac@gmx.de> wrote:
>>
>> It seems that at least some 32-bit programs are also broken, since after
>> upgrading the kernel to 3.14.3 I can no longer start my old chess
>> database program:

Now that this has hit 3.14.y stable, we have another report of this
commit breaking something in Wine, this time MS Access 2000:

https://bugzilla.redhat.com/show_bug.cgi?id=1096725 (reporter now CC'd)

> So for backporting (and for 3.15) maybe this (TOTALLY UNTESTED) patch
> would be acceptable.

I don't think your patch went anywhere.  I have no idea if you want to
push that or revert or just tell people to run old apps in 32-bit
guests at this point.  Just forwarding on the information.

josh


* Re: [tip:x86/urgent] x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels
  2014-05-12 13:16               ` Josh Boyer
@ 2014-05-12 16:52                 ` H. Peter Anvin
  0 siblings, 0 replies; 136+ messages in thread
From: H. Peter Anvin @ 2014-05-12 16:52 UTC (permalink / raw)
  To: Josh Boyer, Linus Torvalds
  Cc: Sven Joachim, Alexandre Julliard, Brian Gerst, Ingo Molnar,
	H. Peter Anvin, Linux Kernel Mailing List, Thomas Gleixner,
	stable, info

On 05/12/2014 06:16 AM, Josh Boyer wrote:
> On Wed, May 7, 2014 at 12:57 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, May 7, 2014 at 2:18 AM, Sven Joachim <svenjoac@gmx.de> wrote:
>>>
>>> It seems that at least some 32-bit programs are also broken, since after
>>> upgrading the kernel to 3.14.3 I can no longer start my old chess
>>> database program:
> 
> Now that this has hit 3.14.y stable, we have another report of this
> commit breaking something in Wine, this time MS Access 2000:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1096725 (reporter now CC'd)
> 
>> So for backporting (and for 3.15) maybe this (TOTALLY UNTESTED) patch
>> would be acceptable.
> 
> I don't think your patch went anywhere.  I have no idea if you want to
> push that or revert or just tell people to run old apps in 32-bit
> guests at this point.  Just forwarding on the information.
> 

Linus is off the net at the moment.  Someone needs to take his patch and
test it/clean it up.

	-hpa




* [tip:x86/urgent] x86-64, modify_ldt: Make support for 16-bit segments a runtime option
  2014-05-07 16:57             ` Linus Torvalds
  2014-05-07 17:09               ` H. Peter Anvin
  2014-05-12 13:16               ` Josh Boyer
@ 2014-05-14 23:43               ` tip-bot for Linus Torvalds
  2 siblings, 0 replies; 136+ messages in thread
From: tip-bot for Linus Torvalds @ 2014-05-14 23:43 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, torvalds, stable, tglx, hpa

Commit-ID:  fa81511bb0bbb2b1aace3695ce869da9762624ff
Gitweb:     http://git.kernel.org/tip/fa81511bb0bbb2b1aace3695ce869da9762624ff
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Wed, 14 May 2014 16:33:54 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Wed, 14 May 2014 16:33:54 -0700

x86-64, modify_ldt: Make support for 16-bit segments a runtime option

Checkin:

b3b42ac2cbae x86-64, modify_ldt: Ban 16-bit segments on 64-bit kernels

disabled 16-bit segments on 64-bit kernels due to an information
leak.  However, it does seem that people are genuinely using Wine to
run old 16-bit Windows programs on Linux.

A proper fix for this ("espfix64") is coming in the upcoming merge
window, but as a temporary fix, create a sysctl to allow the
administrator to re-enable support for 16-bit segments.

It adds a "/proc/sys/abi/ldt16" sysctl that defaults to zero (off). If
you hit this issue and care about your old Windows program more than
you care about a kernel stack address information leak, you can do

   echo 1 > /proc/sys/abi/ldt16

as root (add it to your startup scripts), and you should be ok.

The sysctl table is only added if you have COMPAT support enabled on
x86-64, but I assume anybody who runs old Windows binaries very much
does that ;)

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/CA%2B55aFw9BPoD10U1LfHbOMpHWZkvJTkMcfCs9s3urPr1YyWBxw@mail.gmail.com
Cc: <stable@vger.kernel.org>
---
 arch/x86/kernel/ldt.c        | 4 +++-
 arch/x86/vdso/vdso32-setup.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index af1d14a..dcbbaa1 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -20,6 +20,8 @@
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
 
+int sysctl_ldt16 = 0;
+
 #ifdef CONFIG_SMP
 static void flush_ldt(void *current_mm)
 {
@@ -234,7 +236,7 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 	 * IRET leaking the high bits of the kernel stack address.
 	 */
 #ifdef CONFIG_X86_64
-	if (!ldt_info.seg_32bit) {
+	if (!ldt_info.seg_32bit && !sysctl_ldt16) {
 		error = -EINVAL;
 		goto out_unlock;
 	}
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 0034898..e1f220e 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -39,6 +39,7 @@
 #ifdef CONFIG_X86_64
 #define vdso_enabled			sysctl_vsyscall32
 #define arch_setup_additional_pages	syscall32_setup_pages
+extern int sysctl_ldt16;
 #endif
 
 /*
@@ -249,6 +250,13 @@ static struct ctl_table abi_table2[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "ldt16",
+		.data		= &sysctl_ldt16,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{}
 };
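The workaround the changelog describes boils down to the following
sketch; it assumes a kernel carrying this patch (the abi.ldt16 key does
not exist on other kernels) and requires root:

```shell
# One-off: re-enable 16-bit LDT segments until the next reboot.
echo 1 > /proc/sys/abi/ldt16

# Persistent: have sysctl re-apply the setting at every boot.
# The file name under /etc/sysctl.d/ is arbitrary.
echo "abi.ldt16 = 1" > /etc/sysctl.d/90-ldt16.conf
sysctl --system    # reload all sysctl configuration now
```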
 

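For reference, the information leak that motivated the original ban can
be sketched arithmetically: on return to a 16-bit stack segment, IRET
overwrites only the low 16 bits of the stack pointer, so the remaining
bits of %rsp still carry kernel stack address bits. The addresses below
are made-up example values, not real kernel addresses:

```python
# Hypothetical kernel stack pointer at the time of IRET, and the
# 16-bit user stack pointer being returned to.
kernel_rsp = 0xffff880012345f78   # made-up kernel stack address
user_sp16  = 0x9f78               # only these 16 bits are restored

# IRET to a 16-bit segment replaces only bits 0..15 of %rsp; the
# upper bits keep their kernel-side values.
restored = (kernel_rsp & ~0xffff) | user_sp16

print(hex(restored))                            # → 0xffff880012349f78
print((restored >> 16) == (kernel_rsp >> 16))   # → True: upper bits leak
```

The espfix64 work mentioned in the thread removes the leak by switching
to a stack whose upper bits are safe to expose before the final IRET.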

end of thread, other threads:[~2014-05-14 23:43 UTC | newest]

