linux-kernel.vger.kernel.org archive mirror
* New vsyscall emulation breaks JITs
@ 2011-08-05 20:09 Andi Kleen
  2011-08-05 20:23 ` H. Peter Anvin
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2011-08-05 20:09 UTC (permalink / raw)
  To: luto, x86, linux-kernel, torvalds; +Cc: lueckintel, kimwooyoung


Andy,

We found that your new vsyscall emulation in

commit 5cec93c216db77c45f7ce970d46283bcb1933884
Author: Andy Lutomirski <luto@MIT.EDU>
Date:   Sun Jun 5 13:50:24 2011 -0400

    x86-64: Emulate legacy vsyscalls

breaks JITs that execute x86 code and use the legacy vsyscalls.

The problem is that the JIT translates the vsyscall page into
its code buffer and executes the "int 0xcc" there. Then 
when the kernel gets the interrupt it doesn't see the vsyscall
page as the source and crashes the program.
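
The failure can be sketched in user-space C. This is only an illustration, modeled on the addr_to_vsyscall_nr() helper visible in the patch context later in this thread; the -1 return value (the kernel helper returns -EINVAL) and the vsyscall_nr_from_trap_ip() wrapper are assumptions made for the example. The kernel works out which vsyscall was requested from the address of the trapping instruction, so an "int 0xcc" copied into a JIT code buffer no longer maps to any vsyscall.

```c
#include <stdint.h>

/* Fixed address of the legacy vsyscall page on x86-64. */
#define VSYSCALL_START 0xffffffffff600000ULL

/*
 * Map an instruction address to a vsyscall number (0..2), or -1 if the
 * address is not a vsyscall entry point.  Entries are 1024 bytes apart.
 * (Returning -1 instead of -EINVAL is a simplification for this sketch.)
 */
static int addr_to_vsyscall_nr(uint64_t addr)
{
	int nr;

	if ((addr & ~0xC00ULL) != VSYSCALL_START)
		return -1;

	nr = (int)((addr & 0xC00ULL) >> 10);
	if (nr >= 3)
		return -1;

	return nr;
}

/*
 * "int 0xcc" is two bytes long, so the saved ip points two bytes past
 * the start of the instruction that trapped.  (Hypothetical wrapper.)
 */
static int vsyscall_nr_from_trap_ip(uint64_t ip)
{
	return addr_to_vsyscall_nr(ip - 2);
}
```

With this lookup, a trap taken at 0xffffffffff600400 resolves to vsyscall 1 (time), while the same two bytes executed from a JIT buffer in ordinary user address space resolve to nothing, and the program gets a SIGSEGV.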

For some reason several modern executables also seem
to still use the old vsyscall page, so this problem can be hit
quickly.

This happened with pin (http://www.pintool.org/), however
I expect it will affect all user space x86 JITs (valgrind, 
dynamo, qemu-user, etc.)

What to do? Right now this broke existing setups.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:09 New vsyscall emulation breaks JITs Andi Kleen
@ 2011-08-05 20:23 ` H. Peter Anvin
  2011-08-05 20:26   ` Andi Kleen
  2011-08-05 20:45   ` Andrew Lutomirski
  0 siblings, 2 replies; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-05 20:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: luto, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On 08/05/2011 01:09 PM, Andi Kleen wrote:
> 
> Andy,
> 
> We found that your new vsyscall emulation in
> 
> commit 5cec93c216db77c45f7ce970d46283bcb1933884
> Author: Andy Lutomirski <luto@MIT.EDU>
> Date:   Sun Jun 5 13:50:24 2011 -0400
> 
>     x86-64: Emulate legacy vsyscalls
> 
> breaks JITs that execute x86 code and use the legacy vsyscalls.
> 
> The problem is that the JIT translates the vsyscall page into
> its code buffer and executes the "int 0xcc" there. Then 
> when the kernel gets the interrupt it doesn't see the vsyscall
> page as the source and crashes the program.
> 
> For some reason several modern executables also seem
> to still use the old vsyscall page, so this problem can be hit
> quickly.
> 
> This happened with pin (http://www.pintool.org/), however
> I expect it will affect all user space x86 JITs (valgrind, 
> dynamo, qemu-user, etc.)
> 
> What to do? Right now this broke existing setups.
> 

I have to say I believe that trying to JIT the vdso or vsyscall pages is
extremely dubious at best.  They are fundamentally different from normal
user space in that the kernel can muck with them any time, without
notifying userspace about it.  The other aspect of this is that this is
about the legacy vsyscall page, which we're trying to get rid of, partly
because of security problems.

As such, it's not entirely obvious what the right thing to do here is.
On one hand, it "breaks user space", but on the other hand that is
userspace doing something fundamentally broken in the first place.

	-hpa



* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:23 ` H. Peter Anvin
@ 2011-08-05 20:26   ` Andi Kleen
  2011-08-05 20:36     ` H. Peter Anvin
  2011-08-05 20:45   ` Andrew Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2011-08-05 20:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, luto, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

> I have to say I believe that trying to JIT the vdso or vsyscall pages is
> extremely dubious at best.  They are fundamentally different from normal
> user space in that the kernel can muck with them any time, without
> notifying userspace about it.  The other aspect of this is that this is
> about the legacy vsyscall page, which we're trying to get rid of, partly
> because of security problems.

There's clear evidence now you can't: it's used even by new binaries.

-Andi



* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:26   ` Andi Kleen
@ 2011-08-05 20:36     ` H. Peter Anvin
  2011-08-05 20:47       ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-05 20:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: luto, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On 08/05/2011 01:26 PM, Andi Kleen wrote:
>> I have to say I believe that trying to JIT the vdso or vsyscall pages is
>> extremely dubious at best.  They are fundamentally different from normal
>> user space in that the kernel can muck with them any time, without
>> notifying userspace about it.  The other aspect of this is that this is
>> about the legacy vsyscall page, which we're trying to get rid of, partly
>> because of security problems.
> 
> There's clear evidence now you can't: it's used even by new binaries.

time() is not supported by vdso; this is a problem.  Getting rid of it
is a long-term thing.

	-hpa



* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:23 ` H. Peter Anvin
  2011-08-05 20:26   ` Andi Kleen
@ 2011-08-05 20:45   ` Andrew Lutomirski
  2011-08-05 20:48     ` H. Peter Anvin
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-05 20:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 5, 2011 at 4:23 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/05/2011 01:09 PM, Andi Kleen wrote:
>>
>> Andy,
>>
>> We found that your new vsyscall emulation in
>>
>> commit 5cec93c216db77c45f7ce970d46283bcb1933884
>> Author: Andy Lutomirski <luto@MIT.EDU>
>> Date:   Sun Jun 5 13:50:24 2011 -0400
>>
>>     x86-64: Emulate legacy vsyscalls
>>
>> breaks JITs that execute x86 code and use the legacy vsyscalls.
>>
>> The problem is that the JIT translates the vsyscall page into
>> its code buffer and executes the "int 0xcc" there. Then
>> when the kernel gets the interrupt it doesn't see the vsyscall
>> page as the source and crashes the program.
>>
>> For some reason several modern executables also seem
>> to still use the old vsyscall page, so this problem can be hit
>> quickly.
>>
>> This happened with pin (http://www.pintool.org/), however
>> I expect it will affect all user space x86 JITs (valgrind,
>> dynamo, qemu-user, etc.)
>>
>> What to do? Right now this broke existing setups.
>>
>
> I have to say I believe that trying to JIT the vdso or vsyscall pages is
> extremely dubious at best.  They are fundamentally different from normal
> user space in that the kernel can muck with them any time, without
> notifying userspace about it.  The other aspect of this is that this is
> about the legacy vsyscall page, which we're trying to get rid of, partly
> because of security problems.

Valgrind in particular is already smart enough not to JIT
gettimeofday().  It crashes on the getcpu vsyscall with or without the
vsyscall emulation patch, so we get off easy there.  (I haven't tried
to debug it.)

I suspect that qemu-user won't have this problem either because I
doubt it looks into the vsyscall page in the first place.  Presumably
it just maps the original program.  In any case, it's a full emulator,
and I don't see why the host kernel should matter.

I'm a bit disinclined to play with Pin, because the license tells me
that I shouldn't reverse-engineer anything, and that would be the
whole point.  It claims to be pre-release code.

If by dynamo you mean DynamoRIO, I can't get it to build.


An older version of the vsyscall emulation code used a fancier
sequence that would survive relocation, although it involved a 'ret'
instruction and it would be nice not to put 'ret' into the vsyscall
page.


hpa: time is supported (as of 3.0) by the vdso, and very new glibc
uses the vdso version.  We could add a native time implementation back
to the vsyscall page without too much pain as a short-term fix, but
that would be less than ideal.

>
> As such, it's not entirely obvious what the right thing to do here is.
> On one hand, it "breaks user space", but on the other hand that is
> userspace doing something fundamentally broken in the first place.
>
>        -hpa
>
>


* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:36     ` H. Peter Anvin
@ 2011-08-05 20:47       ` Andi Kleen
  0 siblings, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2011-08-05 20:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, luto, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 05, 2011 at 01:36:48PM -0700, H. Peter Anvin wrote:
> On 08/05/2011 01:26 PM, Andi Kleen wrote:
> >> I have to say I believe that trying to JIT the vdso or vsyscall pages is
> >> extremely dubious at best.  They are fundamentally different from normal
> >> user space in that the kernel can muck with them any time, without
> >> notifying userspace about it.  The other aspect of this is that this is
> >> about the legacy vsyscall page, which we're trying to get rid of, partly
> >> because of security problems.
> > 
> > There's clear evidence now you can't: it's used even by new binaries.
> 
> time() is not supported by vdso; this is a problem.  Getting rid of it
> is a long-term thing.

Yes, you're right, the problem is time(). I set a breakpoint on the vsyscalls
and I get:

(gdb) bt
#0  0xffffffffff600400 in ?? ()
#1  0x00007fffe4f703fd in time () at ../sysdeps/unix/sysv/linux/x86_64/time.S:36
...

This is with a very new glibc, in fact a recent git version.

So clearly it's all broken: even outside JITs, everyone using time()
will be slow, and under a JIT it won't work at all.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:45   ` Andrew Lutomirski
@ 2011-08-05 20:48     ` H. Peter Anvin
  2011-08-05 20:52       ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-05 20:48 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On 08/05/2011 01:45 PM, Andrew Lutomirski wrote:
> 
> hpa: time is supported (as of 3.0) by the vdso, and very new glibc
> uses the vdso version.  We could add a native time implementation back
> to the vsyscall page without too much pain as a short-term fix, but
> that would be less than ideal.
> 

How new does glibc have to be?

How much of a pain would it be to make the legacy vs emulated vsyscall
page a config option?

	-hpa


* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:48     ` H. Peter Anvin
@ 2011-08-05 20:52       ` Andi Kleen
  2011-08-05 21:00         ` Andrew Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2011-08-05 20:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Lutomirski, Andi Kleen, x86, linux-kernel, torvalds,
	lueckintel, kimwooyoung

On Fri, Aug 05, 2011 at 01:48:28PM -0700, H. Peter Anvin wrote:
> On 08/05/2011 01:45 PM, Andrew Lutomirski wrote:
> > 
> > hpa: time is supported (as of 3.0) by the vdso, and very new glibc
> > uses the vdso version.  We could add a native time implementation back
> > to the vsyscall page without too much pain as a short-term fix, but
> > that would be less than ideal.
> > 
> 
> How new does glibc have to be?

Mine from May 17 doesn't support it.

> How much of a pain would it be to make the legacy vs emulated vsyscall
> page a config option?

CONFIG_DONT_BREAK_MY_BINARIES? 

If anything, runtime; but really, to me it looks like the vsyscall
changes should only be in one of those limited-compatibility paranoia
patchkits.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: New vsyscall emulation breaks JITs
  2011-08-05 20:52       ` Andi Kleen
@ 2011-08-05 21:00         ` Andrew Lutomirski
  2011-08-05 21:21           ` Andi Kleen
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-05 21:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 5, 2011 at 4:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Aug 05, 2011 at 01:48:28PM -0700, H. Peter Anvin wrote:
>> On 08/05/2011 01:45 PM, Andrew Lutomirski wrote:
>> >
>> > hpa: time is supported (as of 3.0) by the vdso, and very new glibc
>> > uses the vdso version.  We could add a native time implementation back
>> > to the vsyscall page without too much pain as a short-term fix, but
>> > that would be less than ideal.
>> >
>>
>> How new does glibc have to be?
>
> Mine from May 17 doesn't support it.

c738465a4c13370f58b797a82cdf1c67e1121867 from May 28.

>
>> How much of a pain would it be to make the legacy vs emulated vsyscall
>> page a config option?
>
> CONFIG_DONT_BREAK_MY_BINARIES?
>

If gettimeofday could be a pure syscall fallback, then it wouldn't be
so bad.  With the vread_tsc changes, the vsyscall page can't directly
call ->vread anymore, and making *that* conditional would be rather
ugly.

> If anything runtime, but really for me it looks like the vsyscall
> changes should be only in one of those limited compatibility paranoia
> patchkits.

Switching it in runtime would be a giant mess because user code might
be executing from the vsyscall page while we try to switch it.
Switching at boot time might not be so bad.  We'd just compile the
emulation code in unconditionally but have a fallback page that we
could map if needed.

I also filed this issue:
https://code.google.com/p/dynamorio/issues/detail?id=530

--Andy


* Re: New vsyscall emulation breaks JITs
  2011-08-05 21:00         ` Andrew Lutomirski
@ 2011-08-05 21:21           ` Andi Kleen
  2011-08-05 21:26             ` Andrew Lutomirski
  2011-08-09 13:26             ` Andrew Lutomirski
  0 siblings, 2 replies; 37+ messages in thread
From: Andi Kleen @ 2011-08-05 21:21 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, H. Peter Anvin, x86, linux-kernel, torvalds,
	lueckintel, kimwooyoung

On Fri, Aug 05, 2011 at 05:00:44PM -0400, Andrew Lutomirski wrote:
> > If anything runtime, but really for me it looks like the vsyscall
> > changes should be only in one of those limited compatibility paranoia
> > patchkits.
> 
> Switching it in runtime would be a giant mess because user code might

You can always switch at boot time.

But a really serious binary incompatibility like this should not be the
default (not to mention the slowdown for existing binaries using time()).

-Andi


* Re: New vsyscall emulation breaks JITs
  2011-08-05 21:21           ` Andi Kleen
@ 2011-08-05 21:26             ` Andrew Lutomirski
  2011-08-05 22:06               ` H. Peter Anvin
  2011-08-09 13:26             ` Andrew Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-05 21:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 5, 2011 at 5:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Aug 05, 2011 at 05:00:44PM -0400, Andrew Lutomirski wrote:
>> > If anything runtime, but really for me it looks like the vsyscall
>> > changes should be only in one of those limited compatibility paranoia
>> > patchkits.
>>
>> Switching it in runtime would be a giant mess because user code might
>
> You can always switch at boot time.

For a boot time switch, it might be nicer to just switch between the
current int 0xcc sequence and the older

mov cx, 0x<magic>
int 0xcc
ret

sequence.

That way there's a ret in the vsyscall page but no syscall instruction.

--Andy


* Re: New vsyscall emulation breaks JITs
  2011-08-05 21:26             ` Andrew Lutomirski
@ 2011-08-05 22:06               ` H. Peter Anvin
  2011-08-05 22:11                 ` Andrew Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-05 22:06 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On 08/05/2011 02:26 PM, Andrew Lutomirski wrote:
> 
> For a boot time switch, it might be nicer to just switch between the
> current int 0xcc sequence and the older
> 
> mov cx, 0x<magic>
> int 0xcc
> ret
> 
> sequence.
> 
> That way there's a ret in the vsyscall page but no syscall instruction.
> 

Refresh my memory... we have what... six legacy vsyscall entry points?
We could, hypothetically, burn six interrupt vectors with them.  If we
get them from the 0x40-0x4f range, then they are harmless standalone REX
prefixes (and INC/DEC instructions in 32-bit mode.)
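
The encoding argument can be made concrete with a small user-space sketch. The helper names below are hypothetical and exist only for illustration. "int imm8" is encoded as the opcode byte 0xCD followed by the vector byte, so a jump into the middle of the instruction lands on the vector byte itself; the question is what that byte decodes as in 64-bit mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* "int imm8" is encoded as opcode 0xCD followed by the vector byte. */
static void encode_int(uint8_t vector, uint8_t out[2])
{
	out[0] = 0xCD;
	out[1] = vector;
}

/* In 64-bit mode, bytes 0x40..0x4f are REX prefixes. */
static bool is_rex_prefix(uint8_t byte)
{
	return (byte & 0xF0) == 0x40;
}

/* 0xCC is the one-byte int3 (breakpoint) instruction. */
static bool is_int3(uint8_t byte)
{
	return byte == 0xCC;
}
```

For the existing vector, the second byte of "int 0xcc" is int3, which traps. For a vector in 0x40-0x4f, the second byte is a bare REX prefix, which does nothing useful by itself and only modifies whatever instruction follows it.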

The issue with pin as far as I understand is that it's executing an
instruction at a different address and expecting it to have identical
semantics, which is an incorrect assumption for trapping instructions
(consider doing that for something like SYSENTER!).

Now, as far as RET is concerned I don't see how it does anything that
the INT instruction doesn't do already; ANY of the emulated instructions
have to return to the address on the stack in order to work at all, OR
they can return to the next address and do RET.

	-hpa


* Re: New vsyscall emulation breaks JITs
  2011-08-05 22:06               ` H. Peter Anvin
@ 2011-08-05 22:11                 ` Andrew Lutomirski
  2011-08-06  0:20                   ` Andrew Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-05 22:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 5, 2011 at 6:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/05/2011 02:26 PM, Andrew Lutomirski wrote:
>>
>> For a boot time switch, it might be nicer to just switch between the
>> current int 0xcc sequence and the older
>>
>> mov cx, 0x<magic>
>> int 0xcc
>> ret
>>
>> sequence.
>>
>> That way there's a ret in the vsyscall page but no syscall instruction.
>>
>
> Refresh my memory... we have what... six legacy vsyscall entry points?
> We could, hypothetically, burn six interrupt vectors with them.  If we
> get them from the 0x40-0x4f range, then they are harmless standalone REX
> prefixes (and INC/DEC instructions in 32-bit mode.)

Only three.  I have no real objection to burning two more vectors.
0xCD would also be safe.  ISTR that lower numbers like 0x40 might
actually mean something (ISA?).  32-bit semantics are irrelevant
because 32-bit code can't jump to the vsyscall page anyway.

>
> The issue with pin as far as I understand is that it's executing an
> instruction at a different address and expecting it to have identical
> semantics, which is an incorrect assumption for trapping instructions
> (consider doing that for something like SYSENTER!).

Agreed.  I think that no matter what we do we should encourage
userspace apps to stop doing dumb things.

>
> Now, as far as RET is concerned I don't see how it does anything that
> the INT instruction doesn't do already; ANY of the emulated instructions
> have to return to the address on the stack in order to work at all, OR
> they can return to the next address and do RET.

True.  But the emulated vsyscalls will segfault unless the registers
are valid pointers that they can write to, which makes them a little
harder to use than ret.

(And some day we could have a sysctl to turn off even the emulated syscalls.)

--Andy


* Re: New vsyscall emulation breaks JITs
  2011-08-05 22:11                 ` Andrew Lutomirski
@ 2011-08-06  0:20                   ` Andrew Lutomirski
  2011-08-06  0:32                     ` H. Peter Anvin
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-06  0:20 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 5, 2011 at 6:11 PM, Andrew Lutomirski <luto@mit.edu> wrote:
> On Fri, Aug 5, 2011 at 6:06 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 08/05/2011 02:26 PM, Andrew Lutomirski wrote:
>>>
>>> For a boot time switch, it might be nicer to just switch between the
>>> current int 0xcc sequence and the older
>>>
>>> mov cx, 0x<magic>
>>> int 0xcc
>>> ret
>>>
>>> sequence.
>>>
>>> That way there's a ret in the vsyscall page but no syscall instruction.
>>>
>>
>> Refresh my memory... we have what... six legacy vsyscall entry points?
>> We could, hypothetically, burn six interrupt vectors with them.  If we
>> get them from the 0x40-0x4f range, then they are harmless standalone REX
>> prefixes (and INC/DEC instructions in 32-bit mode.)
>
> Only three.  I have no real objection to burning two more vectors.
> 0xCD would also be safe.  ISTR that lower numbers like 0x40 might
> actually mean something (ISA?).  32-bit semantics are irrelevant
> because 32-bit code can't jump to the vsyscall page anyway.

I was thinking of 0x20 - 0x39.  0x40, 0x41, and 0x42 should do the
trick.  I'll cook up a patch.

If you want to keep those vectors available for devices as well, we
could hook do_general_protection instead, but that's a little messy.
Are there x86 machines out there that are starved for interrupt
vectors?

--Andy


* Re: New vsyscall emulation breaks JITs
  2011-08-06  0:20                   ` Andrew Lutomirski
@ 2011-08-06  0:32                     ` H. Peter Anvin
  2011-08-06  3:01                       ` [RFC] x86-64: Allow emulated vsyscalls from user addresses Andy Lutomirski
                                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-06  0:32 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung,
	Suresh Siddha

On 08/05/2011 05:20 PM, Andrew Lutomirski wrote:
> 
> I was thinking of 0x20 - 0x39.  0x40, 0x41, and 0x42 should do the
> trick.  I'll cook up a patch.
> 
> If you want to keep those vectors available for devices as well, we
> could hook do_general_protection instead, but that's a little messy.
> Are there x86 machines out there that are starved for interrupt
> vectors?
> 

Yes, but 3 aren't going to matter much.

However, on systems which have interrupt migration enabled we're not
using 0x21-0x2f for anything (because we need a single interrupt with
absolutely lowest priority).  Out of that range, there are a couple of
values which should be safe to use because they would be harmless
instructions of various forms:

	0x24	- AND AL, imm8
	0x25	- AND EAX, imm32
	0x26	- ES:
	0x2C	- SUB AL, imm8
	0x2D	- SUB EAX, imm32
	0x2E	- CS:

[Cc: Suresh who is the expert on the interrupt assignments]

	-hpa


* [RFC] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-06  0:32                     ` H. Peter Anvin
@ 2011-08-06  3:01                       ` Andy Lutomirski
  2011-08-06  3:04                       ` [RFC v2] " Andy Lutomirski
  2011-08-09 22:27                       ` New vsyscall emulation breaks JITs Suresh Siddha
  2 siblings, 0 replies; 37+ messages in thread
From: Andy Lutomirski @ 2011-08-06  3:01 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung,
	Suresh Siddha, Andy Lutomirski

A few dynamic recompilation tools are too clever for their own good.
They trace control flow through the vsyscall page and recompile that
code somewhere else.  Then they expect it to work.  DynamoRIO
(http://dynamorio.org/) and Pin (http://www.pintool.org/) are
affected.  They crash when tracing programs that use vsyscalls.
Valgrind is smart enough not to cause problems.  It crashes on the
getcpu vsyscall, but that has nothing to do with emulation.

This patch makes each of the three vsyscall entries use a different
vector so that they can work when relocated.  It assumes that the
code that relocates them is okay with the int instruction acting
like ret.  DynamoRIO at least appears to work.

We print an obnoxious (rate-limited) message to the log when this
happens.  Hopefully it will inspire the JIT tools to learn not to
trace into kernel address space.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---

This uses vectors 0x40, 0x41, and 0x42 for now.  They are REX
prefixes in 64-bit code, and jumping to the second byte of one
of these instructions will turn into 'rex.? int3', which will
trap.

 arch/x86/kernel/vsyscall_64.c     |   75 ++++++++++--------------------------
 arch/x86/kernel/vsyscall_emu_64.S |    6 +-
 2 files changed, 24 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index f785f5b..a33ad02 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -105,22 +105,8 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 	       regs->sp, regs->ax, regs->si, regs->di);
 }
 
-static int addr_to_vsyscall_nr(unsigned long addr)
-{
-	int nr;
-
-	if ((addr & ~0xC00UL) != VSYSCALL_START)
-		return -EINVAL;
-
-	nr = (addr & 0xC00UL) >> 10;
-	if (nr >= 3)
-		return -EINVAL;
-
-	return nr;
-}
-
-void emulate_vsyscall(struct pt_regs *regs, int nr,
-		      long (*vsys)(struct pt_regs *))
+static void emulate_vsyscall(struct pt_regs *regs, int nr,
+			     long (*vsys)(struct pt_regs *))
 {
 	struct task_struct *tsk;
 	unsigned long caller;
@@ -128,6 +114,8 @@ void emulate_vsyscall(struct pt_regs *regs, int nr,
 
 	local_irq_enable();
 
+	trace_emulate_vsyscall(nr);
+
 	if (!user_64bit_mode(regs)) {
 		/*
 		 * If we trapped from kernel mode, we might as well OOPS now
@@ -138,50 +126,29 @@ void emulate_vsyscall(struct pt_regs *regs, int nr,
 
 		/* Compat mode and non-compat 32-bit CS should both segfault. */
 		warn_bad_vsyscall(KERN_WARNING, regs,
-				  "illegal int 0xcc from 32-bit mode");
+				  "illegal emulated vsyscall from 32-bit mode");
 		goto sigsegv;
 	}
 
-	/*
-	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
-	 * and int 0xcc is two bytes long.
-	 */
-	vsyscall_nr = addr_to_vsyscall_nr(regs->ip - 2);
-
-	trace_emulate_vsyscall(vsyscall_nr);
-
-	if (vsyscall_nr < 0) {
-		warn_bad_vsyscall(KERN_WARNING, regs,
-				  "illegal int 0xcc (exploit attempt?)");
-		goto sigsegv;
-	}
+	tsk = current;
+	if (seccomp_mode(&tsk->seccomp))
+		do_exit(SIGKILL);
 
 	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
 		warn_bad_vsyscall(KERN_WARNING, regs, "int 0xcc with bad stack (exploit attempt?)");
 		goto sigsegv;
 	}
 
-	tsk = current;
-	if (seccomp_mode(&tsk->seccomp))
-		do_exit(SIGKILL);
+	/*
+	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
+	 * and int 0xcc is two bytes long.
+	 */
+	if (((regs->ip - 2) & ~0xfff) != VSYSCALL_START)
+		warn_bad_vsyscall(KERN_WARNING, regs,
+				  "emulated vsyscall from bogus address -- "
+				  "fix your code");
 
-	switch (vsyscall_nr) {
-	case 0:
-		ret = sys_gettimeofday(
-			(struct timeval __user *)regs->di,
-			(struct timezone __user *)regs->si);
-		break;
-
-	case 1:
-		ret = sys_time((time_t __user *)regs->di);
-		break;
-
-	case 2:
-		ret = sys_getcpu((unsigned __user *)regs->di,
-				 (unsigned __user *)regs->si,
-				 0);
-		break;
-	}
+	ret = vsys(regs);
 
 	if (ret == -EFAULT) {
 		/*
@@ -223,9 +190,9 @@ static long vsys_gettimeofday(struct pt_regs *regs)
 		(struct timezone __user *)regs->si);
 }
 
-void dotraplinkage emulate_vsyscall0(struct pt_regs *regs, long error_code)
+void dotraplinkage do_emulate_vsyscall0(struct pt_regs *regs, long error_code)
 {
-	emulate_vsyscall(regs, vsys_gettimeofday);
+	emulate_vsyscall(regs, 0, vsys_gettimeofday);
 }
 
 static long vsys_time(struct pt_regs *regs)
@@ -233,7 +200,7 @@ static long vsys_time(struct pt_regs *regs)
 	return sys_time((time_t __user *)regs->di);
 }
 
-void dotraplinkage emulate_vsyscall1(struct pt_regs *regs, long error_code)
+void dotraplinkage do_emulate_vsyscall1(struct pt_regs *regs, long error_code)
 {
 	emulate_vsyscall(regs, 1, vsys_time);
 }
@@ -245,7 +212,7 @@ static long vsys_getcpu(struct pt_regs *regs)
 			  0);
 }
 
-void dotraplinkage emulate_vsyscall2(struct pt_regs *regs, long error_code)
+void dotraplinkage do_emulate_vsyscall2(struct pt_regs *regs, long error_code)
 {
 	emulate_vsyscall(regs, 2, vsys_getcpu);
 }
diff --git a/arch/x86/kernel/vsyscall_emu_64.S b/arch/x86/kernel/vsyscall_emu_64.S
index ffa845e..a4f02a3 100644
--- a/arch/x86/kernel/vsyscall_emu_64.S
+++ b/arch/x86/kernel/vsyscall_emu_64.S
@@ -13,15 +13,15 @@
 
 .section .vsyscall_0, "a"
 ENTRY(vsyscall_0)
-	int $VSYSCALL_EMU_VECTOR
+	int $VSYSCALL0_EMU_VECTOR
 END(vsyscall_0)
 
 .section .vsyscall_1, "a"
 ENTRY(vsyscall_1)
-	int $VSYSCALL_EMU_VECTOR
+	int $VSYSCALL1_EMU_VECTOR
 END(vsyscall_1)
 
 .section .vsyscall_2, "a"
 ENTRY(vsyscall_2)
-	int $VSYSCALL_EMU_VECTOR
+	int $VSYSCALL2_EMU_VECTOR
 END(vsyscall_2)
-- 
1.7.6



* [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-06  0:32                     ` H. Peter Anvin
  2011-08-06  3:01                       ` [RFC] x86-64: Allow emulated vsyscalls from user addresses Andy Lutomirski
@ 2011-08-06  3:04                       ` Andy Lutomirski
  2011-08-06  6:45                         ` Ingo Molnar
  2011-08-11 13:16                         ` Pavel Machek
  2011-08-09 22:27                       ` New vsyscall emulation breaks JITs Suresh Siddha
  2 siblings, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2011-08-06  3:04 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung,
	Suresh Siddha, Andy Lutomirski

A few dynamic recompilation tools are too clever for their own good.
They trace control flow through the vsyscall page and recompile that
code somewhere else.  Then they expect it to work.  DynamoRIO
(http://dynamorio.org/) and Pin (http://www.pintool.org/) are
affected.  They crash when tracing programs that use vsyscalls.
Valgrind is smart enough not to cause problems.  It crashes on the
getcpu vsyscall, but that has nothing to do with emulation.

This patch makes each of the three vsyscall entries use a different
vector so that they can work when relocated.  It assumes that the
code that relocates them is okay with the int instruction acting
like ret.  DynamoRIO at least appears to work.

We print an obnoxious (rate-limited) message to the log when this
happens.  Hopefully it will inspire the JIT tools to learn not to
trace into kernel address space.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---

This uses vectors 0x40, 0x41, and 0x42 for now.  They are REX
prefixes in 64-bit code, and jumping to the second byte of one
of these instructions will turn into 'rex.? int3', which will
trap.

Changes from v1: Sending the correct patch this time.
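
For readers following along, the per-vector scheme from the patch description can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the kernel code: struct sim_regs is a hypothetical stand-in for the few pt_regs fields involved, the helper names are made up for the example, and the "act like ret" step (load ip from the caller's return address, pop 8 bytes) is an assumption about how the emulation finishes that is not shown in the diff excerpt here.

```c
#include <stdbool.h>
#include <stdint.h>

#define VSYSCALL_START        0xffffffffff600000ULL
#define VSYSCALL0_EMU_VECTOR  0x40	/* gettimeofday */
#define VSYSCALL1_EMU_VECTOR  0x41	/* time */
#define VSYSCALL2_EMU_VECTOR  0x42	/* getcpu */

/* Hypothetical stand-in for the pt_regs fields that matter here. */
struct sim_regs {
	uint64_t ip;
	uint64_t sp;
	uint64_t ax;
};

/* The vector alone identifies the vsyscall; the trap address is not needed. */
static int vector_to_vsyscall_nr(uint8_t vector)
{
	switch (vector) {
	case VSYSCALL0_EMU_VECTOR:
		return 0;
	case VSYSCALL1_EMU_VECTOR:
		return 1;
	case VSYSCALL2_EMU_VECTOR:
		return 2;
	default:
		return -1;
	}
}

/*
 * True if the two-byte int instruction that trapped was inside the
 * vsyscall page.  With per-vector dispatch a false result only
 * warrants a warning; it is no longer fatal.
 */
static bool trapped_in_vsyscall_page(uint64_t ip)
{
	return ((ip - 2) & ~0xfffULL) == VSYSCALL_START;
}

/* Make the int behave like ret: store the result, return to the caller. */
static void finish_like_ret(struct sim_regs *regs, uint64_t caller,
			    uint64_t result)
{
	regs->ax = result;
	regs->ip = caller;
	regs->sp += 8;
}
```

The design point is that the vsyscall number comes from the vector rather than from regs->ip, so a relocated copy of the instruction still selects the right handler, and the address check is reduced to a diagnostic.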

 arch/x86/include/asm/irq_vectors.h |   11 ++--
 arch/x86/include/asm/traps.h       |    8 ++-
 arch/x86/kernel/entry_64.S         |    4 +-
 arch/x86/kernel/traps.c            |   14 ++++-
 arch/x86/kernel/vsyscall_64.c      |  114 ++++++++++++++++++-----------------
 arch/x86/kernel/vsyscall_emu_64.S  |    6 +-
 6 files changed, 88 insertions(+), 69 deletions(-)

diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index f9a3209..b9c229a 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -15,10 +15,9 @@
  * IDT entries:
  *
  *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
- *  Vectors  32 ... 127 : device interrupts
- *  Vector  128         : legacy int80 syscall interface
- *  Vector  204         : legacy x86_64 vsyscall emulation
- *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 except 204 : device interrupts
+ *  Vectors  32 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts, except:
+ *   Vectors 64 ... 66  : legacy x86_64 vsyscall emulation
+ *   Vector  128        : legacy int80 syscall interface
  *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
  *
  * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
@@ -52,7 +51,9 @@
 # define SYSCALL_VECTOR			0x80
 #endif
 #ifdef CONFIG_X86_64
-# define VSYSCALL_EMU_VECTOR		0xcc
+# define VSYSCALL0_EMU_VECTOR		0x40
+# define VSYSCALL1_EMU_VECTOR		0x41
+# define VSYSCALL2_EMU_VECTOR		0x42
 #endif
 
 /*
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 2bae0a5..4335ff7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -40,7 +40,9 @@ asmlinkage void alignment_check(void);
 asmlinkage void machine_check(void);
 #endif /* CONFIG_X86_MCE */
 asmlinkage void simd_coprocessor_error(void);
-asmlinkage void emulate_vsyscall(void);
+asmlinkage void emulate_vsyscall0(void);
+asmlinkage void emulate_vsyscall1(void);
+asmlinkage void emulate_vsyscall2(void);
 
 dotraplinkage void do_divide_error(struct pt_regs *, long);
 dotraplinkage void do_debug(struct pt_regs *, long);
@@ -67,7 +69,9 @@ dotraplinkage void do_alignment_check(struct pt_regs *, long);
 dotraplinkage void do_machine_check(struct pt_regs *, long);
 #endif
 dotraplinkage void do_simd_coprocessor_error(struct pt_regs *, long);
-dotraplinkage void do_emulate_vsyscall(struct pt_regs *, long);
+dotraplinkage void do_emulate_vsyscall0(struct pt_regs *, long);
+dotraplinkage void do_emulate_vsyscall1(struct pt_regs *, long);
+dotraplinkage void do_emulate_vsyscall2(struct pt_regs *, long);
 #ifdef CONFIG_X86_32
 dotraplinkage void do_iret_error(struct pt_regs *, long);
 #endif
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e13329d..10489e5 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1111,7 +1111,9 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
 zeroentry coprocessor_error do_coprocessor_error
 errorentry alignment_check do_alignment_check
 zeroentry simd_coprocessor_error do_simd_coprocessor_error
-zeroentry emulate_vsyscall do_emulate_vsyscall
+zeroentry emulate_vsyscall0 do_emulate_vsyscall0
+zeroentry emulate_vsyscall1 do_emulate_vsyscall1
+zeroentry emulate_vsyscall2 do_emulate_vsyscall2
 
 
 	/* Reload gs selector with exception handling */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 9682ec5..6ae5e3a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -873,9 +873,17 @@ void __init trap_init(void)
 #endif
 
 #ifdef CONFIG_X86_64
-	BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
-	set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
-	set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
+	BUG_ON(test_bit(VSYSCALL0_EMU_VECTOR, used_vectors));
+	set_system_intr_gate(VSYSCALL0_EMU_VECTOR, &emulate_vsyscall0);
+	set_bit(VSYSCALL0_EMU_VECTOR, used_vectors);
+
+	BUG_ON(test_bit(VSYSCALL1_EMU_VECTOR, used_vectors));
+	set_system_intr_gate(VSYSCALL1_EMU_VECTOR, &emulate_vsyscall1);
+	set_bit(VSYSCALL1_EMU_VECTOR, used_vectors);
+
+	BUG_ON(test_bit(VSYSCALL2_EMU_VECTOR, used_vectors));
+	set_system_intr_gate(VSYSCALL2_EMU_VECTOR, &emulate_vsyscall2);
+	set_bit(VSYSCALL2_EMU_VECTOR, used_vectors);
 #endif
 
 	/*
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 93a0d46..a33ad02 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -8,11 +8,9 @@
  *  Special thanks to Ingo Molnar for his early experience with
  *  a different vsyscall implementation for Linux/IA32 and for the name.
  *
- *  vsyscall 1 is located at -10Mbyte, vsyscall 2 is located
- *  at virtual address -10Mbyte+1024bytes etc... There are at max 4
- *  vsyscalls. One vsyscall can reserve more than 1 slot to avoid
- *  jumping out of line if necessary. We cannot add more with this
- *  mechanism because older kernels won't return -ENOSYS.
+ *  There are exactly three vsyscalls.  vsyscall 0 is at -10Mbyte,
+ *  and vsyscalls 1 and 2 are 1024 and 2048 bytes past vsyscall 0.
+ *  We cannot (and do not want to) add more.
  *
  *  Note: the concept clashes with user mode linux.  UML users should
  *  use the vDSO.
@@ -107,29 +105,17 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
 	       regs->sp, regs->ax, regs->si, regs->di);
 }
 
-static int addr_to_vsyscall_nr(unsigned long addr)
-{
-	int nr;
-
-	if ((addr & ~0xC00UL) != VSYSCALL_START)
-		return -EINVAL;
-
-	nr = (addr & 0xC00UL) >> 10;
-	if (nr >= 3)
-		return -EINVAL;
-
-	return nr;
-}
-
-void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
+static void emulate_vsyscall(struct pt_regs *regs, int nr,
+			     long (*vsys)(struct pt_regs *))
 {
 	struct task_struct *tsk;
 	unsigned long caller;
-	int vsyscall_nr;
 	long ret;
 
 	local_irq_enable();
 
+	trace_emulate_vsyscall(nr);
+
 	if (!user_64bit_mode(regs)) {
 		/*
 		 * If we trapped from kernel mode, we might as well OOPS now
@@ -140,50 +126,29 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 
 		/* Compat mode and non-compat 32-bit CS should both segfault. */
 		warn_bad_vsyscall(KERN_WARNING, regs,
-				  "illegal int 0xcc from 32-bit mode");
+				  "illegal emulated vsyscall from 32-bit mode");
 		goto sigsegv;
 	}
 
-	/*
-	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
-	 * and int 0xcc is two bytes long.
-	 */
-	vsyscall_nr = addr_to_vsyscall_nr(regs->ip - 2);
-
-	trace_emulate_vsyscall(vsyscall_nr);
-
-	if (vsyscall_nr < 0) {
-		warn_bad_vsyscall(KERN_WARNING, regs,
-				  "illegal int 0xcc (exploit attempt?)");
-		goto sigsegv;
-	}
+	tsk = current;
+	if (seccomp_mode(&tsk->seccomp))
+		do_exit(SIGKILL);
 
 	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
 		warn_bad_vsyscall(KERN_WARNING, regs, "int 0xcc with bad stack (exploit attempt?)");
 		goto sigsegv;
 	}
 
-	tsk = current;
-	if (seccomp_mode(&tsk->seccomp))
-		do_exit(SIGKILL);
+	/*
+	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
+	 * and int 0xcc is two bytes long.
+	 */
+	if (((regs->ip - 2) & ~0xfff) != VSYSCALL_START)
+		warn_bad_vsyscall(KERN_WARNING, regs,
+				  "emulated vsyscall from bogus address -- "
+				  "fix your code");
 
-	switch (vsyscall_nr) {
-	case 0:
-		ret = sys_gettimeofday(
-			(struct timeval __user *)regs->di,
-			(struct timezone __user *)regs->si);
-		break;
-
-	case 1:
-		ret = sys_time((time_t __user *)regs->di);
-		break;
-
-	case 2:
-		ret = sys_getcpu((unsigned __user *)regs->di,
-				 (unsigned __user *)regs->si,
-				 0);
-		break;
-	}
+	ret = vsys(regs);
 
 	if (ret == -EFAULT) {
 		/*
@@ -213,6 +178,45 @@ sigsegv:
 	local_irq_disable();
 }
 
+
+/*
+ * These are the actual vsyscall emulation entries.
+ */
+
+static long vsys_gettimeofday(struct pt_regs *regs)
+{
+	return  sys_gettimeofday(
+		(struct timeval __user *)regs->di,
+		(struct timezone __user *)regs->si);
+}
+
+void dotraplinkage do_emulate_vsyscall0(struct pt_regs *regs, long error_code)
+{
+	emulate_vsyscall(regs, 0, vsys_gettimeofday);
+}
+
+static long vsys_time(struct pt_regs *regs)
+{
+	return sys_time((time_t __user *)regs->di);
+}
+
+void dotraplinkage do_emulate_vsyscall1(struct pt_regs *regs, long error_code)
+{
+	emulate_vsyscall(regs, 1, vsys_time);
+}
+
+static long vsys_getcpu(struct pt_regs *regs)
+{
+	return sys_getcpu((unsigned __user *)regs->di,
+			  (unsigned __user *)regs->si,
+			  0);
+}
+
+void dotraplinkage do_emulate_vsyscall2(struct pt_regs *regs, long error_code)
+{
+	emulate_vsyscall(regs, 2, vsys_getcpu);
+}
+
 /*
  * Assume __initcall executes before all user space. Hopefully kmod
  * doesn't violate that. We'll find out if it does.
diff --git a/arch/x86/kernel/vsyscall_emu_64.S b/arch/x86/kernel/vsyscall_emu_64.S
index ffa845e..a4f02a3 100644
--- a/arch/x86/kernel/vsyscall_emu_64.S
+++ b/arch/x86/kernel/vsyscall_emu_64.S
@@ -13,15 +13,15 @@
 
 .section .vsyscall_0, "a"
 ENTRY(vsyscall_0)
-	int $VSYSCALL_EMU_VECTOR
+	int $VSYSCALL0_EMU_VECTOR
 END(vsyscall_0)
 
 .section .vsyscall_1, "a"
 ENTRY(vsyscall_1)
-	int $VSYSCALL_EMU_VECTOR
+	int $VSYSCALL1_EMU_VECTOR
 END(vsyscall_1)
 
 .section .vsyscall_2, "a"
 ENTRY(vsyscall_2)
-	int $VSYSCALL_EMU_VECTOR
+	int $VSYSCALL2_EMU_VECTOR
 END(vsyscall_2)
-- 
1.7.6



* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-06  3:04                       ` [RFC v2] " Andy Lutomirski
@ 2011-08-06  6:45                         ` Ingo Molnar
  2011-08-07 12:19                           ` Borislav Petkov
  2011-08-11 13:16                         ` Pavel Machek
  1 sibling, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2011-08-06  6:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin,
	Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung,
	Suresh Siddha


* Andy Lutomirski <luto@mit.edu> wrote:

> A few dynamic recompilation tools are too clever for their own 
> good. They trace control flow through the vsyscall page and 
> recompile that code somewhere else.  Then they expect it to work.  
> DynamoRIO (http://dynamorio.org/) and Pin (http://www.pintool.org/) 
> are affected.  They crash when tracing programs that use vsyscalls. 
> Valgrind is smart enough not to cause problems.  It crashes on the 
> getcpu vsyscall, but that has nothing to do with emulation.
> 
> This patch makes each of the three vsyscall entries use a different 
> vector so that they can work when relocated.  It assumes that the 
> code that relocates them is okay with the int instruction acting 
> like ret.  DynamoRIO at least appears to work.
> 
> We print an obnoxious (rate-limited) message to the log when this 
> happens.  Hopefully it will inspire the JIT tools to learn not to 
> trace into kernel address space.
> 
> Signed-off-by: Andy Lutomirski <luto@mit.edu>
> ---
> 
> This uses vectors 0x40, 0x41, and 0x42 for now.  They are REX
> prefixes in 64-bit code, and jumping to the second byte of one
> of these instructions will turn into 'rex.? int3', which will
> trap.
> 
> Changes from v1: Sending the correct patch this time.
> 
>  arch/x86/include/asm/irq_vectors.h |   11 ++--
>  arch/x86/include/asm/traps.h       |    8 ++-
>  arch/x86/kernel/entry_64.S         |    4 +-
>  arch/x86/kernel/traps.c            |   14 ++++-
>  arch/x86/kernel/vsyscall_64.c      |  114 ++++++++++++++++++-----------------
>  arch/x86/kernel/vsyscall_emu_64.S  |    6 +-
>  6 files changed, 88 insertions(+), 69 deletions(-)
> 
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index f9a3209..b9c229a 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -15,10 +15,9 @@
>   * IDT entries:
>   *
>   *  Vectors   0 ...  31 : system traps and exceptions - hardcoded events
> - *  Vectors  32 ... 127 : device interrupts
> - *  Vector  128         : legacy int80 syscall interface
> - *  Vector  204         : legacy x86_64 vsyscall emulation
> - *  Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 except 204 : device interrupts
> + *  Vectors  32 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts, except:
> + *   Vectors 64 ... 66  : legacy x86_64 vsyscall emulation
> + *   Vector  128        : legacy int80 syscall interface
>   *  Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
>   *
>   * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
> @@ -52,7 +51,9 @@
>  # define SYSCALL_VECTOR			0x80
>  #endif
>  #ifdef CONFIG_X86_64
> -# define VSYSCALL_EMU_VECTOR		0xcc
> +# define VSYSCALL0_EMU_VECTOR		0x40
> +# define VSYSCALL1_EMU_VECTOR		0x41
> +# define VSYSCALL2_EMU_VECTOR		0x42
>  #endif
>  
>  /*
> diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
> index 2bae0a5..4335ff7 100644
> --- a/arch/x86/include/asm/traps.h
> +++ b/arch/x86/include/asm/traps.h
> @@ -40,7 +40,9 @@ asmlinkage void alignment_check(void);
>  asmlinkage void machine_check(void);
>  #endif /* CONFIG_X86_MCE */
>  asmlinkage void simd_coprocessor_error(void);
> -asmlinkage void emulate_vsyscall(void);
> +asmlinkage void emulate_vsyscall0(void);
> +asmlinkage void emulate_vsyscall1(void);
> +asmlinkage void emulate_vsyscall2(void);
>  
>  dotraplinkage void do_divide_error(struct pt_regs *, long);
>  dotraplinkage void do_debug(struct pt_regs *, long);
> @@ -67,7 +69,9 @@ dotraplinkage void do_alignment_check(struct pt_regs *, long);
>  dotraplinkage void do_machine_check(struct pt_regs *, long);
>  #endif
>  dotraplinkage void do_simd_coprocessor_error(struct pt_regs *, long);
> -dotraplinkage void do_emulate_vsyscall(struct pt_regs *, long);
> +dotraplinkage void do_emulate_vsyscall0(struct pt_regs *, long);
> +dotraplinkage void do_emulate_vsyscall1(struct pt_regs *, long);
> +dotraplinkage void do_emulate_vsyscall2(struct pt_regs *, long);
>  #ifdef CONFIG_X86_32
>  dotraplinkage void do_iret_error(struct pt_regs *, long);
>  #endif
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index e13329d..10489e5 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1111,7 +1111,9 @@ zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
>  zeroentry coprocessor_error do_coprocessor_error
>  errorentry alignment_check do_alignment_check
>  zeroentry simd_coprocessor_error do_simd_coprocessor_error
> -zeroentry emulate_vsyscall do_emulate_vsyscall
> +zeroentry emulate_vsyscall0 do_emulate_vsyscall0
> +zeroentry emulate_vsyscall1 do_emulate_vsyscall1
> +zeroentry emulate_vsyscall2 do_emulate_vsyscall2
>  
>  
>  	/* Reload gs selector with exception handling */
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 9682ec5..6ae5e3a 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -873,9 +873,17 @@ void __init trap_init(void)
>  #endif
>  
>  #ifdef CONFIG_X86_64
> -	BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
> -	set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
> -	set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
> +	BUG_ON(test_bit(VSYSCALL0_EMU_VECTOR, used_vectors));
> +	set_system_intr_gate(VSYSCALL0_EMU_VECTOR, &emulate_vsyscall0);
> +	set_bit(VSYSCALL0_EMU_VECTOR, used_vectors);
> +
> +	BUG_ON(test_bit(VSYSCALL1_EMU_VECTOR, used_vectors));
> +	set_system_intr_gate(VSYSCALL1_EMU_VECTOR, &emulate_vsyscall1);
> +	set_bit(VSYSCALL1_EMU_VECTOR, used_vectors);
> +
> +	BUG_ON(test_bit(VSYSCALL2_EMU_VECTOR, used_vectors));
> +	set_system_intr_gate(VSYSCALL2_EMU_VECTOR, &emulate_vsyscall2);
> +	set_bit(VSYSCALL2_EMU_VECTOR, used_vectors);
>  #endif
>  
>  	/*
> diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
> index 93a0d46..a33ad02 100644
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -8,11 +8,9 @@
>   *  Special thanks to Ingo Molnar for his early experience with
>   *  a different vsyscall implementation for Linux/IA32 and for the name.
>   *
> - *  vsyscall 1 is located at -10Mbyte, vsyscall 2 is located
> - *  at virtual address -10Mbyte+1024bytes etc... There are at max 4
> - *  vsyscalls. One vsyscall can reserve more than 1 slot to avoid
> - *  jumping out of line if necessary. We cannot add more with this
> - *  mechanism because older kernels won't return -ENOSYS.
> + *  There are exactly three vsyscalls.  vsyscall 0 is at -10Mbyte,
> + *  and vsyscalls 1 and 2 are 1024 and 2048 bytes past vsyscall 0.
> + *  We cannot (and do not want to) add more.
>   *
>   *  Note: the concept clashes with user mode linux.  UML users should
>   *  use the vDSO.
> @@ -107,29 +105,17 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
>  	       regs->sp, regs->ax, regs->si, regs->di);
>  }
>  
> -static int addr_to_vsyscall_nr(unsigned long addr)
> -{
> -	int nr;
> -
> -	if ((addr & ~0xC00UL) != VSYSCALL_START)
> -		return -EINVAL;
> -
> -	nr = (addr & 0xC00UL) >> 10;
> -	if (nr >= 3)
> -		return -EINVAL;
> -
> -	return nr;
> -}
> -
> -void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
> +static void emulate_vsyscall(struct pt_regs *regs, int nr,
> +			     long (*vsys)(struct pt_regs *))
>  {
>  	struct task_struct *tsk;
>  	unsigned long caller;
> -	int vsyscall_nr;
>  	long ret;
>  
>  	local_irq_enable();
>  
> +	trace_emulate_vsyscall(nr);
> +
>  	if (!user_64bit_mode(regs)) {
>  		/*
>  		 * If we trapped from kernel mode, we might as well OOPS now
> @@ -140,50 +126,29 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
>  
>  		/* Compat mode and non-compat 32-bit CS should both segfault. */
>  		warn_bad_vsyscall(KERN_WARNING, regs,
> -				  "illegal int 0xcc from 32-bit mode");
> +				  "illegal emulated vsyscall from 32-bit mode");
>  		goto sigsegv;
>  	}
>  
> -	/*
> -	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
> -	 * and int 0xcc is two bytes long.
> -	 */
> -	vsyscall_nr = addr_to_vsyscall_nr(regs->ip - 2);
> -
> -	trace_emulate_vsyscall(vsyscall_nr);
> -
> -	if (vsyscall_nr < 0) {
> -		warn_bad_vsyscall(KERN_WARNING, regs,
> -				  "illegal int 0xcc (exploit attempt?)");
> -		goto sigsegv;
> -	}
> +	tsk = current;
> +	if (seccomp_mode(&tsk->seccomp))
> +		do_exit(SIGKILL);
>  
>  	if (get_user(caller, (unsigned long __user *)regs->sp) != 0) {
>  		warn_bad_vsyscall(KERN_WARNING, regs, "int 0xcc with bad stack (exploit attempt?)");
>  		goto sigsegv;
>  	}
>  
> -	tsk = current;
> -	if (seccomp_mode(&tsk->seccomp))
> -		do_exit(SIGKILL);
> +	/*
> +	 * x86-ism here: regs->ip points to the instruction after the int 0xcc,
> +	 * and int 0xcc is two bytes long.
> +	 */
> +	if (((regs->ip - 2) & ~0xfff) != VSYSCALL_START)
> +		warn_bad_vsyscall(KERN_WARNING, regs,
> +				  "emulated vsyscall from bogus address -- "
> +				  "fix your code");
>  
> -	switch (vsyscall_nr) {
> -	case 0:
> -		ret = sys_gettimeofday(
> -			(struct timeval __user *)regs->di,
> -			(struct timezone __user *)regs->si);
> -		break;
> -
> -	case 1:
> -		ret = sys_time((time_t __user *)regs->di);
> -		break;
> -
> -	case 2:
> -		ret = sys_getcpu((unsigned __user *)regs->di,
> -				 (unsigned __user *)regs->si,
> -				 0);
> -		break;
> -	}
> +	ret = vsys(regs);
>  
>  	if (ret == -EFAULT) {
>  		/*
> @@ -213,6 +178,45 @@ sigsegv:
>  	local_irq_disable();
>  }
>  
> +
> +/*
> + * These are the actual vsyscall emulation entries.
> + */
> +
> +static long vsys_gettimeofday(struct pt_regs *regs)
> +{
> +	return  sys_gettimeofday(
> +		(struct timeval __user *)regs->di,
> +		(struct timezone __user *)regs->si);
> +}
> +
> +void dotraplinkage do_emulate_vsyscall0(struct pt_regs *regs, long error_code)
> +{
> +	emulate_vsyscall(regs, 0, vsys_gettimeofday);
> +}
> +
> +static long vsys_time(struct pt_regs *regs)
> +{
> +	return sys_time((time_t __user *)regs->di);
> +}
> +
> +void dotraplinkage do_emulate_vsyscall1(struct pt_regs *regs, long error_code)
> +{
> +	emulate_vsyscall(regs, 1, vsys_time);
> +}
> +
> +static long vsys_getcpu(struct pt_regs *regs)
> +{
> +	return sys_getcpu((unsigned __user *)regs->di,
> +			  (unsigned __user *)regs->si,
> +			  0);
> +}
> +
> +void dotraplinkage do_emulate_vsyscall2(struct pt_regs *regs, long error_code)
> +{
> +	emulate_vsyscall(regs, 2, vsys_getcpu);
> +}
> +

Surprisingly, this looks a bit cleaner to me than the original code, 
as the emulated syscalls separate out so nicely.
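
For contrast, the path that the patch removes recovered the vsyscall
number from the trapping address, which cannot work once the code has
been relocated.  A user-space sketch of that lookup, mirroring the
removed addr_to_vsyscall_nr(); the numeric VSYSCALL_START value
(-10 MiB as a 64-bit address) and the fixed-width types are assumptions
made for illustration:

```c
#include <errno.h>
#include <stdint.h>

/* Assumed: -10 MiB expressed as an unsigned 64-bit address. */
#define VSYSCALL_START UINT64_C(0xffffffffff600000)

/*
 * Map the address of the trapping "int" instruction to a vsyscall
 * number, as the pre-patch code did.  Only the exact slot starts
 * (base, base + 1024, base + 2048) are accepted.
 */
static int addr_to_vsyscall_nr(uint64_t addr)
{
	int nr;

	/* Every bit except bits 10-11 must match the page base. */
	if ((addr & ~UINT64_C(0xC00)) != VSYSCALL_START)
		return -EINVAL;

	nr = (int)((addr & UINT64_C(0xC00)) >> 10);
	if (nr >= 3)
		return -EINVAL;	/* slot 3 is not a vsyscall */

	return nr;
}
```

Any address outside the page, such as one inside a JIT code buffer,
yields -EINVAL here.  With one vector per vsyscall the number arrives
with the trap itself, so no such lookup is needed.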

The flip side is using up more of our vector space - but 
realistically we could put all this code behind a default-off 
LEGACY_VSYSCALL switch a year or two down the line, when distros have 
upgraded glibc.

Thanks,

	Ingo



* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-06  6:45                         ` Ingo Molnar
@ 2011-08-07 12:19                           ` Borislav Petkov
  2011-08-07 12:58                             ` Andrew Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2011-08-07 12:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, H. Peter Anvin, Andi Kleen, x86, linux-kernel,
	torvalds, lueckintel, kimwooyoung, Suresh Siddha

On Sat, Aug 06, 2011 at 08:45:52AM +0200, Ingo Molnar wrote:
> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> > index 9682ec5..6ae5e3a 100644
> > --- a/arch/x86/kernel/traps.c
> > +++ b/arch/x86/kernel/traps.c
> > @@ -873,9 +873,17 @@ void __init trap_init(void)
> >  #endif
> >  
> >  #ifdef CONFIG_X86_64
> > -	BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
> > -	set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
> > -	set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
> > +	BUG_ON(test_bit(VSYSCALL0_EMU_VECTOR, used_vectors));
> > +	set_system_intr_gate(VSYSCALL0_EMU_VECTOR, &emulate_vsyscall0);
> > +	set_bit(VSYSCALL0_EMU_VECTOR, used_vectors);
> > +
> > +	BUG_ON(test_bit(VSYSCALL1_EMU_VECTOR, used_vectors));
> > +	set_system_intr_gate(VSYSCALL1_EMU_VECTOR, &emulate_vsyscall1);
> > +	set_bit(VSYSCALL1_EMU_VECTOR, used_vectors);
> > +
> > +	BUG_ON(test_bit(VSYSCALL2_EMU_VECTOR, used_vectors));
> > +	set_system_intr_gate(VSYSCALL2_EMU_VECTOR, &emulate_vsyscall2);
> > +	set_bit(VSYSCALL2_EMU_VECTOR, used_vectors);
> >  #endif

..

> Surprisingly, this looks a bit cleaner to me than the original code,
> as the emulated syscalls separate out so nicely.
>
> The flip side is using up more of our vector space - but realistically
> we could put all this code behind a default-off LEGACY_VSYSCALL switch
> a year or two down the line, when distros have upgraded glibc.

This probably needs an entry in <Documentation/feature-removal-schedule.txt>.

Also, what do we do with userspace which decides to hardcode "int 0x4[012]"
somewhere in the meantime?

Thanks.

-- 
Regards/Gruss,
    Boris.


* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-07 12:19                           ` Borislav Petkov
@ 2011-08-07 12:58                             ` Andrew Lutomirski
  2011-08-07 15:44                               ` Borislav Petkov
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-07 12:58 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar, Andy Lutomirski, H. Peter Anvin,
	Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung,
	Suresh Siddha

On Sun, Aug 7, 2011 at 8:19 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Aug 06, 2011 at 08:45:52AM +0200, Ingo Molnar wrote:
>> > diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> > index 9682ec5..6ae5e3a 100644
>> > --- a/arch/x86/kernel/traps.c
>> > +++ b/arch/x86/kernel/traps.c
>> > @@ -873,9 +873,17 @@ void __init trap_init(void)
>> >  #endif
>> >
>> >  #ifdef CONFIG_X86_64
>> > -   BUG_ON(test_bit(VSYSCALL_EMU_VECTOR, used_vectors));
>> > -   set_system_intr_gate(VSYSCALL_EMU_VECTOR, &emulate_vsyscall);
>> > -   set_bit(VSYSCALL_EMU_VECTOR, used_vectors);
>> > +   BUG_ON(test_bit(VSYSCALL0_EMU_VECTOR, used_vectors));
>> > +   set_system_intr_gate(VSYSCALL0_EMU_VECTOR, &emulate_vsyscall0);
>> > +   set_bit(VSYSCALL0_EMU_VECTOR, used_vectors);
>> > +
>> > +   BUG_ON(test_bit(VSYSCALL1_EMU_VECTOR, used_vectors));
>> > +   set_system_intr_gate(VSYSCALL1_EMU_VECTOR, &emulate_vsyscall1);
>> > +   set_bit(VSYSCALL1_EMU_VECTOR, used_vectors);
>> > +
>> > +   BUG_ON(test_bit(VSYSCALL2_EMU_VECTOR, used_vectors));
>> > +   set_system_intr_gate(VSYSCALL2_EMU_VECTOR, &emulate_vsyscall2);
>> > +   set_bit(VSYSCALL2_EMU_VECTOR, used_vectors);
>> >  #endif
>
> ..
>
>> Surprisingly, this looks a bit cleaner to me than the original code,
>> as the emulated syscalls separate out so nicely.
>>
>> The flip side is using up more of our vector space - but realistically
>> we could put all this code behind a default-off LEGACY_VSYSCALL switch
>> a year or two down the line, when distros have upgraded glibc.
>
> This probably needs an entry in <Documentation/feature-removal-schedule.txt>.

Will do.  Maybe that will encourage glibc to stop using them in static binaries.

I'll wire up sys_getcpu on x86_64 at the same time for good measure.
It's not currently available as a real syscall.

>
> Also, what do we do with userspace which decides to hardcode "int 0x4[012]"
> somewhere in the meantime?

Break it?  Any code that does that will get an unconditional warning
with this patch.

--Andy


* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-07 12:58                             ` Andrew Lutomirski
@ 2011-08-07 15:44                               ` Borislav Petkov
  2011-08-07 16:14                                 ` Andrew Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2011-08-07 15:44 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Ingo Molnar, H. Peter Anvin, Andi Kleen, x86, linux-kernel,
	torvalds, lueckintel, kimwooyoung, Suresh Siddha

On Sun, Aug 07, 2011 at 08:58:56AM -0400, Andrew Lutomirski wrote:
> > Also, what do we do with userspace which decides to hardcode "int 0x4[012]"
> > somewhere in the meantime?
> 
> Break it?  Any code that does that will get an unconditional warning
> with this patch.

Ok, I hope you're right. Because I'm sure you remember the last
prominent time the kernel broke userspace in the face of powertop.
Although having the warning should be fine, i.e. along the lines of "you
silly userspace process have been warned."

Thanks.

-- 
Regards/Gruss,
    Boris.


* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-07 15:44                               ` Borislav Petkov
@ 2011-08-07 16:14                                 ` Andrew Lutomirski
  0 siblings, 0 replies; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-07 16:14 UTC (permalink / raw)
  To: Borislav Petkov, Andrew Lutomirski, Ingo Molnar, H. Peter Anvin,
	Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung,
	Suresh Siddha

On Sun, Aug 7, 2011 at 11:44 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sun, Aug 07, 2011 at 08:58:56AM -0400, Andrew Lutomirski wrote:
>> > Also, what do we do with userspace which decides to hardcode "int 0x4[012]"
>> > somewhere in the meantime?
>>
>> Break it?  Any code that does that will get an unconditional warning
>> with this patch.
>
> Ok, I hope you're right. Because I'm sure you remember the last
> prominent time the kernel broke userspace in the face of powertop.
> Although having the warning should be fine, i.e. along the lines of "you
> silly userspace process have been warned."

We have an advantage this time: why would anyone want to use them?
They're annoying to use and they're slower than syscalls.

We could resurrect the patch that randomized the vectors at boot, but
that was IMO ugly.

--Andy


* Re: New vsyscall emulation breaks JITs
  2011-08-05 21:21           ` Andi Kleen
  2011-08-05 21:26             ` Andrew Lutomirski
@ 2011-08-09 13:26             ` Andrew Lutomirski
  2011-08-09 15:04               ` Andi Kleen
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-09 13:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Fri, Aug 5, 2011 at 5:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Aug 05, 2011 at 05:00:44PM -0400, Andrew Lutomirski wrote:
>> > If anything runtime, but really for me it looks like the vsyscall
>> > changes should be only in one of those limited compatibility paranoia
>> > patchkits.
>>
>> Switching it in runtime would be a giant mess because user code might
>
> You can always switch at boot time.
>
> But really serious binary incompatibility like this should not be default
> (not even talking about the slow down for existing binaries using time())

Why do we care about pin again?

$ ./pin -t obj-intel64/opcodemix.so -- /bin/ls
E:3.0 is not a supported linux release

So we've already broken it completely, and they'll have to release a
new version anyway to fix it.  This version is from June of this year.

I'll send out the updated patch anyway for the benefit of DynamoRIO
(which is the only thing I know of that is affected).

--Andy


* Re: New vsyscall emulation breaks JITs
  2011-08-09 13:26             ` Andrew Lutomirski
@ 2011-08-09 15:04               ` Andi Kleen
  2011-08-09 15:22                 ` Andrew Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2011-08-09 15:04 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, H. Peter Anvin, x86, linux-kernel, torvalds,
	lueckintel, kimwooyoung

> Why do we care about pin again?

Binary compatibility is not about someone especially caring.

> 
> $ ./pin -t obj-intel64/opcodemix.so -- /bin/ls
> E:3.0 is not a supported linux release

It works with

echo "2.6.39" > /osrelease
mount --bind /osrelease /proc/sys/kernel/osrelease

-Andi



* Re: New vsyscall emulation breaks JITs
  2011-08-09 15:04               ` Andi Kleen
@ 2011-08-09 15:22                 ` Andrew Lutomirski
  2011-08-09 16:47                   ` [RFC] x86-64: Add vsyscall=emulate|native|none option Andy Lutomirski
  2011-08-09 16:57                   ` New vsyscall emulation breaks JITs H. Peter Anvin
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-09 15:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Tue, Aug 9, 2011 at 11:04 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> Why do we care about pin again?
>
> Binary compatibility is not about someone especially caring.

I'm all for binary compatibility with programs that already worked...

>
>>
>> $ ./pin -t obj-intel64/opcodemix.so -- /bin/ls
>> E:3.0 is not a supported linux release
>
> It works with
>
> echo "2.6.39" > /osrelease
> mount --bind /osrelease /proc/sys/kernel/osrelease
>

...but that's more than a little bit sad.  (For example, my patches
"broke" anything that relied on vsyscall 3, aka enosys().  But that
was already broken for years, so clearly no one cared.)

In any case, my patch fixes DynamoRIO but not pin.  Pin dies with:

[ 4988.945491] test_vsyscall[4587] emulated vsyscall from bogus
address -- fix your code nr: 0 ip:7fdc3a5ce78f cs:33 sp:7fffc2339a88
ax:ffffffffff600000 si:0 di:400d0a
[ 4988.945497] test_vsyscall[4587] vsyscall fault (exploit attempt?)
nr: 0 ip:7fdc3a5ce78f cs:33 sp:7fffc2339a88 ax:ffffffffff600000 si:0
di:400d0a

and I don't know what's going on.  I suspect that the tracer assumes
that int 0x40 continues execution at the next instruction.
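
The first log line is what the page check in the RFC v2 patch would
produce for a relocated int instruction.  A minimal user-space sketch
of that check, assuming VSYSCALL_START is -10 MiB as a 64-bit address;
trap_came_from_vsyscall_page() is a hypothetical name, not a kernel
function:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: -10 MiB expressed as an unsigned 64-bit address. */
#define VSYSCALL_START UINT64_C(0xffffffffff600000)

/* "int imm8" is two bytes long, so the trap reports ip = insn + 2. */
#define INT_INSN_LEN 2

/*
 * Return true if the "int" that trapped lives in the 4 KiB vsyscall
 * page.  ip_after_int is the saved instruction pointer, which points
 * just past the int instruction.
 */
static bool trap_came_from_vsyscall_page(uint64_t ip_after_int)
{
	uint64_t insn = ip_after_int - INT_INSN_LEN;

	return (insn & ~UINT64_C(0xfff)) == VSYSCALL_START;
}
```

For the logged ip:7fdc3a5ce78f the check fails, since that address is
in ordinary user memory rather than the vsyscall page, which matches
the "bogus address" warning.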

x86 maintainers: I can think of a few choices:

1. Stick a ret instruction in the vsyscall page.  Downside: now
there's an unrestricted ret instruction in the vsyscall page.

2. Don't apply my patch and let pin and DynamoRIO break.  Downside:
DynamoRIO actually works in 3.0.

3. Apply my patch and assume that the number of users that would
benefit from a more complete fix is close to zero, since pin won't
even try to run on 3.0 or 3.1 without gross hacks.  (Pin is prerelease
software and apparently actively maintained by people who make it very
hard for non-users to contact, but I'm trying.)

4. Put native syscall instructions in the vsyscall page, mark it NX,
and trap and emulate vsyscalls on the instruction fetch fault.  I can
do this with no overhead in the success path of the page fault code,
but it will slow down vsyscall emulation quite a bit and it's
intrusive.  This seems like overkill to fix a single known binary that
doesn't really work anyway.

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC] x86-64: Add vsyscall=emulate|native|none option
  2011-08-09 15:22                 ` Andrew Lutomirski
@ 2011-08-09 16:47                   ` Andy Lutomirski
  2011-08-09 19:54                     ` Linus Torvalds
  2011-08-09 16:57                   ` New vsyscall emulation breaks JITs H. Peter Anvin
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2011-08-09 16:47 UTC (permalink / raw)
  To: x86
  Cc: Andy Lutomirski, H. Peter Anvin, Andi Kleen, linux-kernel,
	torvalds, lueckintel, kimwooyoung, Ingo Molnar, Borislav Petkov

vsyscall=native makes vsyscalls as fast as syscalls and makes pin
and DynamoRIO work.  vsyscall=emulate (default) preserves current
behavior, and vsyscall=none is good for paranoid people who don't
need their boxes to work reliably.

Signed-off-by: Andy Lutomirski <luto@mit.edu>
---

This is an alternate fix.  It applies on top of the patch that
wires up getcpu on x86_64.
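
For reference, here is a rough userspace sketch of the entry layout the
native page relies on.  It assumes the three entries sit at 1024-byte
strides from 0xffffffffff600000, in the order gettimeofday, time,
getcpu.  The helper names are made up for illustration and are not part
of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Base of the legacy vsyscall page (the 0xffffffffff600x00 addresses). */
#define VSYSCALL_BASE   UINT64_C(0xffffffffff600000)
#define VSYSCALL_STRIDE UINT64_C(1024)	/* matches the .balign 1024 entries */
#define VSYSCALL_COUNT  3		/* gettimeofday, time, getcpu */

/* Hypothetical helper: 0..2 for a valid entry address, -1 otherwise. */
static int vsyscall_nr_from_addr(uint64_t addr)
{
	uint64_t off;

	if (addr < VSYSCALL_BASE)
		return -1;

	off = addr - VSYSCALL_BASE;
	if (off % VSYSCALL_STRIDE != 0)
		return -1;		/* not at the start of an entry */
	if (off / VSYSCALL_STRIDE >= VSYSCALL_COUNT)
		return -1;		/* past the last entry */

	return (int)(off / VSYSCALL_STRIDE);
}

/* Hypothetical helper: name of the syscall behind entry nr. */
static const char *vsyscall_name(int nr)
{
	static const char *const names[VSYSCALL_COUNT] = {
		"gettimeofday", "time", "getcpu",
	};

	assert(nr >= 0 && nr < VSYSCALL_COUNT);
	return names[nr];
}
```

With this layout, an entry at 0xffffffffff600400 maps to nr 1 (time),
and anything that is not exactly at an entry point is rejected.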

 Documentation/kernel-parameters.txt |   21 +++++++++++++++++++
 arch/x86/kernel/vsyscall_64.c       |   35 +++++++++++++++++++++++++++++++-
 arch/x86/kernel/vsyscall_emu_64.S   |   37 ++++++++++++++++++++++++++++++++++-
 3 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e279b72..78926aa 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2680,6 +2680,27 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	vmpoff=		[KNL,S390] Perform z/VM CP command after power off.
 			Format: <command>
 
+	vsyscall=	[X86-64]
+			Controls the behavior of vsyscalls (i.e. calls to
+			fixed addresses of 0xffffffffff600x00 from legacy
+			code).  Most statically-linked binaries and older
+			versions of glibc use these calls.  Because these
+			functions are at fixed addresses, they make nice
+			targets for exploits that can control RIP.
+
+			emulate     [default] Vsyscalls turn into traps and are
+			            emulated reasonably safely.
+
+			native      Vsyscalls are native syscall instructions.
+			            This is a little bit faster than trapping
+			            and makes a few dynamic recompilers work
+			            better than they would in emulation mode.
+			            It also makes exploits much easier to write.
+
+			none        Vsyscalls don't work at all.  This makes
+			            them quite hard to use for exploits but
+			            might break your system.
+
 	vt.cur_default=	[VT] Default cursor shape.
 			Format: 0xCCBBAA, where AA, BB, and CC are the same as
 			the parameters of the <Esc>[?A;B;Cc escape sequence;
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index bf8e9ff..e06a200 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -56,6 +56,27 @@ DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data) =
 	.lock = __SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock),
 };
 
+static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
+
+static int __init vsyscall_setup(char *str)
+{
+	if (str) {
+		if (!strcmp("emulate", str))
+			vsyscall_mode = EMULATE;
+		else if (!strcmp("native", str))
+			vsyscall_mode = NATIVE;
+		else if (!strcmp("none", str))
+			vsyscall_mode = NONE;
+		else
+			return -EINVAL;
+
+		return 0;
+	}
+
+	return -EINVAL;
+}
+early_param("vsyscall", vsyscall_setup);
+
 void update_vsyscall_tz(void)
 {
 	unsigned long flags;
@@ -151,7 +172,13 @@ void dotraplinkage do_emulate_vsyscall(struct pt_regs *regs, long error_code)
 
 	if (vsyscall_nr < 0) {
 		warn_bad_vsyscall(KERN_WARNING, regs,
-				  "illegal int 0xcc (exploit attempt?)");
+				  "illegal int 0xcc (exploit attempt or buggy program) -- look up the vsyscall kernel parameter if you need a workaround");
+		goto sigsegv;
+	}
+
+	if (vsyscall_mode == NONE) {
+		warn_bad_vsyscall(KERN_INFO, regs,
+				  "vsyscall attempted with vsyscall=none -- sending SIGSEGV");
 		goto sigsegv;
 	}
 
@@ -260,8 +287,12 @@ void __init map_vsyscall(void)
 	extern char __vvar_page;
 	unsigned long physaddr_vvar_page = __pa_symbol(&__vvar_page);
 
-	/* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
+	extern char __native_vsyscall_page;
+	if (vsyscall_mode == NATIVE)
+		physaddr_page0 = __pa_symbol(&__native_vsyscall_page);
+
 	__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
+
 	__set_fixmap(VVAR_PAGE, physaddr_vvar_page, PAGE_KERNEL_VVAR);
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VVAR_PAGE) != (unsigned long)VVAR_ADDRESS);
 }
diff --git a/arch/x86/kernel/vsyscall_emu_64.S b/arch/x86/kernel/vsyscall_emu_64.S
index ffa845e..97bb09d 100644
--- a/arch/x86/kernel/vsyscall_emu_64.S
+++ b/arch/x86/kernel/vsyscall_emu_64.S
@@ -7,9 +7,17 @@
  */
 
 #include <linux/linkage.h>
+
 #include <asm/irq_vectors.h>
+#include <asm/page_types.h>
+#include <asm/unistd_64.h>
+
+/*
+ * There are two versions of the vsyscall code. The unused parts of the
+ * pages are filled with 0xcc by the linker script.
+ */
 
-/* The unused parts of the page are filled with 0xcc by the linker script. */
+/* Mostly safe version used for vsyscall=emulate and vsyscall=none */
 
 .section .vsyscall_0, "a"
 ENTRY(vsyscall_0)
@@ -25,3 +33,30 @@ END(vsyscall_1)
 ENTRY(vsyscall_2)
 	int $VSYSCALL_EMU_VECTOR
 END(vsyscall_2)
+
+
+/* Much less safe version used for vsyscall=native */
+
+__PAGE_ALIGNED_DATA
+	.globl __native_vsyscall_page
+	.balign PAGE_SIZE, 0xcc
+	.type __native_vsyscall_page, @object
+__native_vsyscall_page:
+
+	mov $__NR_gettimeofday, %rax
+	syscall
+	ret
+
+	.balign 1024, 0xcc
+	mov $__NR_time, %rax
+	syscall
+	ret
+
+	.balign 1024, 0xcc
+	mov $__NR_getcpu, %rax
+	syscall
+	ret
+
+	.balign 4096, 0xcc
+
+	.size __native_vsyscall_page, 4096
\ No newline at end of file
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
  2011-08-09 15:22                 ` Andrew Lutomirski
  2011-08-09 16:47                   ` [RFC] x86-64: Add vsyscall=emulate|native|none option Andy Lutomirski
@ 2011-08-09 16:57                   ` H. Peter Anvin
  2011-08-09 17:05                     ` Andrew Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-09 16:57 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On 08/09/2011 10:22 AM, Andrew Lutomirski wrote:
> 
> In any case, my patch fixes DynamoRIO but not pin.  Pin dies with:
> 
> [ 4988.945491] test_vsyscall[4587] emulated vsyscall from bogus
> address -- fix your code nr: 0 ip:7fdc3a5ce78f cs:33 sp:7fffc2339a88
> ax:ffffffffff600000 si:0 di:400d0a
> [ 4988.945497] test_vsyscall[4587] vsyscall fault (exploit attempt?)
> nr: 0 ip:7fdc3a5ce78f cs:33 sp:7fffc2339a88 ax:ffffffffff600000 si:0
> di:400d0a
> 
> and I don't know what's going on.  I suspect that the tracer assumes
> that int 0x40 continues execution at the next instruction.
> 
> x86 maintainers: I can think of a few choices:
> 
> 1. Stick a ret instruction in the vsyscall page.  Downside: now
> there's an unrestricted ret instruction in the vsyscall page.
> 

How much worse is a ret instruction than an INT instruction that
modifies very little of the register state and then rets?

> 3. Apply my patch and assume that the number of users that would
> benefit from a more complete fix is close to zero, since pin won't
> even try to run on 3.0 or 3.1 without gross hacks.  (Pin is prerelease
> software and apparently actively maintained by people who make it very
> hard for non-users to contact, but I'm trying.)

Since pin is going to have to be fixed anyway to run on 3.x, it seems
reasonable to me that they can just fix their vsyscall handling at the
same time.

Now, the multimodal patch seems reasonable, too.

I think to some extent there are no actually good solutions here, just
varying degrees of bad.  Being able to completely disable vsyscall
without having to recompile seems attractive, though.

	-hpa


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
  2011-08-09 16:57                   ` New vsyscall emulation breaks JITs H. Peter Anvin
@ 2011-08-09 17:05                     ` Andrew Lutomirski
       [not found]                       ` <1312919938.17118.YahooMailNeo@web120010.mail.ne1.yahoo.com>
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-09 17:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andi Kleen, x86, linux-kernel, torvalds, lueckintel, kimwooyoung

On Tue, Aug 9, 2011 at 12:57 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 08/09/2011 10:22 AM, Andrew Lutomirski wrote:
>>
>> In any case, my patch fixes DynamoRIO but not pin.  Pin dies with:
>>
>> [ 4988.945491] test_vsyscall[4587] emulated vsyscall from bogus
>> address -- fix your code nr: 0 ip:7fdc3a5ce78f cs:33 sp:7fffc2339a88
>> ax:ffffffffff600000 si:0 di:400d0a
>> [ 4988.945497] test_vsyscall[4587] vsyscall fault (exploit attempt?)
>> nr: 0 ip:7fdc3a5ce78f cs:33 sp:7fffc2339a88 ax:ffffffffff600000 si:0
>> di:400d0a
>>
>> and I don't know what's going on.  I suspect that the tracer assumes
>> that int 0x40 continues execution at the next instruction.
>>
>> x86 maintainers: I can think of a few choices:
>>
>> 1. Stick a ret instruction in the vsyscall page.  Downside: now
>> there's an unrestricted ret instruction in the vsyscall page.
>>
>
> How much worse is a ret instruction than an INT instruction that
> modifies very little of the register state and then rets?

I'm far from an expert in exploit writing, but I suspect it's
sometimes an additional challenge to make sure that rsi and rdi are
valid pointers before jumping into the vsyscall.  That's why I added
the code that turns EFAULT into SIGSEGV.

>
>> 3. Apply my patch and assume that the number of users that would
>> benefit from a more complete fix is close to zero, since pin won't
>> even try to run on 3.0 or 3.1 without gross hacks.  (Pin is prerelease
>> software and apparently actively maintained by people who make it very
>> hard for non-users to contact, but I'm trying.)
>
> Since pin is going to have to be fixed anyway to run on 3.x, it seems
> reasonable to me that they can just fix their vsyscall handling at the
> same time.
>
> Now, the multimodal patch seems reasonable, too.
>
> I think to some extent there are no actually good solutions here, just
> varying degrees of bad.  Being able to completely disable vsyscall
> without having to recompile seems attractive, though.

Agreed.

I have a rather minimal vm that actually works with vsyscall=none.  If
you like that patch, I can send it on top of the patch it depends on.
I could also try to keep it from wasting one page of memory for the
unused image by playing some initdata games or otherwise freeing
whichever page isn't selected.

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC] x86-64: Add vsyscall=emulate|native|none option
  2011-08-09 16:47                   ` [RFC] x86-64: Add vsyscall=emulate|native|none option Andy Lutomirski
@ 2011-08-09 19:54                     ` Linus Torvalds
  0 siblings, 0 replies; 37+ messages in thread
From: Linus Torvalds @ 2011-08-09 19:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, H. Peter Anvin, Andi Kleen, linux-kernel, lueckintel,
	kimwooyoung, Ingo Molnar, Borislav Petkov

On Tue, Aug 9, 2011 at 9:47 AM, Andy Lutomirski <luto@mit.edu> wrote:
> vsyscall=native makes vsyscalls as fast as syscalls and makes pin
> and DynamoRIO work.  vsyscall=emulate (default) preserves current
> behavior, and vsyscall=none is good for paranoid people who don't
> need their boxes to work reliably.

I think at this stage we should do this, but additionally just change
the default to 'native', avoiding the incompatibilities.

                            Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
       [not found]                       ` <1312919938.17118.YahooMailNeo@web120010.mail.ne1.yahoo.com>
@ 2011-08-09 20:59                         ` H. Peter Anvin
  2011-08-09 21:04                         ` Andrew Lutomirski
  1 sibling, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-09 20:59 UTC (permalink / raw)
  To: Greg Lueck
  Cc: Andrew Lutomirski, Andi Kleen, x86, linux-kernel, torvalds, kimwooyoung

On 08/09/2011 02:58 PM, Greg Lueck wrote:
> I apologize that I’m just jumping into this conversation now.  I was
> swamped yesterday and this morning, and I only just started reading it
> today.
> 
> Pin needs to recognize all possible syscall trap instructions, so we
> will need to change our code to recognize INT 0xCC as a syscall trap. 

What you probably SHOULD be doing is to recognize the vsyscall/vdso as a
special chunk of memory.  The act of entering the vsyscall code is the
point where you need to intercept, if you're going to do that; relying
on the contents of the vsyscall/vdso page to have specific properties is
just plain broken, because you're "chasing the implementation" and
violating inherent properties of this particular memory space.

> SYSENTER instruction specially (on 32-bit).  When we see SYSENTER, Pin
> executes the syscall natively and then resumes JIT-compilation at the
> normal resume point in the gate area.  This works regardless of where
> Pin attaches to the application, and it also has the nice advantage that
> Pin tools see the exact sequence of user space instructions that the
> application would execute if it ran natively.

... and it's also complete bunk if you want any modicum of stability.
Keep in mind that the kernel can change the content of the vsyscall/vdso
memory at any time, without notifying userspace.

	-hpa

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
       [not found]                       ` <1312919938.17118.YahooMailNeo@web120010.mail.ne1.yahoo.com>
  2011-08-09 20:59                         ` H. Peter Anvin
@ 2011-08-09 21:04                         ` Andrew Lutomirski
  2011-08-09 22:36                           ` Linus Torvalds
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-09 21:04 UTC (permalink / raw)
  To: Greg Lueck
  Cc: H. Peter Anvin, Andi Kleen, x86, linux-kernel, torvalds, kimwooyoung

On Tue, Aug 9, 2011 at 3:58 PM, Greg Lueck <lueckintel@yahoo.com> wrote:
> I apologize that I’m just jumping into this conversation now.  I was swamped
> yesterday and this morning, and I only just started reading it today.
> Pin needs to recognize all possible syscall trap instructions, so we will
> need to change our code to recognize INT 0xCC as a syscall trap.  When Pin
> recognizes a system call trap instruction, it does _not_ copy the
> instruction into the translated code area.  Instead, we arrange for the trap
> to be executed natively from within our Pin VM engine.  On 64-bit, we use
> the SYSCALL instruction to do the trap regardless of what the original
> instruction was.  The SEGV that Andi saw is really just fallout from the
> fact that Pin didn’t know about INT 0xCC.  We assume that any INT
> instruction with no special semantic will just fault.  We copy these unknown
> INT’s into the Pin translated code area and execute them from there, where
> we expect them to raise a synchronous signal.  Pin’s signal emulation will
> take over from this point in case the application intentionally executed the
> weird INT with the expectation of handling the signal.
> In addition to recognizing the INT 0xCC instruction as a system call, we
> should probably handle unknown INT’s in the vdso / vsyscall gate area
> specially.  For example, we may want to raise a warning since this case
> probably indicates a new system call trap that we must handle specially.
> I need to read through the thread in more detail still, but I think one of
> the proposals was to use additional INT’s for syscall traps in the vsyscall
> area.  If so, Pin will need to recognize these.  It would be helpful to us
> if you could provide a disassembly of the proposed vsyscall and vdso gate
> areas.  Or, we could probably work with Andi to get these from the kernel
> sources.  In particular, we need to know how to find the system call number
> and its arguments at the point when the application executes the INT that
> traps into the kernel.  (We know the normal ABI for passing system call
> arguments, but I suppose it’s possible that these new INT’s will use a
> different ABI.)

Eek.  I'd really rather not have anything make any assumption beyond
the fact that a call or jump to the vsyscall page has certain
semantics.

> I also saw that you bumped into a Pin error with 3.0 kernels.
> Coincidentally, this was fixed last week and will be available in our next
> Pin release.  If you would like a private kit with this fix, I can send you
> one.

That would be helpful.

> Finally, I’d like to answer your questions about why Pin can’t just execute
> the vdso / vsyscall code natively.  We changed the way Pin handles the gate
> code when we added our attach / detach feature, allowing Pin to attach to a
> native process.  Consider that Pin may attach to a process that is executing
> in the middle of the gate code, or worse, it may attach while in a signal
> handler that will subsequently return into the middle of the gate code.  In
> both cases, Pin will not see the CALL instruction that enters the gate, so
> it’s too late to simply call the gate code natively.  We can’t natively
> execute in the middle of the gate because the RET will execute natively and
> continue native execution of the rest of the application, outside of Pin’s
> JIT compiler.  We thought about single-stepping the application until the PC
> is outside of the gate area, but this wouldn’t work in the signal handler
> case.

That's a fun corner case.  Is the problem that you might receive a
signal while single-stepping?

>  Instead, we decided to let Pin JIT-compile the gate code instructions
> just like any other application code, and we handle the SYSENTER instruction
> specially (on 32-bit).  When we see SYSENTER, Pin executes the syscall
> natively and then resumes JIT-compilation at the normal resume point in the
> gate area.  This works regardless of where Pin attaches to the application,
> and it also has the nice advantage that Pin tools see the exact sequence of
> user space instructions that the application would execute if it ran
> natively.

Here's a different proposal, then:

What if the kernel had the sequence:

mov $__NR_whatever,%eax
syscall
ret

in the vsyscall page but marked the vsyscall page NX.  Then the kernel
would emulate the vsyscall when it received an instruction fetch page
fault.  pin could do exactly what it does right now, since the code
that RIP is pointing at if the attach happens right before the fault
would do what it's supposed to do.
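
Roughly, the check in the page fault path might look like the sketch
below.  This is only an illustration: the helper name is hypothetical,
the bit positions are the architectural x86 page-fault error code bits
(bit 2 = user mode, bit 4 = instruction fetch), and a real
implementation would live in the kernel's fault handler rather than in
standalone C.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (architectural). */
#define PF_USER  (UINT64_C(1) << 2)	/* fault happened in user mode */
#define PF_INSTR (UINT64_C(1) << 4)	/* fault was an instruction fetch */

#define VSYSCALL_BASE UINT64_C(0xffffffffff600000)
#define PAGE_SIZE_4K  UINT64_C(4096)

/*
 * Hypothetical predicate: true if this fault is a user-mode instruction
 * fetch from the (NX) vsyscall page, i.e. a candidate for emulation.
 * Data reads of the page and faults elsewhere are left alone.
 */
static bool is_vsyscall_fetch_fault(uint64_t error_code, uint64_t fault_addr)
{
	const uint64_t want = PF_USER | PF_INSTR;

	if ((error_code & want) != want)
		return false;

	return fault_addr >= VSYSCALL_BASE &&
	       fault_addr - VSYSCALL_BASE < PAGE_SIZE_4K;
}
```

Everything that fails this test would fall through to the normal fault
handling, so the success path of ordinary page faults only pays for the
error-code comparison.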

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
  2011-08-06  0:32                     ` H. Peter Anvin
  2011-08-06  3:01                       ` [RFC] x86-64: Allow emulated vsyscalls from user addresses Andy Lutomirski
  2011-08-06  3:04                       ` [RFC v2] " Andy Lutomirski
@ 2011-08-09 22:27                       ` Suresh Siddha
  2 siblings, 0 replies; 37+ messages in thread
From: Suresh Siddha @ 2011-08-09 22:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andrew Lutomirski, Andi Kleen, x86, linux-kernel, torvalds,
	lueckintel, kimwooyoung

On Fri, 2011-08-05 at 17:32 -0700, H. Peter Anvin wrote:
> On 08/05/2011 05:20 PM, Andrew Lutomirski wrote:
> > 
> > I was thinking of 0x20 - 0x39.  0x40, 0x41, and 0x42 should do the
> > trick.  I'll cook up a patch.
> > 
> > If you want to keep those vectors available for devices as well, we
> > could hook do_general_protection instead, but that's a little messy.
> > Are there x86 machines out there that are starved for interrupt
> > vectors?
> > 
> 
> Yes, but 3 aren't going to matter much.
> 
> However, on systems which have interrupt migration enabled we're not
> using 0x21-0x2f for anything (because we need a single interrupt with
> absolutely lowest priority).

Double checked to make sure and we actually allow 0x21-0x2f to be used
for device interrupts (commit 6579b474572fd54c583ac074e8e7aaae926c62ef).
So reserving the vectors in this range should be same as reserving in
any other range available for use.

Thanks.

>   Out of that range, there are a couple of
> values which should be safe to use because they would be harmless
> instructions of various forms:
> 
> 	0x24	- AND AL, imm8
> 	0x25	- AND EAX, imm32
> 	0x26	- ES:
> 	0x2C	- SUB AL, imm8
> 	0x2D	- SUB EAX, imm32
> 	0x2E	- CS:
> 
> [Cc: Suresh who is the expert on the interrupt assignments]
> 
> 	-hpa


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
  2011-08-09 21:04                         ` Andrew Lutomirski
@ 2011-08-09 22:36                           ` Linus Torvalds
  2011-08-10  0:56                             ` H. Peter Anvin
       [not found]                             ` <1312934493.45753.YahooMailNeo@web120015.mail.ne1.yahoo.com>
  0 siblings, 2 replies; 37+ messages in thread
From: Linus Torvalds @ 2011-08-09 22:36 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Greg Lueck, H. Peter Anvin, Andi Kleen, x86, linux-kernel, kimwooyoung

On Tue, Aug 9, 2011 at 2:04 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>
> Here's a different proposal, then:
>
> What if the kernel had the sequence:
>
> mov $__NR_whatever,%eax
> syscall
> ret
>
> in the vsyscall page but marked the vsyscall page NX.

This sounds like a sound idea. And then the difference between "fast
and native" and "slow and trapping" ends up literally being just the
NX bit.

                        Linus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
  2011-08-09 22:36                           ` Linus Torvalds
@ 2011-08-10  0:56                             ` H. Peter Anvin
       [not found]                             ` <1312934493.45753.YahooMailNeo@web120015.mail.ne1.yahoo.com>
  1 sibling, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-10  0:56 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Lutomirski
  Cc: Greg Lueck, Andi Kleen, x86, linux-kernel, kimwooyoung

Linus Torvalds <torvalds@linux-foundation.org> wrote:

>On Tue, Aug 9, 2011 at 2:04 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>>
>> Here's a different proposal, then:
>>
>> What if the kernel had the sequence:
>>
>> mov $__NR_whatever,%eax
>> syscall
>> ret
>>
>> in the vsyscall page but marked the vsyscall page NX.
>
>This sounds like a sound idea. And then the difference between "fast
>and native" and "slow and trapping" ends up literally being just the
>NX bit.
>
>                        Linus

Very promising idea indeed.
-- 
Sent from my mobile phone. Please excuse my brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: New vsyscall emulation breaks JITs
       [not found]                             ` <1312934493.45753.YahooMailNeo@web120015.mail.ne1.yahoo.com>
@ 2011-08-10  1:49                               ` H. Peter Anvin
  0 siblings, 0 replies; 37+ messages in thread
From: H. Peter Anvin @ 2011-08-10  1:49 UTC (permalink / raw)
  To: Greg Lueck, Linus Torvalds, Andrew Lutomirski
  Cc: Andi Kleen, x86, linux-kernel, kimwooyoung

Greg Lueck <lueckintel@yahoo.com> wrote:

>Yes, this sounds like a cleaner solution.  What happens, though, if the
>system call is interrupted by a signal or by ptrace(ATTACH)?  Does RIP
>point at the target of the RET instruction?  Is it moved back to the
>entry of the vsyscall page?  Does it point immediately after the
>SYSCALL instruction?  GDB might also care about these details.
>
>> That's a fun corner case.  Is the problem that you might receive a
>> signal while single-stepping?
>
>
>Actually, the situation is more difficult.  The application may have
>received a signal while inside the gate, sometime before the SYSENTER
>trap.  The signal context frame on the application's stack now has RIP
>pointing someplace inside the gate.  At this point, Pin attaches to the
>native process, and it has no reasonable way to know about the saved
>context with this RIP value.  Later, the application (running under
>Pin) will return from its handler and resume execution in the middle of
>the gate code.  What can Pin do here?  It's too late to execute
>natively at the start of the gate.  If Pin executes natively at the
>signal return point, Pin will lose control of the application and it
>will execute natively from that point forward.
>
>-- Greg
>
>
>
>
>________________________________
>From: Linus Torvalds <torvalds@linux-foundation.org>
>To: Andrew Lutomirski <luto@mit.edu>
>Cc: Greg Lueck <lueckintel@yahoo.com>; H. Peter Anvin <hpa@zytor.com>;
>Andi Kleen <andi@firstfloor.org>; "x86@kernel.org" <x86@kernel.org>;
>"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>;
>"kimwooyoung@gmail.com" <kimwooyoung@gmail.com>
>Sent: Tuesday, August 9, 2011 6:36 PM
>Subject: Re: New vsyscall emulation breaks JITs
>
>On Tue, Aug 9, 2011 at 2:04 PM, Andrew Lutomirski <luto@mit.edu> wrote:
>>
>> Here's a different proposal, then:
>>
>> What if the kernel had the sequence:
>>
>> mov $__NR_whatever,%eax
>> syscall
>> ret
>>
>> in the vsyscall page but marked the vsyscall page NX.
>
>This sounds like a sound idea. And then the difference between "fast
>and native" and "slow and trapping" ends up literally being just the
>NX bit.
>
>                        Linus

The logical answer is that rip will point to the entry to the vsyscall page.
-- 
Sent from my mobile phone. Please excuse my brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-06  3:04                       ` [RFC v2] " Andy Lutomirski
  2011-08-06  6:45                         ` Ingo Molnar
@ 2011-08-11 13:16                         ` Pavel Machek
  2011-08-11 13:27                           ` Andrew Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Pavel Machek @ 2011-08-11 13:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Andi Kleen, x86, linux-kernel, torvalds,
	lueckintel, kimwooyoung, Suresh Siddha

Hi!

> They trace control flow through the vsyscall page and recompile that
> code somewhere else.  Then they expect it to work.  DynamoRIO
> (http://dynamorio.org/) and Pin (http://www.pintool.org/) are
> affected.  They crash when tracing programs that use vsyscalls.
> Valgrind is smart enough not to cause problems.  It crashes on the
> getcpu vsyscall, but that has nothing to do with emulation.
> 
> This patch makes each of the three vsyscall entries use a different
> vector so that they can work when relocated.  It assumes that the
> code that relocates them is okay with the int instruction acting
> like ret.  DynamoRIO at least appears to work.

int acting as ret is seriously weird semantics. And no, invalid
syscall parameters will not cause segfault, just return of -EFAULT. So
... can this be changed?

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC v2] x86-64: Allow emulated vsyscalls from user addresses
  2011-08-11 13:16                         ` Pavel Machek
@ 2011-08-11 13:27                           ` Andrew Lutomirski
  0 siblings, 0 replies; 37+ messages in thread
From: Andrew Lutomirski @ 2011-08-11 13:27 UTC (permalink / raw)
  To: Pavel Machek
  Cc: H. Peter Anvin, Andi Kleen, x86, linux-kernel, torvalds,
	lueckintel, kimwooyoung, Suresh Siddha

On Thu, Aug 11, 2011 at 9:16 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> They trace control flow through the vsyscall page and recompile that
>> code somewhere else.  Then they expect it to work.  DynamoRIO
>> (http://dynamorio.org/) and Pin (http://www.pintool.org/) are
>> affected.  They crash when tracing programs that use vsyscalls.
>> Valgrind is smart enough not to cause problems.  It crashes on the
>> getcpu vsyscall, but that has nothing to do with emulation.
>>
>> This patch makes each of the three vsyscall entries use a different
>> vector so that they can work when relocated.  It assumes that the
>> code that relocates them is okay with the int instruction acting
>> like ret.  DynamoRIO at least appears to work.
>
> int acting as ret is seriously weird semantics. And no, invalid
> syscall parameters will not cause segfault, just return of -EFAULT. So
> ... can this be changed?

Can which be changed?  int acting as ret already was (in a different
patch, now in tip/x86/vdso), although I still think that user code
should do its best to make no assumptions about the vsyscall page.

invalid syscall parameters do indeed return -EFAULT, but invalid
*v*syscall parameters will segfault in 3.0 and before, since the
vsyscall implementation is just user code and has no exception
handling.
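
To illustrate the first half of that: the sketch below issues the real
gettimeofday system call through syscall(2), bypassing any vDSO or
vsyscall fast path, and reports the errno it gets back.  It assumes a
Linux system where the bad address passed in (for example address 1) is
unmapped; the helper name is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() in <unistd.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: call the raw gettimeofday syscall with tv as the
 * timeval pointer.  Returns the errno of a failed call, or 0 on success.
 */
static int raw_gettimeofday_errno(void *tv)
{
	long rc;

	/* NULL is accepted by the kernel (nothing is stored), so it is
	 * not a useful "bad pointer" for this experiment. */
	assert(tv != NULL);

	errno = 0;
	rc = syscall(SYS_gettimeofday, tv, NULL);
	return rc == -1 ? errno : 0;
}
```

Calling raw_gettimeofday_errno((void *)1) should come back with EFAULT
rather than killing the process, which is the syscall behavior described
above.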

--Andy

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2011-08-11 13:27 UTC | newest]

Thread overview: 37+ messages
2011-08-05 20:09 New vsyscall emulation breaks JITs Andi Kleen
2011-08-05 20:23 ` H. Peter Anvin
2011-08-05 20:26   ` Andi Kleen
2011-08-05 20:36     ` H. Peter Anvin
2011-08-05 20:47       ` Andi Kleen
2011-08-05 20:45   ` Andrew Lutomirski
2011-08-05 20:48     ` H. Peter Anvin
2011-08-05 20:52       ` Andi Kleen
2011-08-05 21:00         ` Andrew Lutomirski
2011-08-05 21:21           ` Andi Kleen
2011-08-05 21:26             ` Andrew Lutomirski
2011-08-05 22:06               ` H. Peter Anvin
2011-08-05 22:11                 ` Andrew Lutomirski
2011-08-06  0:20                   ` Andrew Lutomirski
2011-08-06  0:32                     ` H. Peter Anvin
2011-08-06  3:01                       ` [RFC] x86-64: Allow emulated vsyscalls from user addresses Andy Lutomirski
2011-08-06  3:04                       ` [RFC v2] " Andy Lutomirski
2011-08-06  6:45                         ` Ingo Molnar
2011-08-07 12:19                           ` Borislav Petkov
2011-08-07 12:58                             ` Andrew Lutomirski
2011-08-07 15:44                               ` Borislav Petkov
2011-08-07 16:14                                 ` Andrew Lutomirski
2011-08-11 13:16                         ` Pavel Machek
2011-08-11 13:27                           ` Andrew Lutomirski
2011-08-09 22:27                       ` New vsyscall emulation breaks JITs Suresh Siddha
2011-08-09 13:26             ` Andrew Lutomirski
2011-08-09 15:04               ` Andi Kleen
2011-08-09 15:22                 ` Andrew Lutomirski
2011-08-09 16:47                   ` [RFC] x86-64: Add vsyscall=emulate|native|none option Andy Lutomirski
2011-08-09 19:54                     ` Linus Torvalds
2011-08-09 16:57                   ` New vsyscall emulation breaks JITs H. Peter Anvin
2011-08-09 17:05                     ` Andrew Lutomirski
     [not found]                       ` <1312919938.17118.YahooMailNeo@web120010.mail.ne1.yahoo.com>
2011-08-09 20:59                         ` H. Peter Anvin
2011-08-09 21:04                         ` Andrew Lutomirski
2011-08-09 22:36                           ` Linus Torvalds
2011-08-10  0:56                             ` H. Peter Anvin
     [not found]                             ` <1312934493.45753.YahooMailNeo@web120015.mail.ne1.yahoo.com>
2011-08-10  1:49                               ` H. Peter Anvin
