All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
@ 2015-03-23 16:47 Denys Vlasenko
  2015-03-23 19:37 ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-03-23 16:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

This vDSO code only gets used by 64-bit kernel,
not 32-bit. In 64-bit kernels, data segment is the same
for 32-bit and 64-bit userspace, and SYSRET insn does load %ss
with its selector. No need to repeat it by hand. Segment loads
are somewhat expensive: tens of cycles.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---

Patch was run-tested.

 arch/x86/vdso/vdso32/syscall.S | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
index 5415b56..ccdb9ef 100644
--- a/arch/x86/vdso/vdso32/syscall.S
+++ b/arch/x86/vdso/vdso32/syscall.S
@@ -19,8 +19,15 @@ __kernel_vsyscall:
 .Lpush_ebp:
 	movl	%ecx, %ebp
 	syscall
-	movl	$__USER32_DS, %ecx
-	movl	%ecx, %ss
+	/*
+	 * Used to load __USER32_DS to %ss here,
+	 * but it's not necessary: this vDSO is only used if our kernel
+	 * is 64-bit one (and we are on AMD CPU).
+	 * For 64-bit kernels, __USER32_DS and __USER_DS are the same.
+	 * SYSRET restores %ss to the same value when returning to
+	 * either 64- or 32-bit userspace, and 64-bit kernel uses the same
+	 * descriptor for %ss in 64- and 32-bit userspace.
+	 */
 	movl	%ebp, %ecx
 	popl	%ebp
 .Lpop_ebp:
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-23 16:47 [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss Denys Vlasenko
@ 2015-03-23 19:37 ` Andy Lutomirski
  2015-03-23 20:38   ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2015-03-23 19:37 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 23, 2015 at 9:47 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> This vDSO code only gets used by 64-bit kernel,
> not 32-bit. In 64-bit kernels, data segment is the same
> for 32-bit and 64-bit userspace, and SYSRET insn does load %ss
> with its selector. No need to repeat it by hand. Segment loads
> are somewhat expensive: tens of cycles.
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>
> Patch was run-tested.

Applied, thanks.

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-23 19:37 ` Andy Lutomirski
@ 2015-03-23 20:38   ` Andy Lutomirski
  2015-03-23 21:55     ` Denys Vlasenko
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2015-03-23 20:38 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 23, 2015 at 12:37 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Mar 23, 2015 at 9:47 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> This vDSO code only gets used by 64-bit kernel,
>> not 32-bit. In 64-bit kernels, data segment is the same
>> for 32-bit and 64-bit userspace, and SYSRET insn does load %ss
>> with its selector. No need to repeat it by hand. Segment loads
>> are somewhat expensive: tens of cycles.
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> CC: Linus Torvalds <torvalds@linux-foundation.org>
>> CC: Steven Rostedt <rostedt@goodmis.org>
>> CC: Ingo Molnar <mingo@kernel.org>
>> CC: Borislav Petkov <bp@alien8.de>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: Andy Lutomirski <luto@amacapital.net>
>> CC: Oleg Nesterov <oleg@redhat.com>
>> CC: Frederic Weisbecker <fweisbec@gmail.com>
>> CC: Alexei Starovoitov <ast@plumgrid.com>
>> CC: Will Drewry <wad@chromium.org>
>> CC: Kees Cook <keescook@chromium.org>
>> CC: x86@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>
>> Patch was run-tested.
>
> Applied, thanks.

Actually, I want to remove the added comment in the code.  I don't see
why we should have a specific comment about SS and not about, say, CS,
ESP, or anything else.  OK?

>
> --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-23 20:38   ` Andy Lutomirski
@ 2015-03-23 21:55     ` Denys Vlasenko
  2015-03-24  6:34       ` Ingo Molnar
  0 siblings, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-03-23 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Mon, Mar 23, 2015 at 9:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Actually, I want to remove the added comment in the code.  I don't see
> why we should have a specific comment about SS and not about, say, CS,
> ESP, or anything else.  OK?

Ok.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-23 21:55     ` Denys Vlasenko
@ 2015-03-24  6:34       ` Ingo Molnar
  2015-03-24 14:08         ` Denys Vlasenko
  0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2015-03-24  6:34 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel


* Denys Vlasenko <vda.linux@googlemail.com> wrote:

> On Mon, Mar 23, 2015 at 9:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > Actually, I want to remove the added comment in the code.  I don't see
> > why we should have a specific comment about SS and not about, say, CS,
> > ESP, or anything else.  OK?
> 
> Ok.

Might be nice to place a more generic description there, which 
registers are expected to be saved by user-space calling in here, etc.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24  6:34       ` Ingo Molnar
@ 2015-03-24 14:08         ` Denys Vlasenko
  2015-03-24 15:50           ` Ingo Molnar
  2015-03-24 16:55           ` Brian Gerst
  0 siblings, 2 replies; 16+ messages in thread
From: Denys Vlasenko @ 2015-03-24 14:08 UTC (permalink / raw)
  To: Ingo Molnar, Denys Vlasenko
  Cc: Andy Lutomirski, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On 03/24/2015 07:34 AM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <vda.linux@googlemail.com> wrote:
> 
>> On Mon, Mar 23, 2015 at 9:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> Actually, I want to remove the added comment in the code.  I don't see
>>> why we should have a specific comment about SS and not about, say, CS,
>>> ESP, or anything else.  OK?
>>
>> Ok.
> 
> Might be nice to place a more generic description there, which 
> registers are expected to be saved by user-space calling in here, etc.

__kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
the good old int 0x80 calling convention:

syscall# in eax,
params in ebx/ecx/edx/esi/edi/ebp,
all registers are preserved by the syscall.

(I think we don't guarantee that all flags are preserved:
I have a testcase where DF gets cleared).

Each flavor of fast kernel call does necessary massaging to
conform to the ABI. E.g. SYSCALL-based fast call clobbers ecx,
so its vDSO saves/restores ecx on stack.

Do you want a patch which adds such comment into every vDSO?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 14:08         ` Denys Vlasenko
@ 2015-03-24 15:50           ` Ingo Molnar
  2015-03-24 16:55           ` Brian Gerst
  1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2015-03-24 15:50 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Andy Lutomirski, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 03/24/2015 07:34 AM, Ingo Molnar wrote:
> > 
> > * Denys Vlasenko <vda.linux@googlemail.com> wrote:
> > 
> >> On Mon, Mar 23, 2015 at 9:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>> Actually, I want to remove the added comment in the code.  I don't see
> >>> why we should have a specific comment about SS and not about, say, CS,
> >>> ESP, or anything else.  OK?
> >>
> >> Ok.
> > 
> > Might be nice to place a more generic description there, which 
> > registers are expected to be saved by user-space calling in here, etc.
> 
> __kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
> the good old int 0x80 calling convention:
> 
> syscall# in eax,
> params in ebx/ecx/edx/esi/edi/ebp,
> all registers are preserved by the syscall.
> 
> (I think we don't guarantee that all flags are preserved:
> I have a testcase where DF gets cleared).

I think the fact that the people developing this code are unsure about 
exactly what gets saved/restored is justification enough to document 
the circumstances a bit better.

> Each flavor of fast kernel call does necessary massaging to conform 
> to the ABI. E.g. SYSCALL-based fast call clobbers ecx, so its vDSO 
> saves/restores ecx on stack.
> 
> Do you want a patch which adds such comment into every vDSO?

Well, maybe it's better to extend the already existing descriptions at 
the syscall entry points to be a full description of all details, and 
put a reference to that description into the vDSOs?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 14:08         ` Denys Vlasenko
  2015-03-24 15:50           ` Ingo Molnar
@ 2015-03-24 16:55           ` Brian Gerst
  2015-03-24 20:17             ` Denys Vlasenko
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Gerst @ 2015-03-24 16:55 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Denys Vlasenko, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Tue, Mar 24, 2015 at 10:08 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/24/2015 07:34 AM, Ingo Molnar wrote:
>>
>> * Denys Vlasenko <vda.linux@googlemail.com> wrote:
>>
>>> On Mon, Mar 23, 2015 at 9:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> Actually, I want to remove the added comment in the code.  I don't see
>>>> why we should have a specific comment about SS and not about, say, CS,
>>>> ESP, or anything else.  OK?
>>>
>>> Ok.
>>
>> Might be nice to place a more generic description there, which
>> registers are expected to be saved by user-space calling in here, etc.
>
> __kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
> the good old int 0x80 calling convention:
>
> syscall# in eax,
> params in ebx/ecx/edx/esi/edi/ebp,
> all registers are preserved by the syscall.
>
> (I think we don't guarantee that all flags are preserved:
> I have a testcase where DF gets cleared).

DF should always be clear on any function call per the C ABI.  But,
eflags should be preserved, at least the non-privileged bits.  I'd
like to see that testcase.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 16:55           ` Brian Gerst
@ 2015-03-24 20:17             ` Denys Vlasenko
  2015-03-24 21:40               ` Andy Lutomirski
  2015-03-25  0:59               ` Brian Gerst
  0 siblings, 2 replies; 16+ messages in thread
From: Denys Vlasenko @ 2015-03-24 20:17 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Ingo Molnar, Denys Vlasenko, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1117 bytes --]

On 03/24/2015 05:55 PM, Brian Gerst wrote:
>>> Might be nice to place a more generic description there, which
>>> registers are expected to be saved by user-space calling in here, etc.
>>
>> __kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
>> the good old int 0x80 calling convention:
>>
>> syscall# in eax,
>> params in ebx/ecx/edx/esi/edi/ebp,
>> all registers are preserved by the syscall.
>>
>> (I think we don't guarantee that all flags are preserved:
>> I have a testcase where DF gets cleared).
> 
> DF should always be clear on any function call per the C ABI.  But,
> eflags should be preserved, at least the non-privileged bits.  I'd
> like to see that testcase.

The testcase is a simplistic example of how to find and use
32-bit vDSO to perform system calls.

It also sets flags.DF before syscall, and checks whether registers
are preserved, including flags.DF.

On 32-bit kernel (on Intel CPU, where vDSO uses SYSENTER), I see this:

$ ./test32_syscall_vdso
Result:1

whereas on 64-bit it is

./test32_syscall_vdso
Result:0

"Result:1" means that DF was cleared.

See attached source.


[-- Attachment #2: test32_syscall_vdso.c --]
[-- Type: text/x-csrc, Size: 2919 bytes --]

// (Assuming that int 80 works,) test that syscall32/sysenter32 works
//
// i686-gcc -Os -Wall -static
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <elf.h>

long syscall_addr;
long get_syscall(char **envp)
{
	Elf32_auxv_t *auxv;
	while (*envp++ != NULL)
		continue;
	for (auxv = (void *)envp; auxv->a_type != AT_NULL; auxv++)
		if( auxv->a_type == AT_SYSINFO)
			return auxv->a_un.a_val;
	fprintf(stderr, "AT_SYSINFO not supplied, can't test\n");
	exit(0); /* this is not a failure */
}

int nfds;
fd_set rfds;
fd_set wfds;
fd_set efds;
struct timespec timeout;
sigset_t sigmask;
struct {
	sigset_t *sp;
	int sz;
} sigmask_desc;

char RES[] = "Result:0\n";

void prep_args()
{
	nfds = 42;
	FD_ZERO(&rfds);
	FD_ZERO(&wfds);
	FD_ZERO(&efds);
	FD_SET(0, &rfds);
	FD_SET(1, &wfds);
	FD_SET(2, &efds);
	timeout.tv_sec = 0;
	timeout.tv_nsec = 123;
	sigemptyset(&sigmask);
	sigaddset(&sigmask, SIGINT);
	sigaddset(&sigmask, SIGUSR2);
	sigaddset(&sigmask, SIGRTMAX);
	sigmask_desc.sp = &sigmask;
	sigmask_desc.sz = 8; /* bytes */
}

#define PSELECT 308
// pselect(nfds,rfds,wfds,efds,timeout,sigmask)
int main(int argc, char **argv, char **envp)
{
	// This only works for non-static builds:
	//syscall_addr = dlsym(dlopen("linux-gate.so.1", RTLD_NOW), "__kernel_vsyscall");
	syscall_addr = get_syscall(envp);
	prep_args();

	asm("\n"
	// Try 6-arg syscall: pselect. It should return quickly
	"	mov	$308,%eax\n"     // PSELECT
	"	mov	nfds,%ebx\n"     // ebx  arg1
	"	mov	$rfds,%ecx\n"    // ecx  arg2
	"	mov	$wfds,%edx\n"    // edx  arg3
	"	mov	$efds,%esi\n"    // esi  arg4
	"	mov	$timeout,%edi\n" // edi  arg5
	"	mov	$sigmask_desc,%ebp\n" // %ebp arg6
	"	std\n"
	"	call	*syscall_addr\n"
	// Check that registers are not clobbered
	"	pushf\n"
	"	pop	%eax\n"
	"	cmp	nfds,%ebx\n"     // ebx  arg1
	"	movb	$0x32,RES+7\n"
	"	jne	end\n"
	"	cmp	$rfds,%ecx\n"    // ecx  arg2
	"	movb	$0x33,RES+7\n"
	"	jne	end\n"
	"	cmp	$wfds,%edx\n"    // edx  arg3
	"	movb	$0x34,RES+7\n"
	"	jne	end\n"
	"	cmp	$efds,%esi\n"    // esi  arg4
	"	movb	$0x35,RES+7\n"
	"	jne	end\n"
	"	cmp	$timeout,%edi\n" // edi  arg5
	"	movb	$0x36,RES+7\n"
	"	jne	end\n"
	"	cmpl	$sigmask_desc,%ebp\n" // %ebp arg6
	"	movb	$0x37,RES+7\n"
	"	jne	end\n"
// This fails (DF is not preserved)
// on 32-bit kernels. 64-bit ones worked:
	"	bt	$10,%eax\n"      //flags.DF still set?
	"	movb	$0x31,RES+7\n"
	"	jnc	end\n"

	"	movb	$0x30,RES+7\n"

	"end:\n"
	"	mov	$4,%eax\n"	 // WRITE
	"	mov	$1,%ebx\n"       // ebx  arg1
	"	mov	$RES,%ecx\n"     // ecx  arg2
	"	mov	$9,%edx\n"       // edx  arg3
	"	int	$0x80\n"

	// Error if result is not 0 or 1
	// (result:1 means DF was clobbered)
	"	mov	$1,%eax\n"	 // EXIT
	"	xor	%ebx,%ebx\n"     // ebx  arg1
	"	cmpb	$0x31,RES+7\n"
	"	ja	1f\n"
	"	int	$0x80\n"
	"1:	inc	%ebx\n"
	"	int	$0x80\n"
	);
	return 42; /* not reached */
}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 20:17             ` Denys Vlasenko
@ 2015-03-24 21:40               ` Andy Lutomirski
  2015-03-25  9:28                 ` Ingo Molnar
  2015-03-25 14:55                 ` Denys Vlasenko
  2015-03-25  0:59               ` Brian Gerst
  1 sibling, 2 replies; 16+ messages in thread
From: Andy Lutomirski @ 2015-03-24 21:40 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Brian Gerst, Ingo Molnar, Denys Vlasenko, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Tue, Mar 24, 2015 at 1:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/24/2015 05:55 PM, Brian Gerst wrote:
>>>> Might be nice to place a more generic description there, which
>>>> registers are expected to be saved by user-space calling in here, etc.
>>>
>>> __kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
>>> the good old int 0x80 calling convention:
>>>
>>> syscall# in eax,
>>> params in ebx/ecx/edx/esi/edi/ebp,
>>> all registers are preserved by the syscall.
>>>
>>> (I think we don't guarantee that all flags are preserved:
>>> I have a testcase where DF gets cleared).
>>
>> DF should always be clear on any function call per the C ABI.  But,
>> eflags should be preserved, at least the non-privileged bits.  I'd
>> like to see that testcase.
>
> The testcase is a simplistic example of how to find and use
> 32-bit vDSO to perform system calls.
>
> It also sets flags.DF before syscall, and checks whether registers
> are preserved, including flags.DF.
>
> On 32-bit kernel (on Intel CPU, where vDSO uses SYSENTER), I see this:
>
> $ ./test32_syscall_vdso
> Result:1
>
> whereas on 64-bit it is
>
> ./test32_syscall_vdso
> Result:0
>
> "Result:1" means that DF was cleared.
>
> See attached source.
>

The syscall and sysenter stuff is IMO really nasty.  Here's how I'd
like it to work:

When you do "call __kernel_vsyscall", I want the net effect to be that
your eax, ebx, ecx, edx, esi, edi, and ebp at the time of the call end
up *verbatim* in pt_regs.  Your eip and rsp should be such that, if we
iret normally using pt_regs, we end up returning correctly to
userspace.  I want this to be true *regardless* of whether we're doing
a fast-path or slow-path system call.

This means that we have, literally (see below for why ret $4):

int $0x80
ret $4  <-- regs->eip points here

Then we add an opportunistic return trampoline: if a special ti flag
is set (which we set on entry here) and the return eip and regs are
appropriate, then we change the return at the last minute to vdso code
that looks like:

popl $ecx
popl $edx
ret

Obviously, to do this, we need to copy regs->ecx and regs->edx to the
appropriate places.  If we've been traced or other funny business is
going on (TIF_NOTIFY_RESUME or such is set), then we just skip the
optimization entirely.  Everything still works.

This is probably slower than the current code, but I expect it would
be fast enough, and it should be considerably more obviously correct
than the current disaster.

Now we can do a fun hack on top.  On Intel, we have sysenter/sysexitl
and, on AMD, we have syscall/sysretl.  But, if I read the docs right,
Intel has sysretl, too.  So we can ditch sysexit entirely, since this
mechanism no longer has any need to keep the entry and exit
conventions matching.

The vdso code would be something like (so untested it's not even funny):

__kernel_vsyscall:
  ALTERNATIVE_2(something or other)

__kernel_vsyscall_for_intel:
  pushl $edx
  pushl $ecx
  sysenter
  hlt  <-- just for clarity

__kernel_vsyscall_for_amd:
  pushl $ecx
  syscall
__vsyscall_after_syscall_insn:
 ret $4 <-- for binary tracers only

__kernel_vsyscall_for_int80:
  int $0x80  <-- regs->eip points here during *all* vsyscalls

__kernel_vsyscall_slow_ret:
  ret $4

__kernel_vsyscall_sysretl_target:
  popl $ecx
  ret

There is no sysexit.  Take that, Intel.

On sysenter, we copy regs->cx and regs->dx from user memory and then
we increment regs->sp by 4 and point regs->eip to
__kernel_vsyscall_for_int80.  On syscall, we copy regs->cx from user
memory and point regs->eip to __kernel_vsyscall_for_int80.

On opportunistic sysretl, we do:

*regs->sp = regs->cx;  /* put_user or whatever */
regs->eip = __kernel_vsyscall_sysretl_target
...
sysretl

We never do sysexit or sysretl in any other code path.  That is, there
is no really fast path anymore.

On AMD, we could be polite and only do the opportunistic sysretl if we
regs->eip started out pointing to __vsyscall_after_syscall_insn.

Thoughts?

I'm not planning on implementing this in the very near future, though.

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 20:17             ` Denys Vlasenko
  2015-03-24 21:40               ` Andy Lutomirski
@ 2015-03-25  0:59               ` Brian Gerst
  1 sibling, 0 replies; 16+ messages in thread
From: Brian Gerst @ 2015-03-25  0:59 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Denys Vlasenko, Andy Lutomirski, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Tue, Mar 24, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/24/2015 05:55 PM, Brian Gerst wrote:
>>>> Might be nice to place a more generic description there, which
>>>> registers are expected to be saved by user-space calling in here, etc.
>>>
>>> __kernel_vsyscall entry point has the same ABI in any 32-bit vDSO,
>>> the good old int 0x80 calling convention:
>>>
>>> syscall# in eax,
>>> params in ebx/ecx/edx/esi/edi/ebp,
>>> all registers are preserved by the syscall.
>>>
>>> (I think we don't guarantee that all flags are preserved:
>>> I have a testcase where DF gets cleared).
>>
>> DF should always be clear on any function call per the C ABI.  But,
>> eflags should be preserved, at least the non-privileged bits.  I'd
>> like to see that testcase.
>
> The testcase is a simplistic example of how to find and use
> 32-bit vDSO to perform system calls.
>
> It also sets flags.DF before syscall, and checks whether registers
> are preserved, including flags.DF.
>
> On 32-bit kernel (on Intel CPU, where vDSO uses SYSENTER), I see this:
>
> $ ./test32_syscall_vdso
> Result:1
>
> whereas on 64-bit it is
>
> ./test32_syscall_vdso
> Result:0
>
> "Result:1" means that DF was cleared.
>
> See attached source.
>

It looks like 32-bit native does not restore eflags before sysexit,
while 64-bit compat does.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 21:40               ` Andy Lutomirski
@ 2015-03-25  9:28                 ` Ingo Molnar
  2015-03-25 15:03                   ` Denys Vlasenko
  2015-03-25 14:55                 ` Denys Vlasenko
  1 sibling, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2015-03-25  9:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Brian Gerst, Denys Vlasenko, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel


* Andy Lutomirski <luto@amacapital.net> wrote:

> Now we can do a fun hack on top.  On Intel, we have 
> sysenter/sysexitl and, on AMD, we have syscall/sysretl.  But, if I 
> read the docs right, Intel has sysretl, too.  So we can ditch 
> sysexit entirely, since this mechanism no longer has any need to 
> keep the entry and exit conventions matching.

So this only affects 32-bit vdsos, because on 64-bit both Intel and 
AMD have and use SYSCALL/SYSRET.

So my question would be: what's the performance difference between 
INT80 and sysenter entries on 32-bit, on modern CPUs?

If it's not too horrible (say below 100 cycles) then we could say that 
we start out the simplification and robustification by switching Intel 
over to INT80 + SYSRET on 32-bit, and once we know the 32-bit SYSRET 
and all the other simplifications work fine we implement the 
SYSENTER-hack on top of that?

Is there any user-space code that relies on being able to execute an 
open coded SYSENTER, or are we shielded via the vDSO?

Doing it this way would make it a lot more practical to pull off, as 
the incentive to implement the SYSENTER hack on Intel CPUs will be 
significant: dozens of cycles on 32-bit. Also, I have no problem with 
putting some pressure on Intel developers, for the absolutely 
indefensible horror interface that SYSENTER turned out to be! ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-24 21:40               ` Andy Lutomirski
  2015-03-25  9:28                 ` Ingo Molnar
@ 2015-03-25 14:55                 ` Denys Vlasenko
  2015-03-25 15:12                   ` Andy Lutomirski
  1 sibling, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-03-25 14:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brian Gerst, Ingo Molnar, Denys Vlasenko, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On 03/24/2015 10:40 PM, Andy Lutomirski wrote:
> The syscall and sysenter stuff is IMO really nasty.  Here's how I'd
> like it to work:
> 
> When you do "call __kernel_vsyscall", I want the net effect to be that
> your eax, ebx, ecx, edx, esi, edi, and ebp at the time of the call end
> up *verbatim* in pt_regs.  Your eip and rsp should be such that, if we
> iret normally using pt_regs, we end up returning correctly to
> userspace.  I want this to be true *regardless* of whether we're doing
> a fast-path or slow-path system call.
> 
> This means that we have, literally (see below for why ret $4):
> 
> int $0x80
> ret $4  <-- regs->eip points here
> 
> Then we add an opportunistic return trampoline: if a special ti flag
> is set (which we set on entry here) and the return eip and regs are
> appropriate, then we change the return at the last minute to vdso code
> that looks like:
> 
> popl $ecx
> popl $edx
> ret

I don't fully understand your intent.

> The vdso code would be something like (so untested it's not even funny):
> 
> __kernel_vsyscall:
>   ALTERNATIVE_2(something or other)
> 
> __kernel_vsyscall_for_intel:
>   pushl $edx
>   pushl $ecx
>   sysenter
>   hlt  <-- just for clarity
> 
> __kernel_vsyscall_for_amd:
>   pushl $ecx
>   syscall
> __vsyscall_after_syscall_insn:
>  ret $4 <-- for binary tracers only

This ret would use former ecx value as return address?


> __kernel_vsyscall_for_int80:
>   int $0x80  <-- regs->eip points here during *all* vsyscalls
> 
> __kernel_vsyscall_slow_ret:
>   ret $4

After returning, this will pop an extra word from __kernel_vsyscall() caller.
They don't expect that.


> __kernel_vsyscall_sysretl_target:
>   popl $ecx
>   ret
> 
> There is no sysexit.  Take that, Intel.
> 
> On sysenter, we copy regs->cx and regs->dx from user memory and then
> we increment regs->sp by 4 and point regs->eip to
> __kernel_vsyscall_for_int80.  On syscall, we copy regs->cx from user
> memory and point regs->eip to __kernel_vsyscall_for_int80.
> 
> On opportunistic sysretl, we do:
> 
> *regs->sp = regs->cx;  /* put_user or whatever */
> regs->eip = __kernel_vsyscall_sysretl_target
> ...
> sysretl
> 
> We never do sysexit or sysretl in any other code path.  That is, there
> is no really fast path anymore.

I still don't understand the purpose those "ret 4" insns.
They don't look right.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-25  9:28                 ` Ingo Molnar
@ 2015-03-25 15:03                   ` Denys Vlasenko
  2015-03-25 15:17                     ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-03-25 15:03 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: Brian Gerst, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On 03/25/2015 10:28 AM, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
>> Now we can do a fun hack on top.  On Intel, we have 
>> sysenter/sysexitl and, on AMD, we have syscall/sysretl.  But, if I 
>> read the docs right, Intel has sysretl, too.  So we can ditch 
>> sysexit entirely, since this mechanism no longer has any need to 
>> keep the entry and exit conventions matching.
> 
> So this only affects 32-bit vdsos, because on 64-bit both Intel and 
> AMD have and use SYSCALL/SYSRET.
> 
> So my question would be: what's the performance difference between 
> INT80 and sysenter entries on 32-bit, on modern CPUs?
>
> If it's not too horrible (say below 100 cycles) then we could say that 
> we start out the simplification and robustification by switching Intel 
> over to INT80 + SYSRET on 32-bit, and once we know the 32-bit SYSRET 
> and all the other simplifications work fine we implement the 
> SYSENTER-hack on top of that?

int 0x80 is about 250 cycles slower than syscall/sysenter.
(I mean, the instruction per se, not the full round-trip).
This looks too horrible to ignore :(


> Is there any user-space code that relies on being able to execute an 
> open coded SYSENTER, or are we shielded via the vDSO?

Userspace can't use open-coded sysenter. It will return to a different
address.

Userspace _can_ do this:

my_sysenter:
        push %ecx
        push %edx
        push %ebp
        movl %esp,%ebp
        sysenter
/* end of my_sysenter() */

...
...
...

	call  my_sysenter

but this depends on matching stack layout with one used by vDSO.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-25 14:55                 ` Denys Vlasenko
@ 2015-03-25 15:12                   ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2015-03-25 15:12 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Brian Gerst, Ingo Molnar, Denys Vlasenko, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Wed, Mar 25, 2015 at 7:55 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/24/2015 10:40 PM, Andy Lutomirski wrote:
>> The syscall and sysenter stuff is IMO really nasty.  Here's how I'd
>> like it to work:
>>
>> When you do "call __kernel_vsyscall", I want the net effect to be that
>> your eax, ebx, ecx, edx, esi, edi, and ebp at the time of the call end
>> up *verbatim* in pt_regs.  Your eip and rsp should be such that, if we
>> iret normally using pt_regs, we end up returning correctly to
>> userspace.  I want this to be true *regardless* of whether we're doing
>> a fast-path or slow-path system call.
>>
>> This means that we have, literally (see below for why ret $4):
>>
>> int $0x80
>> ret $4  <-- regs->eip points here
>>
>> Then we add an opportunistic return trampoline: if a special ti flag
>> is set (which we set on entry here) and the return eip and regs are
>> appropriate, then we change the return at the last minute to vdso code
>> that looks like:
>>
>> popl $ecx
>> popl $edx
>> ret
>
> I don't fully understand your intent.

The idea would be that syscall and sysenter would each be exactly
equivalent to a different sequence of instructions, each culminating
in a jmp to an int80, except that they would enable opportunistic exit
optimizations.  I think I made some mistakes below, though.

>
>> The vdso code would be something like (so untested it's not even funny):
>>
>> __kernel_vsyscall:
>>   ALTERNATIVE_2(something or other)
>>
>> __kernel_vsyscall_for_intel:
>>   pushl $edx
>>   pushl $ecx
>>   sysenter
>>   hlt  <-- just for clarity
>>
>> __kernel_vsyscall_for_amd:
>>   pushl $ecx
>>   syscall
>> __vsyscall_after_syscall_insn:
>>  ret $4 <-- for binary tracers only
>
> This ret would use former ecx value as return address?
>

Nope.  The idea is that syscall32 would be close enough to equivalent
to "mov (%esp),%ecx; int $0x80" that binary tracers would do the right
thing.  But I could easily be off a bit.  If I were to implement it,
that ret instruction would be the very last thing I added, since it
would depend on everything else.

>
>> __kernel_vsyscall_for_int80:
>>   int $0x80  <-- regs->eip points here during *all* vsyscalls
>>
>> __kernel_vsyscall_slow_ret:
>>   ret $4
>
> After returning, this will pop an extra word from __kernel_vsyscall() caller.
> They don't expect that.
>

Whoops.  I did say this was completely untested :).  I guess it would
be more like:

__kernel_vsyscall_for_int80:
  pushl $eax  /* dummy */
  pushl $eax  /* dummy */
  int $0x80  <-- regs->eip points here during *all* vsyscalls

__kernel_vsyscall_slow_ret:
  ret $8

since having the amount of extra scratch stack space vary by entry
type is probably unnecessarily confusing.

>
>> __kernel_vsyscall_sysretl_target:
>>   popl $ecx
>>   ret
>>
>> There is no sysexit.  Take that, Intel.
>>
>> On sysenter, we copy regs->cx and regs->dx from user memory and then
>> we increment regs->sp by 4 and point regs->eip to
>> __kernel_vsyscall_for_int80.  On syscall, we copy regs->cx from user
>> memory and point regs->eip to __kernel_vsyscall_for_int80.
>>
>> On opportunistic sysretl, we do:
>>
>> *regs->sp = regs->cx;  /* put_user or whatever */
>> regs->eip = __kernel_vsyscall_sysretl_target
>> ...
>> sysretl
>>
>> We never do sysexit or sysretl in any other code path.  That is, there
>> is no really fast path anymore.
>
> I still don't understand the purpose those "ret 4" insns.
> They don't look right.

They were wrong.  It's the idea that counts, I hope :)

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss
  2015-03-25 15:03                   ` Denys Vlasenko
@ 2015-03-25 15:17                     ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2015-03-25 15:17 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Brian Gerst, Denys Vlasenko, Linus Torvalds,
	Steven Rostedt, Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Wed, Mar 25, 2015 at 8:03 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/25/2015 10:28 AM, Ingo Molnar wrote:
>>
>> * Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>> Now we can do a fun hack on top.  On Intel, we have
>>> sysenter/sysexitl and, on AMD, we have syscall/sysretl.  But, if I
>>> read the docs right, Intel has sysretl, too.  So we can ditch
>>> sysexit entirely, since this mechanism no longer has any need to
>>> keep the entry and exit conventions matching.
>>
>> So this only affects 32-bit vdsos, because on 64-bit both Intel and
>> AMD have and use SYSCALL/SYSRET.
>>
>> So my question would be: what's the performance difference between
>> INT80 and sysenter entries on 32-bit, on modern CPUs?
>>
>> If it's not too horrible (say below 100 cycles) then we could say that
>> we start out the simplification and robustification by switching Intel
>> over to INT80 + SYSRET on 32-bit, and once we know the 32-bit SYSRET
>> and all the other simplifications work fine we implement the
>> SYSENTER-hack on top of that?
>
> int 0x80 is about 250 cycles slower than syscall/sysenter.
> (I mean, the instruction per se, not the full round-trip).
> This looks too horrible to ignore :(

Agreed.

>
>
>> Is there any user-space code that relies on being able to execute an
>> open coded SYSENTER, or are we shielded via the vDSO?
>
> Userspace can't use open-coded sysenter. It will return to a different
> address.
>
> Userspace _can_ do this:
>
> my_sysenter:
>         push %ecx
>         push %edx
>         push %ebp
>         movl %esp,%ebp
>         sysenter
> /* end of my_sysenter() */
>
> ...
> ...
> ...
>
>         call  my_sysenter
>
> but this depends on matching stack layout with one used by vDSO.
>
>

I'd be rather surprised if anyone does that.  It'll die with SIGILL on
AMD systems.  Similarly, open-coded syscall instructions in 32-bit
code will die with SIGILL on Intel systems.

Gee thanks, anyone.

<with time machine>The correct way to do this ought to have been
straightforward.  Syscall should have stashed eip/rip in r8, r9, or
r10, and sysenter shouldn't exist in long mode.  All of this mess
would just disappear completely.</with time machine>

--Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-03-25 15:18 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-23 16:47 [PATCH] x86: vdso32/syscall.S: do not load __USER32_DS to %ss Denys Vlasenko
2015-03-23 19:37 ` Andy Lutomirski
2015-03-23 20:38   ` Andy Lutomirski
2015-03-23 21:55     ` Denys Vlasenko
2015-03-24  6:34       ` Ingo Molnar
2015-03-24 14:08         ` Denys Vlasenko
2015-03-24 15:50           ` Ingo Molnar
2015-03-24 16:55           ` Brian Gerst
2015-03-24 20:17             ` Denys Vlasenko
2015-03-24 21:40               ` Andy Lutomirski
2015-03-25  9:28                 ` Ingo Molnar
2015-03-25 15:03                   ` Denys Vlasenko
2015-03-25 15:17                     ` Andy Lutomirski
2015-03-25 14:55                 ` Denys Vlasenko
2015-03-25 15:12                   ` Andy Lutomirski
2015-03-25  0:59               ` Brian Gerst

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.