linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling
@ 2021-05-10 18:53 H. Peter Anvin
  2021-05-10 18:53 ` [RFC v2 PATCH 1/7] x86/entry: unify definitions from calling.h and ptrace-abi.h H. Peter Anvin
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

This patchset:

1. Cleans up some duplications between <entry/calling.h> and <asm/ptrace-abi.h>.

2. Swaps the arguments to do_syscall_64() for consistency *and* speed.

3. Adds the maximum number of flags to MSR_SYSCALL_MASK; the previous
   is more of a minimum. The more flags that are masked, the less the
   likelihood of a control leak into the kernel.

4. Consistently treat the system call number as a signed int. This is
   what syscall_get_nr() already does, and therefore what all
   architecture-independent code (e.g. seccomp) already expects.

5. As per the defined semantics of syscall_get_nr(), only the value -1
   is defined as a non-system call, so comparing >= 0 is
   incorrect. Change to != -1.

6. Call sys_ni_syscall() for system calls which are out of range
   except for -1, which is used by ptrace and seccomp as a "skip
   system call" marker) just as for system call numbers that
   correspond to holes in the table.

7. In <entry/calling.h>, factor the PUSH_AND_CLEAR_REGS macro into
   separate PUSH_REGS and CLEAR_REGS macros which can be used
   separately if desired. This will be used by the FRED entry code at
   a later date.

Changes from v1:

* Only -1 should be a non-system call per the cross-architectural
  definition of sys_ni_syscall().
* Fix/improve patch descriptions.

--- 
 arch/x86/entry/calling.h       | 45 ++++++--------------------
 arch/x86/entry/common.c        | 71 ++++++++++++++++++++++++++++--------------
 arch/x86/entry/entry_64.S      |  4 +--
 arch/x86/include/asm/syscall.h | 13 ++++----
 arch/x86/kernel/cpu/common.c   | 12 +++++--
 arch/x86/kernel/head_64.S      |  6 ++--
 6 files changed, 77 insertions(+), 74 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 1/7] x86/entry: unify definitions from calling.h and ptrace-abi.h
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  9:23   ` [tip: x86/asm] x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h> tip-bot2 for H. Peter Anvin (Intel)
  2021-05-10 18:53 ` [RFC v2 PATCH 2/7] x86/entry: reverse arguments to do_syscall_64() H. Peter Anvin
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

The register offsets in <asm/ptrace-abi.h> are duplicated in
entry/calling.h, but are formatted differently and therefore not
compatible. Use the version from <asm/ptrace-abi.h> consistently.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/x86/entry/calling.h  | 36 +-----------------------------------
 arch/x86/kernel/head_64.S |  6 +++---
 2 files changed, 4 insertions(+), 38 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 07a9331d55e7..7436d4a74ecb 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,7 @@
 #include <asm/percpu.h>
 #include <asm/asm-offsets.h>
 #include <asm/processor-flags.h>
+#include <asm/ptrace-abi.h>
 
 /*
 
@@ -62,41 +63,6 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-/* The layout forms the "struct pt_regs" on the stack: */
-/*
- * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
- * unless syscall needs a complete, fully filled "struct pt_regs".
- */
-#define R15		0*8
-#define R14		1*8
-#define R13		2*8
-#define R12		3*8
-#define RBP		4*8
-#define RBX		5*8
-/* These regs are callee-clobbered. Always saved on kernel entry. */
-#define R11		6*8
-#define R10		7*8
-#define R9		8*8
-#define R8		9*8
-#define RAX		10*8
-#define RCX		11*8
-#define RDX		12*8
-#define RSI		13*8
-#define RDI		14*8
-/*
- * On syscall entry, this is syscall#. On CPU exception, this is error code.
- * On hw interrupt, it's IRQ number:
- */
-#define ORIG_RAX	15*8
-/* Return frame for iretq */
-#define RIP		16*8
-#define CS		17*8
-#define EFLAGS		18*8
-#define RSP		19*8
-#define SS		20*8
-
-#define SIZEOF_PTREGS	21*8
-
 .macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaaba8e2..d8b3ebd2bb85 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -62,7 +62,7 @@ SYM_CODE_START_NOALIGN(startup_64)
 	 */
 
 	/* Set up the stack for verify_cpu(), similar to initial_stack below */
-	leaq	(__end_init_task - SIZEOF_PTREGS)(%rip), %rsp
+	leaq	(__end_init_task - FRAME_SIZE)(%rip), %rsp
 
 	leaq	_text(%rip), %rdi
 	pushq	%rsi
@@ -343,10 +343,10 @@ SYM_DATA(initial_vc_handler,	.quad handle_vc_boot_ghcb)
 #endif
 
 /*
- * The SIZEOF_PTREGS gap is a convention which helps the in-kernel unwinder
+ * The FRAME_SIZE gap is a convention which helps the in-kernel unwinder
  * reliably detect the end of the stack.
  */
-SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS)
+SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - FRAME_SIZE)
 	__FINITDATA
 
 	__INIT
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 2/7] x86/entry: reverse arguments to do_syscall_64()
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
  2021-05-10 18:53 ` [RFC v2 PATCH 1/7] x86/entry: unify definitions from calling.h and ptrace-abi.h H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  9:23   ` [tip: x86/asm] x86/entry: Reverse " tip-bot2 for H. Peter Anvin (Intel)
  2021-05-10 18:53 ` [RFC v2 PATCH 3/7] x86/syscall: unconditionally prototype {ia32,x32}_sys_call_table[] H. Peter Anvin
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Reverse the order of arguments to do_syscall_64() so that the first
argument is the pt_regs pointer. This is not only consistent with
*all* other entry points from assembly, but it actually makes the
compiled code slightly better.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/x86/entry/common.c        | 2 +-
 arch/x86/entry/entry_64.S      | 4 ++--
 arch/x86/include/asm/syscall.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7b2542b13ebd..00da0f5420de 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -36,7 +36,7 @@
 #include <asm/irq_stack.h>
 
 #ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
 {
 	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a16a5294d55f..1d9db15fdc69 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -107,8 +107,8 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
 	/* IRQs are off. */
-	movq	%rax, %rdi
-	movq	%rsp, %rsi
+	movq	%rsp, %rdi
+	movq	%rax, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 	/*
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7cbf733d11af..4e20054d7533 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -160,7 +160,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-void do_syscall_64(unsigned long nr, struct pt_regs *regs);
+void do_syscall_64(struct pt_regs *regs, unsigned long nr);
 void do_int80_syscall_32(struct pt_regs *regs);
 long do_fast_syscall_32(struct pt_regs *regs);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 3/7] x86/syscall: unconditionally prototype {ia32,x32}_sys_call_table[]
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
  2021-05-10 18:53 ` [RFC v2 PATCH 1/7] x86/entry: unify definitions from calling.h and ptrace-abi.h H. Peter Anvin
  2021-05-10 18:53 ` [RFC v2 PATCH 2/7] x86/entry: reverse arguments to do_syscall_64() H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  9:23   ` [tip: x86/asm] x86/syscall: Unconditionally " tip-bot2 for H. Peter Anvin (Intel)
  2021-05-10 18:53 ` [RFC v2 PATCH 4/7] x86/syscall: maximize MSR_SYSCALL_MASK H. Peter Anvin
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Even if these APIs are disabled, and the arrays therefore do not
exist, having the prototypes allows us to use IS_ENABLED() rather than
using #ifdefs.

If something ends up trying to actually *use* these arrays a linker
error will ensue.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/x86/include/asm/syscall.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4e20054d7533..f6593cafdbd9 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -21,13 +21,12 @@ extern const sys_call_ptr_t sys_call_table[];
 
 #if defined(CONFIG_X86_32)
 #define ia32_sys_call_table sys_call_table
-#endif
-
-#if defined(CONFIG_IA32_EMULATION)
+#else
+/*
+ * These may not exist, but still put the prototypes in so we
+ * can use IS_ENABLED().
+ */
 extern const sys_call_ptr_t ia32_sys_call_table[];
-#endif
-
-#ifdef CONFIG_X86_X32_ABI
 extern const sys_call_ptr_t x32_sys_call_table[];
 #endif
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 4/7] x86/syscall: maximize MSR_SYSCALL_MASK
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
                   ` (2 preceding siblings ...)
  2021-05-10 18:53 ` [RFC v2 PATCH 3/7] x86/syscall: unconditionally prototype {ia32,x32}_sys_call_table[] H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  9:23   ` [tip: x86/asm] x86/syscall: Maximize MSR_SYSCALL_MASK tip-bot2 for H. Peter Anvin (Intel)
  2021-05-10 18:53 ` [RFC v2 PATCH 5/7] x86/entry: split PUSH_AND_CLEAR_REGS into two submacros H. Peter Anvin
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

It is better to clear as many flags as possible when we do a system
call entry, as opposed to the other way around. The fewer flags we
keep, the lesser the possible interference between the kernel and user
space.

The flags changed are:

CF, PF, AF, ZF, SF, OF: these are arithmetic flags which affect
branches, possibly speculatively. They should be cleared for the same
reasons we now clear all GPRs on entry.

RF: suppresses a code breakpoint on the subsequent instruction. It is
probably impossible to enter the kernel with RF set, but if it is
somehow not, it would break a kernel debugger setting a breakpoint on
the entry point. Either way, user space should not be able to control
kernel behavior here.

ID: this flag has no direct effect (it is a scratch bit only.)
However, there is no reason to retain the user space value in the
kernel, and the standard should be to clear unless needed, not the
other way around.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/x86/kernel/cpu/common.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a1b756c49a93..6cf697574661 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1773,10 +1773,16 @@ void syscall_init(void)
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
 #endif
 
-	/* Flags to clear on syscall */
+	/*
+	 * Flags to clear on syscall; clear as much as possible
+	 * to minimize user space-kernel interference.
+	 */
 	wrmsrl(MSR_SYSCALL_MASK,
-	       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
-	       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
+	       X86_EFLAGS_CF|X86_EFLAGS_PF|X86_EFLAGS_AF|
+	       X86_EFLAGS_ZF|X86_EFLAGS_SF|X86_EFLAGS_TF|
+	       X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
+	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
+	       X86_EFLAGS_AC|X86_EFLAGS_ID);
 }
 
 #else	/* CONFIG_X86_64 */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 5/7] x86/entry: split PUSH_AND_CLEAR_REGS into two submacros
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
                   ` (3 preceding siblings ...)
  2021-05-10 18:53 ` [RFC v2 PATCH 4/7] x86/syscall: maximize MSR_SYSCALL_MASK H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  9:23   ` [tip: x86/asm] x86/entry: Split " tip-bot2 for H. Peter Anvin (Intel)
  2021-05-10 18:53 ` [RFC v2 PATCH 6/7] x86/regs: syscall_get_nr() returns -1 for a non-system call H. Peter Anvin
  2021-05-10 18:53 ` [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs H. Peter Anvin
  6 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

PUSH_AND_CLEAR_REGS, as the name implies, performs two functions:
pushing registers and clearing registers. They don't necessarily have
to be performed in immediate sequence, although all current users
do. Split it into two macros for the case where that isn't desired;
the FRED enabling patchset will eventually make use of this.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/x86/entry/calling.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 7436d4a74ecb..a4c061fb7c6e 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,7 +63,7 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
+.macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
 	movq	8(%rsp), %rsi	/* temporarily store the return address in %rsi */
@@ -90,7 +90,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.if \save_ret
 	pushq	%rsi		/* return address on top of stack */
 	.endif
+.endm
 
+.macro CLEAR_REGS
 	/*
 	 * Sanitize registers of values that a speculation attack might
 	 * otherwise want to exploit. The lower registers are likely clobbered
@@ -112,6 +114,11 @@ For 32-bit we have the following conventions - kernel is built with
 
 .endm
 
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
+	PUSH_REGS rdx=\rdx, rax=\rax, save_ret=\save_ret
+	CLEAR_REGS
+.endm
+
 .macro POP_REGS pop_rdi=1 skip_r11rcx=0
 	popq %r15
 	popq %r14
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 6/7] x86/regs: syscall_get_nr() returns -1 for a non-system call
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
                   ` (4 preceding siblings ...)
  2021-05-10 18:53 ` [RFC v2 PATCH 5/7] x86/entry: split PUSH_AND_CLEAR_REGS into two submacros H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  9:23   ` [tip: x86/asm] x86/regs: Syscall_get_nr() " tip-bot2 for H. Peter Anvin
  2021-05-10 18:53 ` [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs H. Peter Anvin
  6 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

syscall_get_nr() is defined to return -1 for a non-system call or a
ptrace/seccomp restart; not just any arbitrary number. See comment in
<asm-generic/syscall.h> for the official definition of this function.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/kernel/ptrace.c | 2 +-
 arch/x86/kernel/signal.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 87a4143aa7d7..4c208ea3bd9f 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -911,7 +911,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 		 * syscall with TS_COMPAT still set.
 		 */
 		regs->orig_ax = value;
-		if (syscall_get_nr(child, regs) >= 0)
+		if (syscall_get_nr(child, regs) != -1)
 			child->thread_info.status |= TS_I386_REGS_POKED;
 		break;
 
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index a06cb107c0e8..e12779a2714d 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -713,7 +713,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 		save_v86_state((struct kernel_vm86_regs *) regs, VM86_SIGNAL);
 
 	/* Are we from a system call? */
-	if (syscall_get_nr(current, regs) >= 0) {
+	if (syscall_get_nr(current, regs) != -1) {
 		/* If so, check system call restarting.. */
 		switch (syscall_get_error(current, regs)) {
 		case -ERESTART_RESTARTBLOCK:
@@ -793,7 +793,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
 	}
 
 	/* Did we come from a system call? */
-	if (syscall_get_nr(current, regs) >= 0) {
+	if (syscall_get_nr(current, regs) != -1) {
 		/* Restart the system call - no handlers present */
 		switch (syscall_get_error(current, regs)) {
 		case -ERESTARTNOHAND:
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
                   ` (5 preceding siblings ...)
  2021-05-10 18:53 ` [RFC v2 PATCH 6/7] x86/regs: syscall_get_nr() returns -1 for a non-system call H. Peter Anvin
@ 2021-05-10 18:53 ` H. Peter Anvin
  2021-05-12  8:51   ` Ingo Molnar
  2021-05-12 12:09   ` Thomas Gleixner
  6 siblings, 2 replies; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-10 18:53 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

From: "H. Peter Anvin (Intel)" <hpa@zytor.com>

Redefine the system call number consistently to be "int". The value -1
is a non-system call (which can be poked in by ptrace/seccomp to
indicate that no further processing should be done and that the return
value should be the current value in regs->ax, default to -ENOSYS; any
other value which does not correspond to a valid system call
unconditionally calls sys_ni_syscall() and returns -ENOSYS just like
any system call that corresponds to a hole in the system call table.

This is the defined semantics of syscall_get_nr(), so that is what all
the architecture-independent code already expects.  As documented in
<asm-generic/syscall.h> (which is simply the documentation file for
<asm/syscall.h>):

/**
 * syscall_get_nr - find what system call a task is executing
 * @task:       task of interest, must be blocked
 * @regs:       task_pt_regs() of @task
 *
 * If @task is executing a system call or is at system call
 * tracing about to attempt one, returns the system call number.
 * If @task is not executing a system call, i.e. it's blocked
 * inside the kernel for a fault or signal, returns -1.
 *
 * Note this returns int even on 64-bit machines.  Only 32 bits of
 * system call number can be meaningful.  If the actual arch value
 * is 64 bits, this truncates to 32 bits so 0xffffffff means -1.
 *
 * It's only valid to call this when @task is known to be blocked.
 */
int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
---
 arch/x86/entry/common.c        | 79 +++++++++++++++++++++++-----------
 arch/x86/entry/entry_64.S      |  2 +-
 arch/x86/include/asm/syscall.h |  2 +-
 3 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 00da0f5420de..bf1ccaf101d7 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -36,61 +36,89 @@
 #include <asm/irq_stack.h>
 
 #ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
+
+static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
+{
+	unsigned long unr = nr;
+
+	if (likely(unr < NR_syscalls)) {
+		unr = array_index_nospec(unr, NR_syscalls);
+		regs->ax = sys_call_table[unr](regs);
+		return true;
+	}
+	return false;
+}
+
+static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
+{
+	unsigned long xnr = nr;
+
+	xnr -= __X32_SYSCALL_BIT;
+
+	if (IS_ENABLED(CONFIG_X86_X32_ABI) &&
+	    likely(xnr < X32_NR_syscalls)) {
+		xnr = array_index_nospec(xnr, X32_NR_syscalls);
+		regs->ax = x32_sys_call_table[xnr](regs);
+		return true;
+	}
+	return false;
+}
+
+__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
 {
 	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
-	if (likely(nr < NR_syscalls)) {
-		nr = array_index_nospec(nr, NR_syscalls);
-		regs->ax = sys_call_table[nr](regs);
-#ifdef CONFIG_X86_X32_ABI
-	} else if (likely((nr & __X32_SYSCALL_BIT) &&
-			  (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
-		nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
-					X32_NR_syscalls);
-		regs->ax = x32_sys_call_table[nr](regs);
-#endif
+
+	if (!do_syscall_x64(regs, nr) &&
+	    !do_syscall_x32(regs, nr) &&
+	    nr != -1) {
+		/* Invalid system call, but still a system call? */
+		regs->ax = __x64_sys_ni_syscall(regs);
 	}
+
 	instrumentation_end();
 	syscall_exit_to_user_mode(regs);
 }
 #endif
 
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
-static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
+static __always_inline int syscall_32_enter(struct pt_regs *regs)
 {
 	if (IS_ENABLED(CONFIG_IA32_EMULATION))
 		current_thread_info()->status |= TS_COMPAT;
 
-	return (unsigned int)regs->orig_ax;
+	return (int)regs->orig_ax;
 }
 
 /*
  * Invoke a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.
  */
-static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs,
-						  unsigned int nr)
+static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs, int nr)
 {
-	if (likely(nr < IA32_NR_syscalls)) {
-		nr = array_index_nospec(nr, IA32_NR_syscalls);
-		regs->ax = ia32_sys_call_table[nr](regs);
+	unsigned long unr = nr;
+
+	if (likely(unr < IA32_NR_syscalls)) {
+		unr = array_index_nospec(unr, IA32_NR_syscalls);
+		regs->ax = ia32_sys_call_table[unr](regs);
+	} else if (nr != -1) {
+		regs->ax = __ia32_sys_ni_syscall(regs);
 	}
 }
 
 /* Handles int $0x80 */
 __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
-	unsigned int nr = syscall_32_enter(regs);
+	int nr = syscall_32_enter(regs);
 
 	add_random_kstack_offset();
 	/*
-	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
-	 * orig_ax, the unsigned int return value truncates it.  This may
-	 * or may not be necessary, but it matches the old asm behavior.
+	 * Subtlety here: if ptrace pokes something larger than 2^31-1 into
+	 * orig_ax, the int return value truncates it. This matches
+	 * the semantics of syscall_get_nr().
 	 */
-	nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
+	nr = syscall_enter_from_user_mode(regs, nr);
 	instrumentation_begin();
 
 	do_syscall_32_irqs_on(regs, nr);
@@ -101,7 +129,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 
 static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 {
-	unsigned int nr = syscall_32_enter(regs);
+	int nr = syscall_32_enter(regs);
 	int res;
 
 	add_random_kstack_offset();
@@ -136,8 +164,7 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 		return false;
 	}
 
-	/* The case truncates any ptrace induced syscall nr > 2^32 -1 */
-	nr = (unsigned int)syscall_enter_from_user_mode_work(regs, nr);
+	nr = syscall_enter_from_user_mode_work(regs, nr);
 
 	/* Now this is just like a normal syscall. */
 	do_syscall_32_irqs_on(regs, nr);
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1d9db15fdc69..85f04ea0e368 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -108,7 +108,7 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 
 	/* IRQs are off. */
 	movq	%rsp, %rdi
-	movq	%rax, %rsi
+	movslq	%eax, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 	/*
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index f6593cafdbd9..f7e2d82d24fb 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -159,7 +159,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-void do_syscall_64(struct pt_regs *regs, unsigned long nr);
+void do_syscall_64(struct pt_regs *regs, int nr);
 void do_int80_syscall_32(struct pt_regs *regs);
 long do_fast_syscall_32(struct pt_regs *regs);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-10 18:53 ` [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs H. Peter Anvin
@ 2021-05-12  8:51   ` Ingo Molnar
  2021-05-12 17:50     ` H. Peter Anvin
  2021-05-12 12:09   ` Thomas Gleixner
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2021-05-12  8:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski,
	Linux Kernel Mailing List


* H. Peter Anvin <hpa@zytor.com> wrote:

> From: "H. Peter Anvin (Intel)" <hpa@zytor.com>
> 
> Redefine the system call number consistently to be "int". The value -1
> is a non-system call (which can be poked in by ptrace/seccomp to
> indicate that no further processing should be done and that the return
> value should be the current value in regs->ax, default to -ENOSYS; any
> other value which does not correspond to a valid system call
> unconditionally calls sys_ni_syscall() and returns -ENOSYS just like
> any system call that corresponds to a hole in the system call table.
> 
> This is the defined semantics of syscall_get_nr(), so that is what all
> the architecture-independent code already expects.  As documented in
> <asm-generic/syscall.h> (which is simply the documentation file for
> <asm/syscall.h>):
> 
> /**
>  * syscall_get_nr - find what system call a task is executing
>  * @task:       task of interest, must be blocked
>  * @regs:       task_pt_regs() of @task
>  *
>  * If @task is executing a system call or is at system call
>  * tracing about to attempt one, returns the system call number.
>  * If @task is not executing a system call, i.e. it's blocked
>  * inside the kernel for a fault or signal, returns -1.
>  *
>  * Note this returns int even on 64-bit machines.  Only 32 bits of
>  * system call number can be meaningful.  If the actual arch value
>  * is 64 bits, this truncates to 32 bits so 0xffffffff means -1.
>  *
>  * It's only valid to call this when @task is known to be blocked.
>  */
> int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);

I've applied patches 1-6, thanks Peter!

Wrt. patch #7 - the changelog is hedging things a bit and the changes are 
non-trivial. Does this patch (intend to) change any actual observable 
behavior in the system call interface, and if yes, in which areas?

Or is this a pure cleanup with no observable changes expected?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [tip: x86/asm] x86/regs: Syscall_get_nr() returns -1 for a non-system call
  2021-05-10 18:53 ` [RFC v2 PATCH 6/7] x86/regs: syscall_get_nr() returns -1 for a non-system call H. Peter Anvin
@ 2021-05-12  9:23   ` tip-bot2 for H. Peter Anvin
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for H. Peter Anvin @ 2021-05-12  9:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: H. Peter Anvin, Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     9ddcb87b9218dec760e8d8a780bc8ad514c3d36a
Gitweb:        https://git.kernel.org/tip/9ddcb87b9218dec760e8d8a780bc8ad514c3d36a
Author:        H. Peter Anvin <hpa@zytor.com>
AuthorDate:    Mon, 10 May 2021 11:53:15 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 12 May 2021 10:49:15 +02:00

x86/regs: Syscall_get_nr() returns -1 for a non-system call

syscall_get_nr() is defined to return -1 for a non-system call or a
ptrace/seccomp restart; not just any arbitrary number. See comment in
<asm-generic/syscall.h> for the official definition of this function.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-7-hpa@zytor.com
---
 arch/x86/kernel/ptrace.c | 2 +-
 arch/x86/kernel/signal.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 87a4143..4c208ea 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -911,7 +911,7 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
 		 * syscall with TS_COMPAT still set.
 		 */
 		regs->orig_ax = value;
-		if (syscall_get_nr(child, regs) >= 0)
+		if (syscall_get_nr(child, regs) != -1)
 			child->thread_info.status |= TS_I386_REGS_POKED;
 		break;
 
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index a06cb10..e12779a 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -713,7 +713,7 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 		save_v86_state((struct kernel_vm86_regs *) regs, VM86_SIGNAL);
 
 	/* Are we from a system call? */
-	if (syscall_get_nr(current, regs) >= 0) {
+	if (syscall_get_nr(current, regs) != -1) {
 		/* If so, check system call restarting.. */
 		switch (syscall_get_error(current, regs)) {
 		case -ERESTART_RESTARTBLOCK:
@@ -793,7 +793,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
 	}
 
 	/* Did we come from a system call? */
-	if (syscall_get_nr(current, regs) >= 0) {
+	if (syscall_get_nr(current, regs) != -1) {
 		/* Restart the system call - no handlers present */
 		switch (syscall_get_error(current, regs)) {
 		case -ERESTARTNOHAND:

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: x86/asm] x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros
  2021-05-10 18:53 ` [RFC v2 PATCH 5/7] x86/entry: split PUSH_AND_CLEAR_REGS into two submacros H. Peter Anvin
@ 2021-05-12  9:23   ` tip-bot2 for H. Peter Anvin (Intel)
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for H. Peter Anvin (Intel) @ 2021-05-12  9:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: H. Peter Anvin (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     29e9758966f47004bd7245e6adadcb708386f36a
Gitweb:        https://git.kernel.org/tip/29e9758966f47004bd7245e6adadcb708386f36a
Author:        H. Peter Anvin (Intel) <hpa@zytor.com>
AuthorDate:    Mon, 10 May 2021 11:53:14 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 12 May 2021 10:49:15 +02:00

x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros

PUSH_AND_CLEAR_REGS, as the name implies, performs two functions:
pushing registers and clearing registers. They don't necessarily have
to be performed in immediate sequence, although all current users
do. Split it into two macros for the case where that isn't desired;
the FRED enabling patchset will eventually make use of this.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-6-hpa@zytor.com
---
 arch/x86/entry/calling.h |  9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 7436d4a..a4c061f 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,7 +63,7 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
+.macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
 	movq	8(%rsp), %rsi	/* temporarily store the return address in %rsi */
@@ -90,7 +90,9 @@ For 32-bit we have the following conventions - kernel is built with
 	.if \save_ret
 	pushq	%rsi		/* return address on top of stack */
 	.endif
+.endm
 
+.macro CLEAR_REGS
 	/*
 	 * Sanitize registers of values that a speculation attack might
 	 * otherwise want to exploit. The lower registers are likely clobbered
@@ -112,6 +114,11 @@ For 32-bit we have the following conventions - kernel is built with
 
 .endm
 
+.macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
+	PUSH_REGS rdx=\rdx, rax=\rax, save_ret=\save_ret
+	CLEAR_REGS
+.endm
+
 .macro POP_REGS pop_rdi=1 skip_r11rcx=0
 	popq %r15
 	popq %r14

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: x86/asm] x86/syscall: Maximize MSR_SYSCALL_MASK
  2021-05-10 18:53 ` [RFC v2 PATCH 4/7] x86/syscall: maximize MSR_SYSCALL_MASK H. Peter Anvin
@ 2021-05-12  9:23   ` tip-bot2 for H. Peter Anvin (Intel)
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for H. Peter Anvin (Intel) @ 2021-05-12  9:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: H. Peter Anvin (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     6de4ac1d03f75248974a398110b15af0bfe65a11
Gitweb:        https://git.kernel.org/tip/6de4ac1d03f75248974a398110b15af0bfe65a11
Author:        H. Peter Anvin (Intel) <hpa@zytor.com>
AuthorDate:    Mon, 10 May 2021 11:53:13 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 12 May 2021 10:49:15 +02:00

x86/syscall: Maximize MSR_SYSCALL_MASK

It is better to clear as many flags as possible when we do a system
call entry, as opposed to the other way around. The fewer flags we
keep, the lesser the possible interference between the kernel and user
space.

The flags changed are:

 - CF, PF, AF, ZF, SF, OF: these are arithmetic flags which affect
   branches, possibly speculatively. They should be cleared for the same
   reasons we now clear all GPRs on entry.

 - RF: suppresses a code breakpoint on the subsequent instruction. It is
   probably impossible to enter the kernel with RF set, but if it is
   somehow not, it would break a kernel debugger setting a breakpoint on
   the entry point. Either way, user space should not be able to control
   kernel behavior here.

 - ID: this flag has no direct effect (it is a scratch bit only.)
   However, there is no reason to retain the user space value in the
   kernel, and the standard should be to clear unless needed, not the
   other way around.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-5-hpa@zytor.com
---
 arch/x86/kernel/cpu/common.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a1b756c..6cf6975 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1773,10 +1773,16 @@ void syscall_init(void)
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
 #endif
 
-	/* Flags to clear on syscall */
+	/*
+	 * Flags to clear on syscall; clear as much as possible
+	 * to minimize user space-kernel interference.
+	 */
 	wrmsrl(MSR_SYSCALL_MASK,
-	       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
-	       X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
+	       X86_EFLAGS_CF|X86_EFLAGS_PF|X86_EFLAGS_AF|
+	       X86_EFLAGS_ZF|X86_EFLAGS_SF|X86_EFLAGS_TF|
+	       X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
+	       X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
+	       X86_EFLAGS_AC|X86_EFLAGS_ID);
 }
 
 #else	/* CONFIG_X86_64 */

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: x86/asm] x86/entry: Reverse arguments to do_syscall_64()
  2021-05-10 18:53 ` [RFC v2 PATCH 2/7] x86/entry: reverse arguments to do_syscall_64() H. Peter Anvin
@ 2021-05-12  9:23   ` tip-bot2 for H. Peter Anvin (Intel)
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for H. Peter Anvin (Intel) @ 2021-05-12  9:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: H. Peter Anvin (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     3e5e7f7736b05d5fdf2cc4e0ba4f2d8bc42c630d
Gitweb:        https://git.kernel.org/tip/3e5e7f7736b05d5fdf2cc4e0ba4f2d8bc42c630d
Author:        H. Peter Anvin (Intel) <hpa@zytor.com>
AuthorDate:    Mon, 10 May 2021 11:53:11 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 12 May 2021 10:49:14 +02:00

x86/entry: Reverse arguments to do_syscall_64()

Reverse the order of arguments to do_syscall_64() so that the first
argument is the pt_regs pointer. This is not only consistent with
*all* other entry points from assembly, but it actually makes the
compiled code slightly better.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-3-hpa@zytor.com
---
 arch/x86/entry/common.c        | 2 +-
 arch/x86/entry/entry_64.S      | 4 ++--
 arch/x86/include/asm/syscall.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7b2542b..00da0f5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -36,7 +36,7 @@
 #include <asm/irq_stack.h>
 
 #ifdef CONFIG_X86_64
-__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
+__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
 {
 	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a16a529..1d9db15 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -107,8 +107,8 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
 
 	/* IRQs are off. */
-	movq	%rax, %rdi
-	movq	%rsp, %rsi
+	movq	%rsp, %rdi
+	movq	%rax, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
 	/*
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7cbf733..4e20054 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -160,7 +160,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-void do_syscall_64(unsigned long nr, struct pt_regs *regs);
+void do_syscall_64(struct pt_regs *regs, unsigned long nr);
 void do_int80_syscall_32(struct pt_regs *regs);
 long do_fast_syscall_32(struct pt_regs *regs);
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: x86/asm] x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]
  2021-05-10 18:53 ` [RFC v2 PATCH 3/7] x86/syscall: unconditionally prototype {ia32,x32}_sys_call_table[] H. Peter Anvin
@ 2021-05-12  9:23   ` tip-bot2 for H. Peter Anvin (Intel)
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for H. Peter Anvin (Intel) @ 2021-05-12  9:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: H. Peter Anvin (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     dce0aa3b2ef28900cc4c779c59a870f1b4bdadee
Gitweb:        https://git.kernel.org/tip/dce0aa3b2ef28900cc4c779c59a870f1b4bdadee
Author:        H. Peter Anvin (Intel) <hpa@zytor.com>
AuthorDate:    Mon, 10 May 2021 11:53:12 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 12 May 2021 10:49:15 +02:00

x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]

Even if these APIs are disabled, and the arrays therefore do not
exist, having the prototypes allows us to use IS_ENABLED() rather than
using #ifdefs.

If something ends up trying to actually *use* these arrays a linker
error will ensue.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-4-hpa@zytor.com
---
 arch/x86/include/asm/syscall.h | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 4e20054..f6593ca 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -21,13 +21,12 @@ extern const sys_call_ptr_t sys_call_table[];
 
 #if defined(CONFIG_X86_32)
 #define ia32_sys_call_table sys_call_table
-#endif
-
-#if defined(CONFIG_IA32_EMULATION)
+#else
+/*
+ * These may not exist, but still put the prototypes in so we
+ * can use IS_ENABLED().
+ */
 extern const sys_call_ptr_t ia32_sys_call_table[];
-#endif
-
-#ifdef CONFIG_X86_X32_ABI
 extern const sys_call_ptr_t x32_sys_call_table[];
 #endif
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [tip: x86/asm] x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h>
  2021-05-10 18:53 ` [RFC v2 PATCH 1/7] x86/entry: unify definitions from calling.h and ptrace-abi.h H. Peter Anvin
@ 2021-05-12  9:23   ` tip-bot2 for H. Peter Anvin (Intel)
  0 siblings, 0 replies; 25+ messages in thread
From: tip-bot2 for H. Peter Anvin (Intel) @ 2021-05-12  9:23 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: H. Peter Anvin (Intel), Ingo Molnar, x86, linux-kernel

The following commit has been merged into the x86/asm branch of tip:

Commit-ID:     6627eb25e40cc8d135d3f8d5391851d18ac497d7
Gitweb:        https://git.kernel.org/tip/6627eb25e40cc8d135d3f8d5391851d18ac497d7
Author:        H. Peter Anvin (Intel) <hpa@zytor.com>
AuthorDate:    Mon, 10 May 2021 11:53:10 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Wed, 12 May 2021 10:49:13 +02:00

x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h>

The register offsets in <asm/ptrace-abi.h> are duplicated in
entry/calling.h, but are formatted differently and therefore not
compatible. Use the version from <asm/ptrace-abi.h> consistently.

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-2-hpa@zytor.com
---
 arch/x86/entry/calling.h  | 36 +-----------------------------------
 arch/x86/kernel/head_64.S |  6 +++---
 2 files changed, 4 insertions(+), 38 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 07a9331..7436d4a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,7 @@
 #include <asm/percpu.h>
 #include <asm/asm-offsets.h>
 #include <asm/processor-flags.h>
+#include <asm/ptrace-abi.h>
 
 /*
 
@@ -62,41 +63,6 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
-/* The layout forms the "struct pt_regs" on the stack: */
-/*
- * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
- * unless syscall needs a complete, fully filled "struct pt_regs".
- */
-#define R15		0*8
-#define R14		1*8
-#define R13		2*8
-#define R12		3*8
-#define RBP		4*8
-#define RBX		5*8
-/* These regs are callee-clobbered. Always saved on kernel entry. */
-#define R11		6*8
-#define R10		7*8
-#define R9		8*8
-#define R8		9*8
-#define RAX		10*8
-#define RCX		11*8
-#define RDX		12*8
-#define RSI		13*8
-#define RDI		14*8
-/*
- * On syscall entry, this is syscall#. On CPU exception, this is error code.
- * On hw interrupt, it's IRQ number:
- */
-#define ORIG_RAX	15*8
-/* Return frame for iretq */
-#define RIP		16*8
-#define CS		17*8
-#define EFLAGS		18*8
-#define RSP		19*8
-#define SS		20*8
-
-#define SIZEOF_PTREGS	21*8
-
 .macro PUSH_AND_CLEAR_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 04bddaa..d8b3ebd 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -62,7 +62,7 @@ SYM_CODE_START_NOALIGN(startup_64)
 	 */
 
 	/* Set up the stack for verify_cpu(), similar to initial_stack below */
-	leaq	(__end_init_task - SIZEOF_PTREGS)(%rip), %rsp
+	leaq	(__end_init_task - FRAME_SIZE)(%rip), %rsp
 
 	leaq	_text(%rip), %rdi
 	pushq	%rsi
@@ -343,10 +343,10 @@ SYM_DATA(initial_vc_handler,	.quad handle_vc_boot_ghcb)
 #endif
 
 /*
- * The SIZEOF_PTREGS gap is a convention which helps the in-kernel unwinder
+ * The FRAME_SIZE gap is a convention which helps the in-kernel unwinder
  * reliably detect the end of the stack.
  */
-SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS)
+SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - FRAME_SIZE)
 	__FINITDATA
 
 	__INIT

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-10 18:53 ` [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs H. Peter Anvin
  2021-05-12  8:51   ` Ingo Molnar
@ 2021-05-12 12:09   ` Thomas Gleixner
  2021-05-12 18:21     ` H. Peter Anvin
  1 sibling, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2021-05-12 12:09 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: H. Peter Anvin, Linux Kernel Mailing List

On Mon, May 10 2021 at 11:53, H. Peter Anvin wrote:
> Redefine the system call number consistently to be "int". The value -1
> is a non-system call (which can be poked in by ptrace/seccomp to
> indicate that no further processing should be done and that the return
> value should be the current value in regs->ax, default to -ENOSYS; any
> other value which does not correspond to a valid system call
> unconditionally calls sys_ni_syscall() and returns -ENOSYS just like
> any system call that corresponds to a hole in the system call table.

That sentence spawns 6 lines, has a unmatched ( inside and is confusing
at best. I know what you want to say, but heck...

> This is the defined semantics of syscall_get_nr(), so that is what all
> the architecture-independent code already expects.  As documented in
> <asm-generic/syscall.h> (which is simply the documentation file for
> <asm/syscall.h>):
>
> /**
>  * syscall_get_nr - find what system call a task is executing
>  * @task:       task of interest, must be blocked
>  * @regs:       task_pt_regs() of @task
>  *
>  * If @task is executing a system call or is at system call
>  * tracing about to attempt one, returns the system call number.
>  * If @task is not executing a system call, i.e. it's blocked
>  * inside the kernel for a fault or signal, returns -1.
>  *
>  * Note this returns int even on 64-bit machines.  Only 32 bits of
>  * system call number can be meaningful.  If the actual arch value
>  * is 64 bits, this truncates to 32 bits so 0xffffffff means -1.
>  *
>  * It's only valid to call this when @task is known to be blocked.
>  */
> int syscall_get_nr(struct task_struct *task, struct pt_regs *regs);

No need for copying this comment. Something like this is sufficient:

The syscall number has to be an 'int' as defined by syscall_get_nr().

Aside of that the subject says:

      x86/entry: use int for syscall number; handle all invalid syscall nrs

which suggests that something is not handled correctly today. But the
changelog does not say anything about it.

>  
>  #ifdef CONFIG_X86_64
> -__visible noinstr void do_syscall_64(struct pt_regs *regs, unsigned long nr)
> +
> +static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
> +{
> +	unsigned long unr = nr;

What's the point of this cast? Turn -1 into something larger than
NR_SYSCALLS, right? Comments exist for a reason.

Also why unsigned long? unsigned int is sufficient

> +	if (likely(unr < NR_syscalls)) {
> +		unr = array_index_nospec(unr, NR_syscalls);
> +		regs->ax = sys_call_table[unr](regs);
> +		return true;
> +	}
> +	return false;
> +}

Something like this:

static __always_inline bool do_syscall_x64(struct pt_regs *regs, unsigned int nr)
{
        /* nr is unsigned so it catches 
	if (likely(nr < NR_syscalls)) {
		nr = array_index_nospec(nr, NR_syscalls);
		regs->ax = sys_call_table[nr](regs);
		return true;
	}
	return false;
}

static __always_inline bool do_syscall_x32(struct pt_regs *regs, unsigned int nr)
{
        /*
         * If nr < __X32_SYSCALL_BIT then the result will be > __X32_SYSCALL_BIT
         * due to unsigned math.
         */
	nr -= __X32_SYSCALL_BIT;

	if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(nr < X32_NR_syscalls)) {
        	nr = array_index_nospec(nr, X32_NR_syscalls);
		regs->ax = x32_sys_call_table[nr](regs);
		return true;
	}
	return false;
}

> index 1d9db15fdc69..85f04ea0e368 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -108,7 +108,7 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
>  
>  	/* IRQs are off. */
>  	movq	%rsp, %rdi
> -	movq	%rax, %rsi
> +	movslq	%eax, %rsi

This is wrong.

  syscall(long number,...);

So the above turns syscall(UINT_MAX + N, ...) into syscall(N, ...).

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12  8:51   ` Ingo Molnar
@ 2021-05-12 17:50     ` H. Peter Anvin
  0 siblings, 0 replies; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-12 17:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andy Lutomirski,
	Linux Kernel Mailing List

On 5/12/21 1:51 AM, Ingo Molnar wrote:
> 
> I've applied patches 1-6, thanks Peter!
> 
> Wrt. patch #7 - the changelog is hedging things a bit and the changes are
> non-trivial. Does this patch (intend to) change any actual observable
> behavior in the system call interface, and if yes, in which areas?
> 
> Or is this a pure cleanup with no observable changes expected?
> 

I'll clean up the comments a bit [including per tglx' review.] I'm 
writing this email in part to organize my own thoughts in how to better 
explain the motivation for the patch, and the extent of visible differences.

Right now, *some* code will treat e.g. 0x0000000100000001 as a system 
call and some will not. Some of the code, notably in ptrace, will treat 
0x000000018000000 as a system call and some will not. Finally, right 
now, e.g. 335 for x86-64 will force the exit code to be set to -ENOSYS 
even if poked by ptrace, but 548 will not, because there is an 
observable difference between an out of range system call and a system 
call number that falls outside the range of the table.

The use of a non-system-call number in a system call can come in in a 
few ways:

    1. Via ptrace;
    2. From seccomp;
    3. By explicitly passing %eax = -1 to a system call.

#3 isn't really a problem *unless* it for some reason confuses ptrace or 
seccomp users -- we could do an early-out for it.

For ptrace and seccomp, we enter with the return value (regs->ax) set to 
-ENOSYS, regardless of if the system call number is valid or not. 
ptrace/seccomp has the option of independently emulate a system call, 
then set regs->orig_ax to -1 and provide any chosen return value in 
regs->ax. In that case, we must *not* update regs->ax to -ENOSYS before 
returning.

The arch-independent code all assumes that a system call is "int" that 
the value -1 specifically and not just any negative value is used for a 
non-system call. This is the case on x86 as well when arch-independent 
code is involved. The arch-independent API is defined/documented (but 
not *implemented*!) in <asm-generic/syscall.h>, and what this patch is 
intended to do is to make the x86-specific code follow.

As everyone either does or should know, treating the same data in 
different ways in different flows is a security hole just waiting to 
happen.

Most of these are relatively recently introduced bugs in x86-64; the 
original assembly code version zero-extended %rax, compared it against 
the length of the system call table, and would unconditionally return 
-ENOSYS otherwise. Then *at least* on two separate occasions someone 
"optimized" it by removing "movl %eax, %eax" not understanding that this 
is not a noop in x86-64 but a zero-extend (perhaps gas ought to be able 
to allow movzlq as an alias...) introducing a critical security bug 
which was one of the motivators for the SMAP CPU feature.

On x86-64, the ABI is that the callee is responsible for extending 
parameters, so only examining the lower 32 bits is fully consistent with 
any "int" argument to any system call, e.g. regs->di for write(2).

Andy L. and I had a fairly long discussion about this, and some of this 
was updated after the first RFC, but I fully agree I'm not capturing it 
all that well. I hope the above points clear things up better and I'll 
rewrite this into a better patch description.

	-hpa

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12 12:09   ` Thomas Gleixner
@ 2021-05-12 18:21     ` H. Peter Anvin
  2021-05-12 18:34       ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-12 18:21 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: Linux Kernel Mailing List

On 5/12/21 5:09 AM, Thomas Gleixner wrote:
> 
>> index 1d9db15fdc69..85f04ea0e368 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -108,7 +108,7 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
>>   
>>   	/* IRQs are off. */
>>   	movq	%rsp, %rdi
>> -	movq	%rax, %rsi
>> +	movslq	%eax, %rsi
> 
> This is wrong.
> 
>    syscall(long number,...);
> 
> So the above turns syscall(UINT_MAX + N, ...) into syscall(N, ...).
> 

That is intentional, as (again) system calls are int. As stated in my 
reply to Ingo, I'll clean the various descriptions and try to capture 
the discussion better.

	-hpa


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12 18:21     ` H. Peter Anvin
@ 2021-05-12 18:34       ` Thomas Gleixner
  2021-05-12 22:09         ` H. Peter Anvin
  0 siblings, 1 reply; 25+ messages in thread
From: Thomas Gleixner @ 2021-05-12 18:34 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: Linux Kernel Mailing List

On Wed, May 12 2021 at 11:21, H. Peter Anvin wrote:
> On 5/12/21 5:09 AM, Thomas Gleixner wrote:
>> 
>>> index 1d9db15fdc69..85f04ea0e368 100644
>>> --- a/arch/x86/entry/entry_64.S
>>> +++ b/arch/x86/entry/entry_64.S
>>> @@ -108,7 +108,7 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
>>>   
>>>   	/* IRQs are off. */
>>>   	movq	%rsp, %rdi
>>> -	movq	%rax, %rsi
>>> +	movslq	%eax, %rsi
>> 
>> This is wrong.
>> 
>>    syscall(long number,...);
>> 
>> So the above turns syscall(UINT_MAX + N, ...) into syscall(N, ...).
>> 
>
> That is intentional, as (again) system calls are int.

They are 'int' kernel internally, but _NOT_ at the user space visible
side. Again: man syscall

    syscall(long number,...);

So that results in a user ABI change.

> As stated in my reply to Ingo, I'll clean the various descriptions and
> try to capture the discussion better.

If we agree to go there then this wants to be a seperate commit which
does nothing else than changing this behaviour.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12 18:34       ` Thomas Gleixner
@ 2021-05-12 22:09         ` H. Peter Anvin
  2021-05-12 22:22           ` Thomas Gleixner
  0 siblings, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-12 22:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: Linux Kernel Mailing List

On 5/12/21 11:34 AM, Thomas Gleixner wrote:
>>
>> That is intentional, as (again) system calls are int.
> 
> They are 'int' kernel internally, but _NOT_ at the user space visible
> side. Again: man syscall
> 
>      syscall(long number,...);
> 
> So that results in a user ABI change.
> 
>> As stated in my reply to Ingo, I'll clean the various descriptions and
>> try to capture the discussion better.
> 
> If we agree to go there then this wants to be a seperate commit which
> does nothing else than changing this behaviour.
> 

Good idea.

As far as this being a user ABI change, this is actually a revert to the 
original x86-64 ABI; see my message to Ingo.

	-hpa


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12 22:09         ` H. Peter Anvin
@ 2021-05-12 22:22           ` Thomas Gleixner
  2021-05-12 22:24             ` H. Peter Anvin
  2021-05-14  0:38             ` H. Peter Anvin
  0 siblings, 2 replies; 25+ messages in thread
From: Thomas Gleixner @ 2021-05-12 22:22 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: Linux Kernel Mailing List

On Wed, May 12 2021 at 15:09, H. Peter Anvin wrote:
> On 5/12/21 11:34 AM, Thomas Gleixner wrote:
>>>
>>> That is intentional, as (again) system calls are int.
>> 
>> They are 'int' kernel internally, but _NOT_ at the user space visible
>> side. Again: man syscall
>> 
>>      syscall(long number,...);
>> 
>> So that results in a user ABI change.
>> 
>>> As stated in my reply to Ingo, I'll clean the various descriptions and
>>> try to capture the discussion better.
>> 
>> If we agree to go there then this wants to be a seperate commit which
>> does nothing else than changing this behaviour.
>> 
>
> Good idea.
>
> As far as this being a user ABI change, this is actually a revert to the 
> original x86-64 ABI; see my message to Ingo.

I'm not against that change, but it has to be well justified and the
reasoning wants to be in the changelog. You know the drill :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12 22:22           ` Thomas Gleixner
@ 2021-05-12 22:24             ` H. Peter Anvin
  2021-05-14  0:38             ` H. Peter Anvin
  1 sibling, 0 replies; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-12 22:24 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: Linux Kernel Mailing List

Yes, indeed. I hope my reply to Ingo clarifies it as I'm going to try to wordsmith that into a better piece of text.

On May 12, 2021 3:22:05 PM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Wed, May 12 2021 at 15:09, H. Peter Anvin wrote:
>> On 5/12/21 11:34 AM, Thomas Gleixner wrote:
>>>>
>>>> That is intentional, as (again) system calls are int.
>>> 
>>> They are 'int' kernel internally, but _NOT_ at the user space
>visible
>>> side. Again: man syscall
>>> 
>>>      syscall(long number,...);
>>> 
>>> So that results in a user ABI change.
>>> 
>>>> As stated in my reply to Ingo, I'll clean the various descriptions
>and
>>>> try to capture the discussion better.
>>> 
>>> If we agree to go there then this wants to be a seperate commit
>which
>>> does nothing else than changing this behaviour.
>>> 
>>
>> Good idea.
>>
>> As far as this being a user ABI change, this is actually a revert to
>the 
>> original x86-64 ABI; see my message to Ingo.
>
>I'm not against that change, but it has to be well justified and the
>reasoning wants to be in the changelog. You know the drill :)
>
>Thanks,
>
>        tglx

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-12 22:22           ` Thomas Gleixner
  2021-05-12 22:24             ` H. Peter Anvin
@ 2021-05-14  0:38             ` H. Peter Anvin
  2021-05-14  3:18               ` Andy Lutomirski
  1 sibling, 1 reply; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-14  0:38 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andy Lutomirski
  Cc: Linux Kernel Mailing List

On 5/12/21 3:22 PM, Thomas Gleixner wrote:
>>
>> As far as this being a user ABI change, this is actually a revert to the
>> original x86-64 ABI; see my message to Ingo.
> 
> I'm not against that change, but it has to be well justified and the
> reasoning wants to be in the changelog. You know the drill :)
> 

FYI:

So in the process of breaking up and better document this patch, I have 
looked at the syscall_numbering_64 (and have rewritten it to be more 
complete.)

I found that running it under strace fails, as strace (possibly ptrace, 
possibly the strace binary) causes %rax = 2^32 to be clobbered to zero 
already...

More motivation, I guess.

	-hpa


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-14  0:38             ` H. Peter Anvin
@ 2021-05-14  3:18               ` Andy Lutomirski
  2021-05-14  3:23                 ` H. Peter Anvin
  0 siblings, 1 reply; 25+ messages in thread
From: Andy Lutomirski @ 2021-05-14  3:18 UTC (permalink / raw)
  To: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski
  Cc: Linux Kernel Mailing List

On 5/13/21 5:38 PM, H. Peter Anvin wrote:
> On 5/12/21 3:22 PM, Thomas Gleixner wrote:
>>>
>>> As far as this being a user ABI change, this is actually a revert to the
>>> original x86-64 ABI; see my message to Ingo.
>>
>> I'm not against that change, but it has to be well justified and the
>> reasoning wants to be in the changelog. You know the drill :)
>>
> 
> FYI:
> 
> So in the process of breaking up and better document this patch, I have
> looked at the syscall_numbering_64 (and have rewritten it to be more
> complete.)
> 
> I found that running it under strace fails, as strace (possibly ptrace,
> possibly the strace binary) causes %rax = 2^32 to be clobbered to zero
> already...
> 
> More motivation, I guess.
> 

Indeed.

I would love to go back in time and switch to long, but there are plenty
of things that use int now.  I suppose we could try to make it long for
real, but seccomp has u32 baked into its ABI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs
  2021-05-14  3:18               ` Andy Lutomirski
@ 2021-05-14  3:23                 ` H. Peter Anvin
  0 siblings, 0 replies; 25+ messages in thread
From: H. Peter Anvin @ 2021-05-14  3:23 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Andy Lutomirski
  Cc: Linux Kernel Mailing List

Yeah. Also, x32 long is 32 bits...

On May 13, 2021 8:18:37 PM PDT, Andy Lutomirski <andy@linux.luto.us> wrote:
>On 5/13/21 5:38 PM, H. Peter Anvin wrote:
>> On 5/12/21 3:22 PM, Thomas Gleixner wrote:
>>>>
>>>> As far as this being a user ABI change, this is actually a revert
>to the
>>>> original x86-64 ABI; see my message to Ingo.
>>>
>>> I'm not against that change, but it has to be well justified and the
>>> reasoning wants to be in the changelog. You know the drill :)
>>>
>> 
>> FYI:
>> 
>> So in the process of breaking up and better document this patch, I
>have
>> looked at the syscall_numbering_64 (and have rewritten it to be more
>> complete.)
>> 
>> I found that running it under strace fails, as strace (possibly
>ptrace,
>> possibly the strace binary) causes %rax = 2^32 to be clobbered to
>zero
>> already...
>> 
>> More motivation, I guess.
>> 
>
>Indeed.
>
>I would love to go back in time and switch to long, but there are
>plenty
>of things that use int now.  I suppose we could try to make it long for
>real, but seccomp has u32 baked into its ABI.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-05-14  3:23 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-10 18:53 [RFC v2 PATCH 0/6] x86/entry: cleanups and consistent syscall number handling H. Peter Anvin
2021-05-10 18:53 ` [RFC v2 PATCH 1/7] x86/entry: unify definitions from calling.h and ptrace-abi.h H. Peter Anvin
2021-05-12  9:23   ` [tip: x86/asm] x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h> tip-bot2 for H. Peter Anvin (Intel)
2021-05-10 18:53 ` [RFC v2 PATCH 2/7] x86/entry: reverse arguments to do_syscall_64() H. Peter Anvin
2021-05-12  9:23   ` [tip: x86/asm] x86/entry: Reverse " tip-bot2 for H. Peter Anvin (Intel)
2021-05-10 18:53 ` [RFC v2 PATCH 3/7] x86/syscall: unconditionally prototype {ia32,x32}_sys_call_table[] H. Peter Anvin
2021-05-12  9:23   ` [tip: x86/asm] x86/syscall: Unconditionally " tip-bot2 for H. Peter Anvin (Intel)
2021-05-10 18:53 ` [RFC v2 PATCH 4/7] x86/syscall: maximize MSR_SYSCALL_MASK H. Peter Anvin
2021-05-12  9:23   ` [tip: x86/asm] x86/syscall: Maximize MSR_SYSCALL_MASK tip-bot2 for H. Peter Anvin (Intel)
2021-05-10 18:53 ` [RFC v2 PATCH 5/7] x86/entry: split PUSH_AND_CLEAR_REGS into two submacros H. Peter Anvin
2021-05-12  9:23   ` [tip: x86/asm] x86/entry: Split " tip-bot2 for H. Peter Anvin (Intel)
2021-05-10 18:53 ` [RFC v2 PATCH 6/7] x86/regs: syscall_get_nr() returns -1 for a non-system call H. Peter Anvin
2021-05-12  9:23   ` [tip: x86/asm] x86/regs: Syscall_get_nr() " tip-bot2 for H. Peter Anvin
2021-05-10 18:53 ` [RFC v2 PATCH 7/7] x86/entry: use int for syscall number; handle all invalid syscall nrs H. Peter Anvin
2021-05-12  8:51   ` Ingo Molnar
2021-05-12 17:50     ` H. Peter Anvin
2021-05-12 12:09   ` Thomas Gleixner
2021-05-12 18:21     ` H. Peter Anvin
2021-05-12 18:34       ` Thomas Gleixner
2021-05-12 22:09         ` H. Peter Anvin
2021-05-12 22:22           ` Thomas Gleixner
2021-05-12 22:24             ` H. Peter Anvin
2021-05-14  0:38             ` H. Peter Anvin
2021-05-14  3:18               ` Andy Lutomirski
2021-05-14  3:23                 ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).