From: Ali Raza <aliraza@bu.edu>
To: linux-kernel@vger.kernel.org
Cc: corbet@lwn.net, masahiroy@kernel.org, michal.lkml@markovi.net,
	ndesaulniers@google.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	luto@kernel.org, ebiederm@xmission.com, keescook@chromium.org,
	peterz@infradead.org, viro@zeniv.linux.org.uk, arnd@arndb.de,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, bristot@redhat.com,
	vschneid@redhat.com, pbonzini@redhat.com, jpoimboe@kernel.org,
	linux-doc@vger.kernel.org, linux-kbuild@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-arch@vger.kernel.org, x86@kernel.org, rjones@redhat.com,
	munsoner@bu.edu, tommyu@bu.edu, drepper@redhat.com,
	lwoodman@redhat.com, mboydmcse@gmail.com, okrieg@bu.edu,
	rmancuso@bu.edu, Ali Raza <aliraza@bu.edu>,
	Daniel Bristot de Oliveira <bristot@kernel.org>
Subject: [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls
Date: Mon,  3 Oct 2022 18:21:27 -0400
Message-ID: <20221003222133.20948-5-aliraza@bu.edu>
In-Reply-To: <20221003222133.20948-1-aliraza@bu.edu>

When a UKL application makes a system call, it does not execute the
syscall instruction. Instead, the application uses a call instruction to
reach the kernel entry point. Rather than adding checks to the normal
entry_SYSCALL_64 to tell a UKL task from a normal application task, we
create a separate entry point, ukl_entry_SYSCALL_64. This leaves the
normal entry point unchanged and simplifies the UKL-specific code.

ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64, except that it has to
fill in the saved return address manually, reading it from the user stack
where the call instruction pushed it (the syscall instruction supplies it
in %rcx automatically for normal application tasks). This keeps pt_regs
correct. We also have to push the flags onto the user stack, because on
the return path we first switch to the user stack, then pop the flags, and
then return. Popping the flags re-enables interrupts, so we don't want to
still be on the kernel stack when an interrupt hits. All of this could be
done with an iret instruction, but a call/iret pair performs much slower
than a call/ret pair.
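
As a rough illustration only (this is not code from the patch), the
fixup and the popfq/retq return described above can be modelled in plain
C. The struct ukl_frame type, the function names, and the use of an array
index in place of a real stack pointer are all assumptions made for the
sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few pt_regs fields the fixup touches. */
struct ukl_frame {
	uint64_t ip;    /* models pt_regs->ip */
	uint64_t flags; /* models pt_regs->flags */
	uint64_t sp;    /* models pt_regs->sp, as an index into a word array */
};

/*
 * Model of the entry fixup. `stack` is the user stack viewed as an array
 * of 64-bit words; frame->sp indexes the slot that holds the return
 * address pushed by the call instruction.
 */
static void ukl_fixup_frame(struct ukl_frame *frame, uint64_t *stack)
{
	assert(frame->sp > 0);           /* need one free slot below sp */
	frame->ip = stack[frame->sp];    /* return address becomes saved ip */
	frame->sp -= 1;                  /* stack grows down by one word */
	stack[frame->sp] = frame->flags; /* flags for the later popfq */
}

/* Model of the exit path: popfq, then retq, both from the user stack. */
static uint64_t ukl_return(struct ukl_frame *frame, const uint64_t *stack,
			   uint64_t *flags_out)
{
	*flags_out = stack[frame->sp++]; /* popfq */
	return stack[frame->sp++];       /* retq: next word is the target */
}
```

After ukl_return() the modelled stack pointer sits one word above the
original return-address slot, which is where a plain call/ret pair would
leave it.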

Also, on the entry path, we set the context flag, in_user, to 1 to
indicate that we are now in kernel context, so any new interrupts don't
have to go through the kernel entry code again. This is normally
determined from the CS value on the stack, but in the UKL case that is
always a kernel value. On the way back, in_user is switched back to 2 to
indicate that application context is being entered. All non-UKL tasks have
in_user set to 0.
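
The three in_user values can be pictured as a small state machine. The
sketch below is a model under stated assumptions, not the helpers from
this series: the enum names, struct task_model, and the model_* functions
are hypothetical, and they assume the helpers are only ever called for
UKL tasks.

```c
#include <assert.h>

/* Values taken from the description above; the names are illustrative. */
enum ukl_ctx {
	UKL_NOT_UKL = 0, /* non-UKL task */
	UKL_KERNEL  = 1, /* UKL task executing kernel code */
	UKL_APP     = 2, /* UKL task executing application code */
};

/* Hypothetical stand-in for the per-task field. */
struct task_model {
	int in_user;
};

/* Models the transition made on an entry path (application to kernel). */
static void model_enter_kernel(struct task_model *t)
{
	assert(t->in_user != UKL_NOT_UKL); /* assumed: UKL tasks only */
	t->in_user = UKL_KERNEL;
}

/* Models the transition made on a return path (kernel to application). */
static void model_enter_user(struct task_model *t)
{
	assert(t->in_user != UKL_NOT_UKL); /* assumed: UKL tasks only */
	t->in_user = UKL_APP;
}
```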

The UKL application uses a slightly different value for CS: instead of
0x33, we use 0xC3. Most of the tests look only at the low bits of the
least significant nibble (for example, testb $3), so they behave as
expected. The C in the second nibble lets us distinguish between user
space and UKL application code.
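
A minimal sketch of why both CS values pass the usual user-mode test
while remaining distinguishable. The macro and function names are
hypothetical; only the values 0x33 and 0xC3 and the `testb $3` style
check come from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define USER_CS_VALUE 0x33u /* regular user-space CS, per the text above */
#define UKL_CS_VALUE  0xC3u /* CS value used for UKL application frames */

/* Both values must have nonzero low two bits to look like user mode. */
static_assert((USER_CS_VALUE & 3) != 0, "user CS must look like user mode");
static_assert((UKL_CS_VALUE & 3) != 0, "UKL CS must look like user mode");

/* Same test as `testb $3, CS(%rsp)`: true if the low two bits are set. */
static bool cs_looks_like_user(uint64_t cs)
{
	return (cs & 3) != 0;
}

/* The second nibble is what separates a UKL frame from a user frame. */
static bool cs_is_ukl_app(uint64_t cs)
{
	return cs == UKL_CS_VALUE;
}
```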

The rest of the code makes sure that the in_user context tracking
described above is done for all entry and exit cases, i.e., interrupts,
exceptions, etc. For a UKL task, if in_user is 2 we treat the entry as
coming from an application task, and if it is 1 we treat it as coming
from kernel context. We skip these checks if in_user is 0.
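
The decision that error_entry makes under CONFIG_UNIKERNEL_LINUX can be
summarised in C roughly as follows. This is a sketch of the control flow
only; the enum and the classify_entry() name are hypothetical and do not
exist in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the three paths taken by error_entry. */
enum entry_path {
	ENTRY_FROM_USER,    /* real user thread: swapgs, then kernel CR3 */
	ENTRY_FROM_KERNEL,  /* kernel context: .Lerror_kernelspace */
	ENTRY_FROM_UKL_APP, /* UKL application: mark CS, skip swapgs */
};

/*
 * cs is the saved CS from the exception frame and in_user is the value
 * returned by is_ukl_thread() (0, 1 or 2).
 */
static enum entry_path classify_entry(uint64_t cs, int in_user)
{
	assert(in_user >= 0 && in_user <= 2);

	if (cs & 3)              /* testb $3, CS+8(%rsp); jnz 1f */
		return ENTRY_FROM_USER;
	if (in_user < 2)         /* cmpq $2, %rax; jb .Lerror_kernelspace */
		return ENTRY_FROM_KERNEL;
	return ENTRY_FROM_UKL_APP;
}
```

In the UKL application case the assembly additionally rewrites the saved
CS to 0xC3 and calls enter_ukl_kernel before joining the common path.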

The changes to swapgs_restore_regs_and_return_to_usermode also make sure
that in_user is correct before we iret back.

Double fault handling is a special case. Normally, if a user stack suffers a
page fault, hardware switches to a kernel stack and pushes a frame onto the
kernel stack. This switch only happens if the execution was in user
privilege level when the page fault occurred. For UKL, execution is always
in kernel level, so when the user stack suffers a page fault, no switch to
a pinned kernel stack happens, and hardware tries to push state on the
already faulting user stack. This generates a double fault. So we handle
this case in the double fault handler by assuming any double fault is
actually a user stack page fault. This can also be fixed by making all page
faults go through a pinned stack using the IST mechanism. We have tried and
tested that, but in the interest of touching as little code as possible, we
chose this option instead.
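
For reference, the double fault path in this patch overwrites the error
code slot with 0x2 before jumping to asm_exc_page_fault. The sketch below
decodes that value using the architectural low bits of the x86 page-fault
error code; the macro names, struct pf_info, and decode_pf() are
illustrative and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low three bits of the x86 page-fault error code. */
#define PF_PROT  (1u << 0) /* 0: page not present, 1: protection violation */
#define PF_WRITE (1u << 1) /* 0: read access, 1: write access */
#define PF_USER  (1u << 2) /* 0: supervisor-mode access, 1: user-mode access */

/* Error code stored by the patch before the jump to the page-fault path. */
#define UKL_DF_AS_PF_CODE 0x2u

static_assert((UKL_DF_AS_PF_CODE & ~(PF_PROT | PF_WRITE | PF_USER)) == 0,
	      "only the low three bits are modelled here");

struct pf_info {
	bool protection;
	bool write;
	bool user;
};

/* Split an error code into the three flags modelled above. */
static struct pf_info decode_pf(uint32_t code)
{
	struct pf_info info = {
		.protection = (code & PF_PROT) != 0,
		.write      = (code & PF_WRITE) != 0,
		.user       = (code & PF_USER) != 0,
	};
	return info;
}
```

Decoded this way, 0x2 describes a write to a not-present page made from
supervisor mode.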

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Co-developed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Co-developed-by: Thomas Unger <tommyu@bu.edu>
Signed-off-by: Thomas Unger <tommyu@bu.edu>
Co-developed-by: Ali Raza <aliraza@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/entry/entry_64.S | 133 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9953d966d124..0194f43bc58e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,80 @@ SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
 	int3
 SYM_CODE_END(entry_SYSCALL_64)
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+SYM_CODE_START(ukl_entry_SYSCALL_64)
+	/*
+	 * syscalls will always come from user code so we don't need to
+	 * check stack cs value. We will leave that as 0x10, because
+	 * kernel entry and exit code will always run on syscall path,
+	 * no need to check cs on stack
+	 */
+	UNWIND_HINT_EMPTY
+
+	pushq	%rax
+	call	enter_ukl_kernel
+	popq	%rax
+
+	/* tss.sp2 is scratch space. */
+	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$__KERNEL_DS				/* pt_regs->ss */
+	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
+	/*
+	 * pushfq has correct flags because all instructions before it
+	 * don't touch the flags
+	 */
+	pushfq						/* pt_regs->flags */
+	pushq	$__KERNEL_CS				/* pt_regs->cs */
+	pushq	%rcx					/* pt_regs->ip */
+
+	pushq	%rax					/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	/*
+	 * Fixing up user rip because rcx contains garbage. That's
+	 * because we didn't come here through a syscall instruction,
+	 * we used call
+	 */
+	movq	RSP(%rsp), %rdi
+	movq	(%rdi), %rsi
+	movq	%rsi, RIP(%rsp)
+	subq	$8, %rdi
+	movq	EFLAGS(%rsp), %rsi	/* EFLAGS in rsi */
+	movq	%rsi, (%rdi)
+	movq	%rdi, RSP(%rsp)
+
+	/* IRQs are off. */
+	movq	%rsp, %rdi
+	/*
+	 * Sign extend the lower 32bit as syscall numbers are treated
+	 * as int
+	 */
+	movslq	%eax, %rsi
+	call	do_syscall_64		/* returns with IRQs disabled */
+
+	POP_REGS
+	/*
+	 * The stack is now user orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	addq	$8, %rsp
+
+	pushq	%rax
+	call	enter_ukl_user
+	popq	%rax
+
+	/* Swing to user stack and pop flags */
+	movq	0x18(%rsp), %rsp
+	popfq
+	retq
+SYM_CODE_END(ukl_entry_SYSCALL_64)
+#endif
+
 /*
  * %rdi: prev task
  * %rsi: next task
@@ -465,6 +539,14 @@ SYM_CODE_START(\asmsym)
 	testb	$3, CS-ORIG_RAX(%rsp)
 	jnz	.Lfrom_usermode_switch_stack_\@
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	pushq	%rax		/* save RAX so it's not overwritten on return */
+	call	is_ukl_thread	/* Check our execution context */
+	cmpq	$2, %rax
+	popq	%rax
+	je	.Lfrom_usermode_switch_stack_\@
+#endif
+
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
 
@@ -520,6 +602,14 @@ SYM_CODE_START(\asmsym)
 	testb	$3, CS-ORIG_RAX(%rsp)
 	jnz	.Lfrom_usermode_switch_stack_\@
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	pushq	%rax		/* save RAX so it's not overwritten on return */
+	call	is_ukl_thread	/* Check execution context */
+	cmpq	$2, %rax
+	popq	%rax
+	je	.Lfrom_usermode_switch_stack_\@
+#endif
+
 	/*
 	 * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
 	 * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
@@ -577,6 +667,11 @@ SYM_CODE_START(\asmsym)
 	ASM_CLAC
 	cld
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	movq	$0x2, (%rsp)
+	jmp	asm_exc_page_fault
+#endif
+
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
 	UNWIND_HINT_REGS
@@ -655,6 +750,19 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 
 	/* Restore RDI. */
 	popq	%rdi
+
+#ifdef CONFIG_UNIKERNEL_LINUX
+	cmpq	$0x33, 8(%rsp)
+	je	1f
+
+	pushq	%rax
+	call	enter_ukl_user
+	popq	%rax
+
+	jmp	.Lnative_iret
+1:
+#endif
+
 	swapgs
 	jmp	.Lnative_iret
 
@@ -1044,15 +1152,34 @@ SYM_CODE_START_LOCAL(error_entry)
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	testb	$3, CS+8(%rsp)
+	jnz	1f /* user threads */
+
+	pushq	%rax
+	call	is_ukl_thread
+	cmpq	$2, %rax
+	popq	%rax
+	jb	.Lerror_kernelspace
+
+	movq	$0xC3, CS+8(%rsp)
+	pushq	%rax
+	call	enter_ukl_kernel
+	popq	%rax
+	jmp	2f
+#else
 	testb	$3, CS+8(%rsp)
 	jz	.Lerror_kernelspace
+#endif
 
 	/*
 	 * We entered from user mode or we're pretending to have entered
 	 * from user mode due to an IRET fault.
 	 */
+1:
 	swapgs
 	FENCE_SWAPGS_USER_ENTRY
+2:
 	/* We have user CR3.  Change to kernel CR3. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	IBRS_ENTER
@@ -1129,6 +1256,12 @@ SYM_CODE_START_LOCAL(error_return)
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	testb	$3, CS(%rsp)
 	jz	restore_regs_and_return_to_kernel
+
+	cmpq	$0xC3, CS(%rsp)
+	jne	1f
+	movq	$0x10, CS(%rsp)
+1:
+
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(error_return)
 
-- 
2.21.3

