linux-fsdevel.vger.kernel.org archive mirror
* [RFC UKL 00/10] Unikernel Linux (UKL)
@ 2022-10-03 22:21 Ali Raza
  2022-10-03 22:21 ` [RFC UKL 01/10] kbuild: Add sections and symbols to linker script for UKL support Ali Raza
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

Unikernel Linux (UKL) is a research project aimed at integrating
application-specific optimizations into the Linux kernel. This RFC aims to
introduce this research to the community. Any feedback regarding the idea,
goals, implementation and research is highly appreciated.

Unikernels are specialized operating systems where an application is linked
directly with the kernel and runs in supervisor mode. This allows
developers to implement application-specific optimizations in the kernel,
which the application can invoke directly (without going through the
syscall path). An application can control scheduling and resource
management and directly access the hardware. The application and the kernel
can be co-optimized, e.g., through LTO, PGO, etc. All of these
optimizations, and others, can provide applications with large performance
benefits over general purpose operating systems.

Linux is the de facto operating system of today. Applications depend on its
battle-tested code base, large developer community, support for legacy
code, a huge ecosystem of tools and utilities, and a wide range of
compatible hardware and device drivers. Linux also allows some degree of
application-specific optimization through build time config options,
runtime configuration, and recently through eBPF. Still, there is a need
for even more fine-grained application-specific optimizations, and some
developers resort to kernel bypass techniques.

Unikernel Linux (UKL) aims to get the best of both worlds by bringing
application-specific optimizations to the Linux ecosystem. This way,
unmodified applications can keep getting the benefits of Linux while taking
advantage of unikernel-style optimizations. Optionally, applications can be
modified to invoke deeper optimizations.

There are two steps to unikernel-izing Linux: first, equip Linux with a
unikernel model, and second, actually use that model to implement
application-specific optimizations. This patch series focuses on the first
part. With these patches, unmodified applications can be built as Linux
unikernels, albeit with only modest performance advantages. Like
unikernels, UKL would allow an application to be statically linked into the
kernel and executed in supervisor mode. However, UKL preserves most of the
invariants and design of Linux, including a separate page-able application
portion of the address space and a pinned kernel portion, the ability to
run multiple processes, and distinct execution modes for application and
kernel code. Kernel execution mode and application execution mode are
different, e.g., the application execution mode allows application threads
to be scheduled, handle signals, etc., which do not apply to kernel
threads. An application built as a Linux unikernel will have its text and
data loaded with the kernel at boot time, while the rest of the address
space would remain unchanged. These applications invoke the system call
functionality through a function call into the kernel system call entry
point instead of through the syscall assembly instruction. UKL would
support a normal userspace so the UKL application can be started, managed,
profiled, etc., using normal command line utilities.
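
As a rough illustration of the last point (not part of the series), the
sketch below contrasts the two ways of reaching a system call. It runs in
ordinary user space: mock_kernel_entry() is a hypothetical stand-in for the
kernel entry point that a UKL application would reach with a plain call
instruction, and it only knows about getpid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Conventional path: libc issues the syscall instruction for us. */
static long getpid_via_syscall_insn(void)
{
	return syscall(SYS_getpid);
}

/*
 * Hypothetical stand-in for a kernel entry point reached by an ordinary
 * function call. In a real UKL build this would be kernel code linked
 * into the same image; here it is modelled as a plain C function.
 */
static long mock_kernel_entry(long nr)
{
	switch (nr) {
	case SYS_getpid:
		return (long)getpid();
	default:
		return -1;	/* not modelled in this sketch */
	}
}

/* Function-call path: same request, no syscall instruction at the call site. */
static long getpid_via_call(void)
{
	long pid = mock_kernel_entry(SYS_getpid);

	assert(pid > 0);
	return pid;
}
```

Both helpers return the same PID; the only difference the sketch is meant
to show is how control reaches the handler.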

Once Linux has a unikernel model, different application-specific
optimizations are possible. We have tried a few, e.g., fast system call
transitions, shared stacks to allow LTO, invoking kernel functions
directly, etc. We have seen large performance benefits; the details are not
relevant to this patch series and can be found in our paper
(https://arxiv.org/pdf/2206.00789.pdf).

UKL differs significantly from previous projects, e.g., UML, KML and LKL.
User Mode Linux (UML) is a virtual machine monitor implemented on the
syscall interface, a very different goal from UKL. Kernel Mode Linux (KML)
allows applications to run in kernel mode and replaces syscalls with
function calls. While KML stops there, UKL goes further. UKL links
applications and the kernel together, which allows further optimizations,
e.g., fast system call transitions, shared stacks to allow LTO, invoking
kernel functions directly, etc. Details can be found in the paper linked
above. Linux Kernel Library (LKL) harvests arch independent code from
Linux and takes it to userspace as a library to be linked with
applications. A host needs to provide arch dependent functionality. This
model is very different from UKL. A detailed discussion of related work is
present in the paper linked above.

See samples/ukl for a simple TCP echo server example which can be built as
a normal user space application and also as a UKL application. In the Linux
config options, a path to the compiled and partially linked application
binary can be specified. A kernel built with UKL enabled will look for the
binary at this location and link it into the kernel. Applications and
required libraries need to be compiled with the -mno-red-zone and
-mcmodel=kernel flags: kernel mode execution can trample on application red
zones, and the application needs the correct memory model in order to link
with the kernel and be loaded in the high end of the address space.
Examples of other applications such as Redis and Memcached, along with
glibc, libgcc, etc., can be found at https://github.com/unikernelLinux/ukl

List of authors and contributors:
=================================

Ali Raza - aliraza@bu.edu
Thomas Unger - tommyu@bu.edu
Matthew Boyd - mboydmcse@gmail.com
Eric Munson - munsoner@bu.edu
Parul Sohal - psohal@bu.edu
Ulrich Drepper - drepper@redhat.com
Richard W.M. Jones - rjones@redhat.com
Daniel Bristot de Oliveira - bristot@kernel.org
Larry Woodman - lwoodman@redhat.com
Renato Mancuso - rmancuso@bu.edu
Jonathan Appavoo - jappavoo@bu.edu
Orran Krieger - okrieg@bu.edu

Ali Raza (9):
  kbuild: Add sections and symbols to linker script for UKL support
  x86/boot: Load the PT_TLS segment for Unikernel configs
  sched: Add task_struct tracking of kernel or application execution
  x86/entry: Create alternate entry path for system calls
  x86/uaccess: Make access_ok UKL aware
  x86/fault: Skip checking kernel mode access to user address space for
    UKL
  x86/signal: Adjust signal handler register values and return frame
  exec: Make exec path for starting UKL application
  Kconfig: Add config option for enabling and sample for testing UKL

Eric B Munson (1):
  exec: Give userspace a method for starting UKL process

 Documentation/index.rst           |   1 +
 Documentation/ukl/ukl.rst         | 104 +++++++++++++++++++++++
 Kconfig                           |   2 +
 Makefile                          |   4 +
 arch/x86/boot/compressed/misc.c   |   3 +
 arch/x86/entry/entry_64.S         | 133 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/elf.h        |   9 +-
 arch/x86/include/asm/uaccess.h    |   8 ++
 arch/x86/kernel/process.c         |  13 +++
 arch/x86/kernel/process_64.c      |  49 ++++++++---
 arch/x86/kernel/signal.c          |  22 +++--
 arch/x86/kernel/vmlinux.lds.S     |  98 ++++++++++++++++++++++
 arch/x86/mm/fault.c               |   7 +-
 fs/binfmt_elf.c                   |  28 +++++++
 fs/exec.c                         |  75 +++++++++++++----
 include/asm-generic/sections.h    |   4 +
 include/asm-generic/vmlinux.lds.h |  32 ++++++-
 include/linux/sched.h             |  26 ++++++
 kernel/Kconfig.ukl                |  41 +++++++++
 samples/ukl/Makefile              |  16 ++++
 samples/ukl/README                |  17 ++++
 samples/ukl/syscall.S             |  28 +++++++
 samples/ukl/tcp_server.c          |  99 ++++++++++++++++++++++
 scripts/mod/modpost.c             |   4 +
 24 files changed, 785 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/ukl/ukl.rst
 create mode 100644 kernel/Kconfig.ukl
 create mode 100644 samples/ukl/Makefile
 create mode 100644 samples/ukl/README
 create mode 100644 samples/ukl/syscall.S
 create mode 100644 samples/ukl/tcp_server.c


base-commit: 4fe89d07dcc2804c8b562f6c7896a45643d34b2f
-- 
2.21.3



* [RFC UKL 01/10] kbuild: Add sections and symbols to linker script for UKL support
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-03 22:21 ` [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs Ali Raza
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

In order to link a user space executable we will need access to a few
sections that are not normally used when linking the kernel.  Add these
sections when we have selected CONFIG_UNIKERNEL_LINUX.

Add a case to not throw warnings for COMMON symbols from application code.

Make KBUILD_VMLINUX_OBJS contain the application library when UKL is
enabled.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Co-developed-by: Thomas Unger <tommyu@bu.edu>
Signed-off-by: Thomas Unger <tommyu@bu.edu>
Co-developed-by: Matthew Boyd <mboydmcse@gmail.com>
Signed-off-by: Matthew Boyd <mboydmcse@gmail.com>
Co-developed-by: Eric B Munson <munsoner@bu.edu>
Signed-off-by: Eric B Munson <munsoner@bu.edu>
Co-developed-by: Ali Raza <aliraza@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 Makefile                          |  4 ++
 arch/x86/kernel/vmlinux.lds.S     | 98 +++++++++++++++++++++++++++++++
 include/asm-generic/sections.h    |  4 ++
 include/asm-generic/vmlinux.lds.h | 32 +++++++++-
 scripts/mod/modpost.c             |  4 ++
 5 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 8478e13e9424..d072a52ed856 100644
--- a/Makefile
+++ b/Makefile
@@ -1129,6 +1129,10 @@ KBUILD_VMLINUX_LIBS := $(patsubst %/,%/lib.a, $(libs-y))
 endif
 KBUILD_VMLINUX_OBJS += $(patsubst %/,%/built-in.a, $(drivers-y))
 
+ifdef CONFIG_UNIKERNEL_LINUX
+KBUILD_VMLINUX_OBJS += $(CONFIG_UKL_ARCHIVE_PATH)
+endif
+
 export KBUILD_VMLINUX_OBJS KBUILD_VMLINUX_LIBS
 export KBUILD_LDS          := arch/$(SRCARCH)/kernel/vmlinux.lds
 # used by scripts/Makefile.package
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 15f29053cec4..cb8b33955969 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -101,6 +101,9 @@ jiffies = jiffies_64;
 
 PHDRS {
 	text PT_LOAD FLAGS(5);          /* R_E */
+#if defined(CONFIG_UNIKERNEL_LINUX) && defined(CONFIG_UKL_TLS)
+	tls PT_TLS FLAGS(6);            /* RW_ */
+#endif
 	data PT_LOAD FLAGS(6);          /* RW_ */
 #ifdef CONFIG_X86_64
 #ifdef CONFIG_SMP
@@ -146,6 +149,71 @@ SECTIONS
 #endif
 	} :text =0xcccc
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	/* Added to preserve page alignment */
+	. = ALIGN(PAGE_SIZE);
+
+	/*  */
+	.rela.plt	:
+	{
+		*(.rela.plt)
+		PROVIDE_HIDDEN (__rela_iplt_start = .);
+		*(.rela.iplt)
+		PROVIDE_HIDDEN (__rela_iplt_end = .);
+	} :text =0xcccc
+	.preinit_array	:
+	{
+		PROVIDE_HIDDEN (__preinit_array_start = .);
+		KEEP (*(.preinit_array))
+		PROVIDE_HIDDEN (__preinit_array_end = .);
+	} :text =0xcccc
+	.init_array	:
+	{
+		PROVIDE_HIDDEN (__init_array_start = .);
+		KEEP (*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*)))
+		KEEP (*(.init_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o
+			*crtend.o *crtend?.o ) .ctors))
+		PROVIDE_HIDDEN (__init_array_end = .);
+	} :text =0xcccc
+	.fini_array	:
+	{
+		PROVIDE_HIDDEN (__fini_array_start = .);
+		KEEP (*(SORT_BY_INIT_PRIORITY(.fini_array.*) SORT_BY_INIT_PRIORITY(.dtors.*)))
+		KEEP (*(.fini_array EXCLUDE_FILE (*crtbegin.o *crtbegin?.o
+			*crtend.o *crtend?.o ) .dtors))
+		PROVIDE_HIDDEN (__fini_array_end = .);
+	} :text =0xcccc
+	.ctors		:
+	{
+		/* gcc uses crtbegin.o to find the start of
+		   the constructors, so we make sure it is
+		   first.  Because this is a wildcard, it
+		   doesn't matter if the user does not
+		   actually link against crtbegin.o; the
+		   linker won't look for a file to match a
+		   wildcard.  The wildcard also means that it
+		   doesn't matter which directory crtbegin.o
+		   is in.  */
+		KEEP (*crtbegin.o(.ctors))
+		KEEP (*crtbegin?.o(.ctors))
+		/* We don't want to include the .ctor section from
+		   the crtend.o file until after the sorted ctors.
+		   The .ctor section from the crtend file contains the
+		   end of ctors marker and it must be last */
+		KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .ctors))
+		KEEP (*(SORT(.ctors.*)))
+		KEEP (*(.ctors))
+	} :text =0xcccc
+	.dtors		:
+	{
+		KEEP (*crtbegin.o(.dtors))
+		KEEP (*crtbegin?.o(.dtors))
+		KEEP (*(EXCLUDE_FILE (*crtend.o *crtend?.o ) .dtors))
+		KEEP (*(SORT(.dtors.*)))
+		KEEP (*(.dtors))
+	} :text =0xcccc
+#endif
+
 	/* End of text section, which should occupy whole number of pages */
 	_etext = .;
 	. = ALIGN(PAGE_SIZE);
@@ -208,6 +276,29 @@ SECTIONS
 
 	. = ALIGN(__vvar_page + PAGE_SIZE, PAGE_SIZE);
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+#ifdef CONFIG_UKL_TLS
+	/* Thread Local Storage sections */
+	. = ALIGN(PAGE_SIZE);
+	.tdata : ALIGN(0x200000){
+		__tls_data_start = .;
+		*(.tdata .tdata.* .gnu.linkonce.td.*)
+		__tls_data_end = .;
+	} :tls
+	.tbss : {
+		__tls_bss_start = .;
+		*(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon)
+		__tls_bss_end = .;
+	} :tls
+#else
+	. = ALIGN(PAGE_SIZE);
+	__tls_data_start = .;
+	__tls_data_end = .;
+	__tls_bss_start = .;
+	__tls_bss_end = .;
+#endif
+#endif
+
 	/* Init code and data - will be freed after init */
 	. = ALIGN(PAGE_SIZE);
 	.init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
@@ -380,8 +471,13 @@ SECTIONS
 		*(BSS_MAIN)
 		BSS_DECRYPTED
 		. = ALIGN(PAGE_SIZE);
+#ifdef CONFIG_UNIKERNEL_LINUX
+	}
+	__bss_stop = .;
+#else
 		__bss_stop = .;
 	}
+#endif
 
 	/*
 	 * The memory occupied from _text to here, __end_of_kernel_reserve, is
@@ -446,6 +542,7 @@ SECTIONS
 #endif
 	       "Unexpected GOT/PLT entries detected!")
 
+#ifndef CONFIG_UNIKERNEL_LINUX
 	/*
 	 * Sections that should stay zero sized, which is safer to
 	 * explicitly check instead of blindly discarding.
@@ -469,6 +566,7 @@ SECTIONS
 		*(.rela.*) *(.rela_*)
 	}
 	ASSERT(SIZEOF(.rela.dyn) == 0, "Unexpected run-time relocations (.rela) detected!")
+#endif
 }
 
 /*
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index db13bb620f52..42ebf251903c 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -35,6 +35,10 @@
 extern char _text[], _stext[], _etext[];
 extern char _data[], _sdata[], _edata[];
 extern char __bss_start[], __bss_stop[];
+#ifdef CONFIG_UNIKERNEL_LINUX
+extern char __tls_data_start[], __tls_data_end[];
+extern char __tls_bss_start[], __tls_bss_end[];
+#endif
 extern char __init_begin[], __init_end[];
 extern char _sinittext[], _einittext[];
 extern char __start_ro_after_init[], __end_ro_after_init[];
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 7c90b1ab3e00..4b0e4f3d4c39 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -568,6 +568,24 @@
  * code elimination is enabled, so these sections should be converted
  * to use ".." first.
  */
+#ifdef CONFIG_UNIKERNEL_LINUX
+#define TEXT_TEXT							\
+		ALIGN_FUNCTION();					\
+		*(.text.hot .text.hot.*)				\
+		*(TEXT_MAIN .text.fixup)				\
+		*(.stub .text.* .gnu.linkonce.t.*)			\
+		*(.text.unlikely .text.*_unlikely .text.unlikely.*)	\
+		*(.text.exit .text.exit.*)				\
+		*(.text.startup .text.startup.*)			\
+		*(.text.unknown .text.unknown.*)			\
+		NOINSTR_TEXT						\
+		*(.text..refcount)					\
+		*(.ref.text)						\
+		*(.text.asan.* .text.tsan.*)				\
+		TEXT_CFI_JT						\
+	MEM_KEEP(init.text*)						\
+	MEM_KEEP(exit.text*)
+#else
 #define TEXT_TEXT							\
 		ALIGN_FUNCTION();					\
 		*(.text.hot .text.hot.*)				\
@@ -580,7 +598,8 @@
 		*(.text.asan.* .text.tsan.*)				\
 		TEXT_CFI_JT						\
 	MEM_KEEP(init.text*)						\
-	MEM_KEEP(exit.text*)						\
+	MEM_KEEP(exit.text*)
+#endif
 
 
 /* sched.text is aling to function alignment to secure we have same
@@ -1029,12 +1048,23 @@
 	/* ld.bfd warns about .gnu.version* even when not emitted */	\
 	*(.gnu.version*)						\
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+#define DISCARDS							\
+	/DISCARD/ : {							\
+	EXIT_DISCARDS							\
+	EXIT_CALL							\
+	COMMON_DISCARDS							\
+	*(.gnu.glibc-stub.*)						\
+	*(.gnu.warning.*)						\
+	}
+#else
 #define DISCARDS							\
 	/DISCARD/ : {							\
 	EXIT_DISCARDS							\
 	EXIT_CALL							\
 	COMMON_DISCARDS							\
 	}
+#endif
 
 /**
  * PERCPU_INPUT - the percpu input sections
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 2c80da0220c3..a6023db6b630 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -626,6 +626,8 @@ static void handle_symbol(struct module *mod, struct elf_info *info,
 	case SHN_COMMON:
 		if (strstarts(symname, "__gnu_lto_")) {
 			/* Should warn here, but modpost runs before the linker */
+		} else if (strstarts(symname, "ukl_")) {
+			/* User code can have common symbols */
 		} else
 			warn("\"%s\" [%s] is COMMON symbol\n", symname, mod->name);
 		break;
@@ -774,6 +776,8 @@ static const char *const section_white_list[] =
 	".fmt_slot*",			/* EZchip */
 	".gnu.lto*",
 	".discard.*",
+	".gnu.warning.*",
+	".gnu.glibc-stub.*",
 	NULL
 };
 
-- 
2.21.3



* [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
  2022-10-03 22:21 ` [RFC UKL 01/10] kbuild: Add sections and symbols to linker script for UKL support Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-04 17:30   ` Andy Lutomirski
  2022-10-03 22:21 ` [RFC UKL 03/10] sched: Add task_struct tracking of kernel or application execution Ali Raza
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

The kernel normally skips loading this segment as it is not included in
standard builds. However, when linked with an application in the Unikernel
configuration, the segment will be present. Load PT_TLS when configured as
a unikernel.
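
For readers unfamiliar with the decompressor's ELF walk, the effect of the
change can be sketched in user space as below. This is only an
illustration: segment_is_loaded() and count_loaded() are hypothetical
helpers, and the `unikernel` argument stands in for
CONFIG_UNIKERNEL_LINUX. The real code is the switch in parse_elf().

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/*
 * Hypothetical helper: nonzero if a program header would be copied into
 * place. PT_TLS is only treated like PT_LOAD when `unikernel` is set.
 */
static int segment_is_loaded(const Elf64_Phdr *phdr, int unikernel)
{
	switch (phdr->p_type) {
	case PT_LOAD:
		return 1;
	case PT_TLS:
		return unikernel ? 1 : 0;
	default:
		return 0;
	}
}

/* Count how many of the n program headers would be loaded. */
static size_t count_loaded(const Elf64_Phdr *phdrs, size_t n, int unikernel)
{
	size_t i, loaded = 0;

	assert(phdrs != NULL || n == 0);
	for (i = 0; i < n; i++)
		loaded += segment_is_loaded(&phdrs[i], unikernel);
	return loaded;
}
```

With a header table holding one PT_LOAD, one PT_TLS and one PT_NOTE entry,
the count goes from one to two once the unikernel flag is set.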

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/boot/compressed/misc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index cf690d8712f4..0d07b5661c9c 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -310,6 +310,9 @@ static void parse_elf(void *output)
 		phdr = &phdrs[i];
 
 		switch (phdr->p_type) {
+#ifdef CONFIG_UNIKERNEL_LINUX
+		case PT_TLS:
+#endif
 		case PT_LOAD:
 #ifdef CONFIG_X86_64
 			if ((phdr->p_align % 0x200000) != 0)
-- 
2.21.3



* [RFC UKL 03/10] sched: Add task_struct tracking of kernel or application execution
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
  2022-10-03 22:21 ` [RFC UKL 01/10] kbuild: Add sections and symbols to linker script for UKL support Ali Raza
  2022-10-03 22:21 ` [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-03 22:21 ` [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls Ali Raza
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza, Daniel Bristot de Oliveira

Because UKL removes the barrier between kernel and user space, we need to
track if we are executing application code or kernel code to ensure that we
take the appropriate actions on transitions. When we transition to kernel
code, we need to handle RCU and on the way to user code we need to check if
scheduling needs to happen, etc.  We cannot use the CS value from the stack
because it will always be set to the kernel value.  These functions will be
used in a later change to entry_64.S to identify the execution context for
the current thread.
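
The intended use of the tracking state can be modelled in user space as
follows. This is a sketch under stated assumptions, not kernel code:
struct mock_task and the mock_* functions are hypothetical stand-ins for
task_struct, current and the helpers this patch adds, and the three state
values are the ones defined in the patch.

```c
#include <assert.h>

enum {
	NON_UKL_THREAD	= 0,	/* ordinary task */
	UKL_KERNEL	= 1,	/* UKL thread running kernel code */
	UKL_APPLICATION	= 2,	/* UKL thread running application code */
};

/* Stand-in for the single task_struct field that matters here. */
struct mock_task {
	int ukl_thread;
};

/* Stand-in for `current`; starts out in application code. */
static struct mock_task mock_current = { UKL_APPLICATION };

static int mock_is_ukl_thread(void)
{
	return mock_current.ukl_thread;
}

static void mock_enter_ukl_kernel(void)
{
	mock_current.ukl_thread = UKL_KERNEL;
}

static void mock_enter_ukl_user(void)
{
	mock_current.ukl_thread = UKL_APPLICATION;
}

/*
 * Model one application -> kernel -> application round trip and return
 * the state observed while "inside the kernel".
 */
static int mock_syscall_round_trip(void)
{
	int seen;

	assert(mock_is_ukl_thread() == UKL_APPLICATION);
	mock_enter_ukl_kernel();	/* entry: RCU etc. would be handled here */
	seen = mock_is_ukl_thread();
	mock_enter_ukl_user();		/* exit: reschedule checks would go here */
	return seen;
}
```

The point of the model is only that the state, not the saved CS value,
tells entry and exit code which side of the boundary a UKL thread is on.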

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Co-developed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Co-developed-by: Ali Raza <aliraza@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/kernel/process_64.c | 22 ++++++++++++++++++++++
 include/linux/sched.h        | 26 ++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1962008fe743..e9e4a2946452 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -501,6 +501,28 @@ void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase)
 	task->thread.gsbase = gsbase;
 }
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+/*
+ * 0 = Non UKL thread
+ * 1 = UKL thread - in kernel code
+ * 2 = UKL thread - in application code
+ */
+int is_ukl_thread(void)
+{
+	return current->ukl_thread;
+}
+
+void enter_ukl_user(void)
+{
+	current->ukl_thread = UKL_APPLICATION;
+}
+
+void enter_ukl_kernel(void)
+{
+	current->ukl_thread = UKL_KERNEL;
+}
+#endif
+
 static void
 start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 		    unsigned long new_sp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e7b2f8a5c711..b8bf50ae0fda 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -746,6 +746,13 @@ struct task_struct {
 	randomized_struct_fields_start
 
 	void				*stack;
+#ifdef CONFIG_UNIKERNEL_LINUX
+	/*
+	 * Indicator used for threads in a UKL application, 0 means non-UKL thread, 1 is UKL thread
+	 * in kernel text, 2 is UKL thread in application text
+	 */
+	int				ukl_thread;
+#endif
 	refcount_t			usage;
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
@@ -1529,6 +1536,25 @@ struct task_struct {
 	 */
 };
 
+/*
+ * 0 = Non UKL thread
+ * 1 = UKL thread - in kernel code
+ * 2 = UKL thread - in application code
+ */
+#define NON_UKL_THREAD 0
+#define UKL_KERNEL 1
+#define UKL_APPLICATION 2
+
+#ifdef CONFIG_UNIKERNEL_LINUX
+int is_ukl_thread(void);
+void enter_ukl_user(void);
+void enter_ukl_kernel(void);
+#else
+static inline int is_ukl_thread(void) { return NON_UKL_THREAD; }
+static inline void enter_ukl_user(void) {}
+static inline void enter_ukl_kernel(void) {}
+#endif
+
 static inline struct pid *task_pid(struct task_struct *task)
 {
 	return task->thread_pid;
-- 
2.21.3



* [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (2 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 03/10] sched: Add task_struct tracking of kernel or application execution Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-04 17:43   ` Andy Lutomirski
  2022-10-03 22:21 ` [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware Ali Raza
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza, Daniel Bristot de Oliveira

If a UKL application makes a system call, it does not use the syscall
assembly instruction. Instead, the application uses the call instruction
to go to the kernel entry point. Rather than adding checks to the normal
entry_SYSCALL_64 to see if we came here from a UKL task or a normal
application task, we create a totally new entry point called
ukl_entry_SYSCALL_64. This allows the normal entry point to be unchanged
and simplifies the UKL specific code as well.

ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64 except that it has to
populate %rcx with the return address manually (the syscall instruction
does that automatically for normal application tasks). This allows the
pt_regs to be correct. Also, we have to push the flags onto the user stack,
because on the return path, we first switch to the user stack, then pop the
flags and then return. Popping the flags would re-enable interrupts, so we
don't want to be stuck on the kernel stack when an interrupt hits. All this
can be done with an iret instruction, but a call/iret pair performs much
slower than a call/ret pair.

Also, on the entry path, we make sure the context flag, i.e., in_user, is
set to 1 to indicate we are now in kernel context, so any new interrupts
don't have to go through kernel entry code again. This is normally done
with the CS value on the stack, but in the UKL case that will always be a
kernel value. On the way back, in_user is switched back to 2 to indicate
that application context is now being entered. All non-UKL tasks have the
in_user value set to 0.

The UKL application uses a slightly different value for CS: instead of
0x33, we use 0xC3. As most of the tests compare only the least significant
nibble, they behave as expected. The C value in the second nibble allows us
to distinguish between user space and UKL application code.
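
The arithmetic behind this can be checked with a few lines of C. The two
selector values come from the text above and 0x10 is the kernel CS value
mentioned in the entry code comment; the helper names and the exact masks
are assumptions made for this sketch, not the tests the kernel actually
performs.

```c
#include <assert.h>
#include <stdint.h>

#define USER_CS_VALUE	0x33u	/* normal user space CS, per the text */
#define UKL_CS_VALUE	0xC3u	/* CS used for UKL application code */
#define KERNEL_CS_VALUE	0x10u	/* kernel CS */

/* A test that looks only at the least significant nibble. */
static int cs_low_nibble_is_user(uint16_t cs)
{
	return (cs & 0x0Fu) == (USER_CS_VALUE & 0x0Fu);
}

/* A test that looks at the second nibble to single out UKL code. */
static int cs_is_ukl_application(uint16_t cs)
{
	int is_ukl = (cs & 0xF0u) == (UKL_CS_VALUE & 0xF0u);

	/* In this sketch, any UKL value must also pass the low-nibble test. */
	assert(!is_ukl || cs_low_nibble_is_user(cs));
	return is_ukl;
}
```

Both 0x33 and 0xC3 pass the low-nibble test while 0x10 does not, and only
0xC3 passes the second-nibble test.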

The rest of the code makes sure the above mentioned in_user context
tracking is done for all entry and exit cases, i.e., for interrupts,
exceptions, etc.  If it is a UKL task and the in_user value is 2, we treat
it as an application task, and if it is 1, we treat it as coming from
kernel context. We skip these checks if in_user is 0.

The changes to swapgs_restore_regs_and_return_to_usermode also make sure
that in_user is correct before we iret back.

Double fault handling is a special case. Normally, if a user stack suffers
a page fault, hardware switches to a kernel stack and pushes a frame onto
the kernel stack. This switch only happens if the execution was in user
privilege level when the page fault occurred. For UKL, execution is always
in kernel level, so when the user stack suffers a page fault, no switch to
a pinned kernel stack happens, and hardware tries to push state on the
already faulting user stack. This generates a double fault. So we handle
this case in the double fault handler by assuming any double fault is
actually a user stack page fault. This can also be fixed by making all page
faults go through a pinned stack using the IST mechanism. We have tried and
tested that, but in the interest of touching as little code as possible, we
chose this option instead.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Co-developed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Co-developed-by: Thomas Unger <tommyu@bu.edu>
Signed-off-by: Thomas Unger <tommyu@bu.edu>
Co-developed-by: Ali Raza <aliraza@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/entry/entry_64.S | 133 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 133 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9953d966d124..0194f43bc58e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -229,6 +229,80 @@ SYM_INNER_LABEL(entry_SYSRETQ_end, SYM_L_GLOBAL)
 	int3
 SYM_CODE_END(entry_SYSCALL_64)
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+SYM_CODE_START(ukl_entry_SYSCALL_64)
+	/*
+	 * syscalls will always come from user code so we don't need to
+	 * check stack cs value. We will leave that as 0x10, because
+	 * kernel entry and exit code will always run on syscall path,
+	 * no need to check cs on stack
+	 */
+	UNWIND_HINT_EMPTY
+
+	pushq	%rax
+	call	enter_ukl_kernel
+	popq	%rax
+
+	/* tss.sp2 is scratch space. */
+	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$__KERNEL_DS				/* pt_regs->ss */
+	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
+	/*
+	 * pushfq has correct flags because all instructions before it
+	 * don't touch the flags
+	 */
+	pushfq						/* pt_regs->flags */
+	pushq	$__KERNEL_CS				/* pt_regs->cs */
+	pushq	%rcx					/* pt_regs->ip */
+
+	pushq	%rax					/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	/*
+	 * Fixing up user rip because rcx contains garbage. That's
+	 * because we didn't come here through a syscall instruction,
+	 * we used call
+	 */
+	movq	RSP(%rsp), %rdi
+	movq	(%rdi), %rsi
+	movq	%rsi, RIP(%rsp)
+	subq	$8, %rdi
+	movq	EFLAGS(%rsp), %rsi	/* EFLAGS in rsi */
+	movq	%rsi, (%rdi)
+	movq	%rdi, RSP(%rsp)
+
+	/* IRQs are off. */
+	movq	%rsp, %rdi
+	/*
+	 * Sign extend the lower 32bit as syscall numbers are treated
+	 * as int
+	 */
+	movslq	%eax, %rsi
+	call	do_syscall_64		/* returns with IRQs disabled */
+
+	POP_REGS
+	/*
+	 * The stack is now user orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	addq	$8, %rsp
+
+	pushq	%rax
+	call	enter_ukl_user
+	popq	%rax
+
+	/* Swing to user stack and pop flags */
+	movq	0x18(%rsp), %rsp
+	popfq
+	retq
+SYM_CODE_END(ukl_entry_SYSCALL_64)
+#endif
+
 /*
  * %rdi: prev task
  * %rsi: next task
@@ -465,6 +539,14 @@ SYM_CODE_START(\asmsym)
 	testb	$3, CS-ORIG_RAX(%rsp)
 	jnz	.Lfrom_usermode_switch_stack_\@
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	pushq	%rax		/* save RAX so it's not overwritten on return */
+	call	is_ukl_thread	/* Check our execution context */
+	cmpq	$2, %rax
+	popq	%rax
+	je	.Lfrom_usermode_switch_stack_\@
+#endif
+
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
 
@@ -520,6 +602,14 @@ SYM_CODE_START(\asmsym)
 	testb	$3, CS-ORIG_RAX(%rsp)
 	jnz	.Lfrom_usermode_switch_stack_\@
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	pushq	%rax		/* save RAX so it's not overwritten on return */
+	call	is_ukl_thread	/* Check execution context */
+	cmpq	$2, %rax
+	popq	%rax
+	je	.Lfrom_usermode_switch_stack_\@
+#endif
+
 	/*
 	 * paranoid_entry returns SWAPGS flag for paranoid_exit in EBX.
 	 * EBX == 0 -> SWAPGS, EBX == 1 -> no SWAPGS
@@ -577,6 +667,11 @@ SYM_CODE_START(\asmsym)
 	ASM_CLAC
 	cld
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	movq	$0x2, (%rsp)
+	jmp	asm_exc_page_fault
+#endif
+
 	/* paranoid_entry returns GS information for paranoid_exit in EBX. */
 	call	paranoid_entry
 	UNWIND_HINT_REGS
@@ -655,6 +750,19 @@ SYM_INNER_LABEL(swapgs_restore_regs_and_return_to_usermode, SYM_L_GLOBAL)
 
 	/* Restore RDI. */
 	popq	%rdi
+
+#ifdef CONFIG_UNIKERNEL_LINUX
+	cmpq	$0x33, 8(%rsp)
+	je	1f
+
+	pushq	%rax
+	call	enter_ukl_user
+	popq	%rax
+
+	jmp	.Lnative_iret
+1:
+#endif
+
 	swapgs
 	jmp	.Lnative_iret
 
@@ -1044,15 +1152,34 @@ SYM_CODE_START_LOCAL(error_entry)
 	PUSH_AND_CLEAR_REGS save_ret=1
 	ENCODE_FRAME_POINTER 8
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+	testb	$3, CS+8(%rsp)
+	jnz	1f /* user threads */
+
+	pushq	%rax
+	call	is_ukl_thread
+	cmpq	$2, %rax
+	popq	%rax
+	jb	.Lerror_kernelspace
+
+	movq	$0xC3, CS+8(%rsp)
+	pushq	%rax
+	call	enter_ukl_kernel
+	popq	%rax
+	jmp	2f
+#else
 	testb	$3, CS+8(%rsp)
 	jz	.Lerror_kernelspace
+#endif
 
 	/*
 	 * We entered from user mode or we're pretending to have entered
 	 * from user mode due to an IRET fault.
 	 */
+1:
 	swapgs
 	FENCE_SWAPGS_USER_ENTRY
+2:
 	/* We have user CR3.  Change to kernel CR3. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	IBRS_ENTER
@@ -1129,6 +1256,12 @@ SYM_CODE_START_LOCAL(error_return)
 	DEBUG_ENTRY_ASSERT_IRQS_OFF
 	testb	$3, CS(%rsp)
 	jz	restore_regs_and_return_to_kernel
+
+	cmpq	$0xC3, CS(%rsp)
+	jne	1f
+	movq	$0x10, CS(%rsp)
+1:
+
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(error_return)
 
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (3 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-04 17:36   ` Andy Lutomirski
  2022-10-03 22:21 ` [RFC UKL 06/10] x86/fault: Skip checking kernel mode access to user address space for UKL Ali Raza
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

When configured for UKL, access_ok needs to account for the unified address
space that is used by the kernel and the process being run. To do this,
it needs to check the task struct field added earlier to determine where
the execution that is making the check is running. For a zero value, the
normal boundary definitions apply, but a non-zero value indicates a UKL
thread, and a shared address space should be assumed.
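
A user-space sketch of the decision described above. The limit constant,
the sketch_in_user variable and the function names are assumptions made
for illustration; in the kernel the field lives in the task struct and the
range check is __access_ok.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative user-space limit; the kernel derives its own from TASK_SIZE_MAX. */
#define SKETCH_USER_LIMIT 0x00007ffffffff000ull

/* Stands in for the task struct field: 0 means a non-UKL task. */
static int sketch_in_user;

/* Normal boundary check: the whole range must sit below the limit. */
static bool sketch_range_ok(uint64_t addr, uint64_t size)
{
	return size <= SKETCH_USER_LIMIT && addr <= SKETCH_USER_LIMIT - size;
}

/* UKL threads share one address space, so every range is accepted for them. */
static bool sketch_access_ok(uint64_t addr, uint64_t size)
{
	if (sketch_in_user != 0)
		return true;
	return sketch_range_ok(addr, size);
}
```

Note that for a UKL thread this accepts kernel addresses as well, which is
the trade-off the shared address space implies.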

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/include/asm/uaccess.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..adef521b2e59 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -37,11 +37,19 @@ static inline bool pagefault_disabled(void);
  * Return: true (nonzero) if the memory block may be valid, false (zero)
  * if it is definitely invalid.
  */
+#ifdef CONFIG_UNIKERNEL_LINUX
+#define access_ok(addr, size)					\
+({									\
+	WARN_ON_IN_IRQ();						\
+	(is_ukl_thread() ? 1 : likely(__access_ok(addr, size)));	\
+})
+#else
 #define access_ok(addr, size)					\
 ({									\
 	WARN_ON_IN_IRQ();						\
 	likely(__access_ok(addr, size));				\
 })
+#endif
 
 #include <asm-generic/access_ok.h>
 
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC UKL 06/10] x86/fault: Skip checking kernel mode access to user address space for UKL
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (4 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-03 22:21 ` [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame Ali Raza
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

Normally, this check ensures that a kernel task has not somehow ended up
raising a page fault in the user part of the address space. This is done by
checking the CS value on the stack. UKL always has the kernel value, so this
check will always fail. This change makes sure that the check is only done
for non-UKL tasks by consulting the in_user flag.
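
The resulting condition can be written as a plain boolean function. This
is only a sketch: the parameter names are placeholders for the results of
mmap_read_trylock(), user_mode(), search_exception_tables() and
is_ukl_thread() in the patched code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when the fault should be treated as an unexpected kernel
 * fault, i.e. the lock could not be taken, the fault did not come from
 * user mode, there is no exception table entry, and the faulting thread
 * is not a UKL thread.
 */
static bool sketch_is_unexpected_kernel_fault(bool trylock_failed,
					      bool from_user_mode,
					      bool has_exception_fixup,
					      bool ukl_thread)
{
	return trylock_failed && !from_user_mode &&
	       !has_exception_fixup && !ukl_thread;
}
```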

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/mm/fault.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fa71a5d12e87..26de3556ca2c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1328,7 +1328,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * on well-defined single instructions listed in the exception
 	 * tables.  But, an erroneous kernel fault occurring outside one of
 	 * those areas which also holds mmap_lock might deadlock attempting
-	 * to validate the fault against the address space.
+	 * to validate the fault against the address space. However, if we
+	 * are configured as a unikernel and the faulting thread is the UKL
+	 * application code we can proceed as normal.
 	 *
 	 * Only do the expensive exception table search when we might be at
 	 * risk of a deadlock.  This happens if we
@@ -1336,7 +1338,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * 2. The access did not originate in userspace.
 	 */
 	if (unlikely(!mmap_read_trylock(mm))) {
-		if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
+		if (!user_mode(regs) && !search_exception_tables(regs->ip) &&
+				!is_ukl_thread()) {
 			/*
 			 * Fault from code in kernel from
 			 * which we do not expect faults.
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (5 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 06/10] x86/fault: Skip checking kernel mode access to user address space for UKL Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-04 17:34   ` Andy Lutomirski
  2022-10-03 22:21 ` [RFC UKL 08/10] exec: Make exec path for starting UKL application Ali Raza
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

For a UKL thread, returning to a signal handler is not done with iret or
sysret.  This means we need to adjust the way the return stack frame is
handled for these threads.  When constructing the signal frame, we leave
the previous frame in place because we will return to it from the signal
handler.  We also leave space for pushing eflags and the return address.
UKL threads will only use the __KERNEL_DS value in the ss register and 0xC3
in the cs register.
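
As a rough illustration of the frame adjustment, the rt_sigreturn hunk in
this patch locates the signal frame one word above the stack pointer for
UKL threads and one word below it otherwise. The function name below is
hypothetical; only the arithmetic mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compute where rt_sigreturn expects the signal frame, given the saved sp. */
static uintptr_t sketch_sigframe_addr(uintptr_t sp, bool ukl_thread)
{
	if (ukl_thread)
		return sp + sizeof(long); /* UKL: frame sits one word above sp */
	return sp - sizeof(long);         /* regular task: one word below sp */
}
```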

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Co-developed-by: Eric B Munson <munsoner@bu.edu>
Signed-off-by: Eric B Munson <munsoner@bu.edu>
Co-developed-by: Ali Raza <aliraza@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/kernel/signal.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9c7265b524c7..a95c12f6dac6 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -121,8 +121,10 @@ static bool restore_sigcontext(struct pt_regs *regs,
 #endif /* CONFIG_X86_64 */
 
 	/* Get CS/SS and force CPL3 */
-	regs->cs = sc.cs | 0x03;
-	regs->ss = sc.ss | 0x03;
+	if (!is_ukl_thread()) {
+		regs->cs = sc.cs | 0x03;
+		regs->ss = sc.ss | 0x03;
+	}
 
 	regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS);
 	/* disable syscall checks */
@@ -522,10 +524,15 @@ static int __setup_rt_frame(int sig, struct ksignal *ksig,
 	 * a trampoline.)  So we do our best: if the old SS was valid,
 	 * we keep it.  Otherwise we replace it.
 	 */
-	regs->cs = __USER_CS;
+	if (!is_ukl_thread()) {
+		regs->cs = __USER_CS;
 
-	if (unlikely(regs->ss != __USER_DS))
-		force_valid_ss(regs);
+		if (unlikely(regs->ss != __USER_DS))
+			force_valid_ss(regs);
+	} else {
+		regs->cs = 0xC3;
+		regs->ss = __KERNEL_DS;
+	}
 
 	return 0;
 
@@ -662,7 +669,10 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	sigset_t set;
 	unsigned long uc_flags;
 
-	frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
+	if (is_ukl_thread())
+		frame = (struct rt_sigframe __user *)(regs->sp + sizeof(long));
+	else
+		frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
 	if (!access_ok(frame, sizeof(*frame)))
 		goto badframe;
 	if (__get_user(*(__u64 *)&set, (__u64 __user *)&frame->uc.uc_sigmask))
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC UKL 08/10] exec: Make exec path for starting UKL application
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (6 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-03 22:21 ` [RFC UKL 09/10] exec: Give userspace a method for starting UKL process Ali Raza
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

The UKL application still relies on much of the setup done to start a
standard user space process, so we still need to use much of that path.
There are several steps that the UKL application doesn't need or want, so
we bypass them in the case of UKL. These are: ELF loading, because the
application is part of the kernel image; and segment register value
initialization. We need to record a starting location for the application
heap; this is normally the end of the ELF binary, once loaded. We choose an
arbitrary low address because there is no binary to load. We also hardcode
the entry point for the application to ukl__start, which is the glibc entry
point (_start) plus the 'ukl_' prefix.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Suggested-by: Thomas Unger <tommyu@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 arch/x86/include/asm/elf.h   |  9 ++++--
 arch/x86/kernel/process.c    | 13 +++++++++
 arch/x86/kernel/process_64.c | 27 ++++++++++--------
 fs/binfmt_elf.c              | 28 ++++++++++++++++++
 fs/exec.c                    | 55 ++++++++++++++++++++++++++----------
 5 files changed, 103 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index cb0ff1055ab1..91b6efafb46f 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -6,6 +6,7 @@
  * ELF register definitions..
  */
 #include <linux/thread_info.h>
+#include <linux/sched.h>
 
 #include <asm/ptrace.h>
 #include <asm/user.h>
@@ -164,9 +165,11 @@ static inline void elf_common_init(struct thread_struct *t,
 	regs->si = regs->di = regs->bp = 0;
 	regs->r8 = regs->r9 = regs->r10 = regs->r11 = 0;
 	regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0;
-	t->fsbase = t->gsbase = 0;
-	t->fsindex = t->gsindex = 0;
-	t->ds = t->es = ds;
+	if (!is_ukl_thread()) {
+		t->fsbase = t->gsbase = 0;
+		t->fsindex = t->gsindex = 0;
+		t->ds = t->es = ds;
+	}
 }
 
 #define ELF_PLAT_INIT(_r, load_addr)			\
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 58a6ea472db9..8395fc0c3398 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -192,6 +192,19 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	frame->bx = 0;
 	*childregs = *current_pt_regs();
 	childregs->ax = 0;
+
+#ifdef CONFIG_UNIKERNEL_LINUX
+	/*
+	 * UKL leaves return address and flags on user stack. This works
+	 * fine for clone (i.e., VM shared) but not for 'fork' style
+	 * clone (i.e., VM not shared). This is where we clean those extra
+	 * elements from user stack.
+	 */
+	if (is_ukl_thread() && !(clone_flags & CLONE_VM)) {
+		childregs->sp += 2*(sizeof(long));
+	}
+#endif
+
 	if (sp)
 		childregs->sp = sp;
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e9e4a2946452..cf007b95d684 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -530,21 +530,26 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
 {
 	WARN_ON_ONCE(regs != current_pt_regs());
 
-	if (static_cpu_has(X86_BUG_NULL_SEG)) {
-		/* Loading zero below won't clear the base. */
-		loadsegment(fs, __USER_DS);
-		load_gs_index(__USER_DS);
-	}
+	if (!is_ukl_thread()) {
+		if (static_cpu_has(X86_BUG_NULL_SEG)) {
+			/* Loading zero below won't clear the base. */
+			loadsegment(fs, __USER_DS);
+			load_gs_index(__USER_DS);
+		}
 
-	loadsegment(fs, 0);
-	loadsegment(es, _ds);
-	loadsegment(ds, _ds);
-	load_gs_index(0);
+		loadsegment(fs, 0);
+		loadsegment(es, _ds);
+		loadsegment(ds, _ds);
+		load_gs_index(0);
 
+		regs->cs		= _cs;
+		regs->ss		= _ss;
+	} else {
+		regs->cs		= __KERNEL_CS;
+		regs->ss		= __KERNEL_DS;
+	}
 	regs->ip		= new_ip;
 	regs->sp		= new_sp;
-	regs->cs		= _cs;
-	regs->ss		= _ss;
 	regs->flags		= X86_EFLAGS_IF;
 }
 
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 63c7ebb0da89..1c91f1179398 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -845,6 +845,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct pt_regs *regs;
 
 	retval = -ENOEXEC;
+
+	if (is_ukl_thread())
+		goto UKL_SKIP_READING_ELF;
+
 	/* First of all, some simple consistency checks */
 	if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)
 		goto out;
@@ -998,6 +1002,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (retval)
 		goto out_free_dentry;
 
+UKL_SKIP_READING_ELF:
 	/* Flush all traces of the currently running executable */
 	retval = begin_new_exec(bprm);
 	if (retval)
@@ -1029,6 +1034,17 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	start_data = 0;
 	end_data = 0;
 
+	if (is_ukl_thread()) {
+		/*
+		 * load_bias needs to ensure that we push the heap start
+		 * past the end of the executable, but in this case, it is
+		 * already mapped with the kernel text.  So we select an
+		 * address that is "high enough"
+		 */
+		load_bias = 0x405000;
+		goto UKL_SKIP_LOADING_ELF;
+	}
+
 	/* Now we do a little grungy work by mmapping the ELF image into
 	   the correct location in memory. */
 	for(i = 0, elf_ppnt = elf_phdata;
@@ -1224,6 +1240,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		}
 	}
 
+UKL_SKIP_LOADING_ELF:
 	e_entry = elf_ex->e_entry + load_bias;
 	phdr_addr += load_bias;
 	elf_bss += load_bias;
@@ -1246,6 +1263,16 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		goto out_free_dentry;
 	}
 
+	if (is_ukl_thread()) {
+		/*
+		 * We know that this symbol exists and that it is the entry
+		 * point for the linked application.
+		 */
+		extern void ukl__start(void);
+		elf_entry = (unsigned long) ukl__start;
+		goto UKL_SKIP_FINDING_ELF_ENTRY;
+	}
+
 	if (interpreter) {
 		elf_entry = load_elf_interp(interp_elf_ex,
 					    interpreter,
@@ -1283,6 +1310,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 
 	set_binfmt(&elf_format);
 
+UKL_SKIP_FINDING_ELF_ENTRY:
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
 	retval = ARCH_SETUP_ADDITIONAL_PAGES(bprm, elf_ex, !!interpreter);
 	if (retval < 0)
diff --git a/fs/exec.c b/fs/exec.c
index d046dbb9cbd0..4ae06fcf7436 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1246,9 +1246,11 @@ int begin_new_exec(struct linux_binprm * bprm)
 	int retval;
 
 	/* Once we are committed compute the creds */
-	retval = bprm_creds_from_file(bprm);
-	if (retval)
-		return retval;
+	if (!is_ukl_thread()) {
+		retval = bprm_creds_from_file(bprm);
+		if (retval)
+			return retval;
+	}
 
 	/*
 	 * Ensure all future errors are fatal.
@@ -1282,9 +1284,11 @@ int begin_new_exec(struct linux_binprm * bprm)
 		goto out;
 
 	/* If the binary is not readable then enforce mm->dumpable=0 */
-	would_dump(bprm, bprm->file);
-	if (bprm->have_execfd)
-		would_dump(bprm, bprm->executable);
+	if (!is_ukl_thread()) {
+		would_dump(bprm, bprm->file);
+		if (bprm->have_execfd)
+			would_dump(bprm, bprm->executable);
+	}
 
 	/*
 	 * Release all of the old mmap stuff
@@ -1509,6 +1513,11 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
 	if (!bprm)
 		goto out;
 
+	if (is_ukl_thread()) {
+		bprm->filename = "UKL";
+		goto out_ukl;
+	}
+
 	if (fd == AT_FDCWD || filename->name[0] == '/') {
 		bprm->filename = filename->name;
 	} else {
@@ -1522,6 +1531,8 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
 
 		bprm->filename = bprm->fdpath;
 	}
+
+out_ukl:
 	bprm->interp = bprm->filename;
 
 	retval = bprm_mm_init(bprm);
@@ -1708,6 +1719,15 @@ static int search_binary_handler(struct linux_binprm *bprm)
 	struct linux_binfmt *fmt;
 	int retval;
 
+	if (is_ukl_thread()) {
+		list_for_each_entry(fmt, &formats, lh) {
+			retval = fmt->load_binary(bprm);
+			if (retval == 0)
+				return retval;
+		}
+		goto out_ukl;
+	}
+
 	retval = prepare_binprm(bprm);
 	if (retval < 0)
 		return retval;
@@ -1717,7 +1737,7 @@ static int search_binary_handler(struct linux_binprm *bprm)
 		return retval;
 
 	retval = -ENOENT;
- retry:
+retry:
 	read_lock(&binfmt_lock);
 	list_for_each_entry(fmt, &formats, lh) {
 		if (!try_module_get(fmt->module))
@@ -1745,6 +1765,7 @@ static int search_binary_handler(struct linux_binprm *bprm)
 		goto retry;
 	}
 
+out_ukl:
 	return retval;
 }
 
@@ -1799,7 +1820,7 @@ static int exec_binprm(struct linux_binprm *bprm)
 static int bprm_execve(struct linux_binprm *bprm,
 		       int fd, struct filename *filename, int flags)
 {
-	struct file *file;
+	struct file *file = NULL;
 	int retval;
 
 	retval = prepare_bprm_creds(bprm);
@@ -1809,10 +1830,12 @@ static int bprm_execve(struct linux_binprm *bprm,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_execat(fd, filename, flags);
-	retval = PTR_ERR(file);
-	if (IS_ERR(file))
-		goto out_unmark;
+	if (!is_ukl_thread()) {
+		file = do_open_execat(fd, filename, flags);
+		retval = PTR_ERR(file);
+		if (IS_ERR(file))
+			goto out_unmark;
+	}
 
 	sched_exec();
 
@@ -1830,9 +1853,11 @@ static int bprm_execve(struct linux_binprm *bprm,
 		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
 
 	/* Set the unchanging part of bprm->cred */
-	retval = security_bprm_creds_for_exec(bprm);
-	if (retval)
-		goto out;
+	if (!is_ukl_thread()) {
+		retval = security_bprm_creds_for_exec(bprm);
+		if (retval)
+			goto out;
+	}
 
 	retval = exec_binprm(bprm);
 	if (retval < 0)
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC UKL 09/10] exec: Give userspace a method for starting UKL process
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (7 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 08/10] exec: Make exec path for starting UKL application Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-04 17:35   ` Andy Lutomirski
  2022-10-03 22:21 ` [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL Ali Raza
  2022-10-06 21:27 ` [RFC UKL 00/10] Unikernel Linux (UKL) H. Peter Anvin
  10 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

From: Eric B Munson <munsoner@bu.edu>

The UKL process might depend on setup that is to be done by user space
prior to its initialization.  We need a way to let user space signal that it
is ready for the UKL process to run. We set up a special name for this
process in the kernel config, and if this name is passed to exec, that will
start the UKL process. This way, if user space setup is required, we can be
sure that the process doesn't run until explicitly started.

If a more traditional unikernel execution is desired, set the init= boot
param to the UKL process name.
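
The name match itself is a plain string comparison. The sketch below
assumes a placeholder name; the real value comes from CONFIG_UKL_NAME in
the kernel config, and the helper name is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for CONFIG_UKL_NAME; the real value is set in Kconfig. */
#define SKETCH_UKL_NAME "/UKL"

/* True only when the exec'd path is exactly the configured UKL name. */
static bool sketch_is_ukl_exec(const char *name)
{
	return name != NULL && strcmp(name, SKETCH_UKL_NAME) == 0;
}
```

Any other path, including one that merely starts with the configured name,
takes the normal exec path.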

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Suggested-by: Thomas Unger <tommyu@bu.edu>
Signed-off-by: Eric B Munson <munsoner@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 fs/exec.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 4ae06fcf7436..e30c6beb209b 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1888,6 +1888,22 @@ static int bprm_execve(struct linux_binprm *bprm,
 	return retval;
 }
 
+#ifdef CONFIG_UNIKERNEL_LINUX
+static void check_ukl_exec(const char *name)
+{
+	if (!strcmp(name, CONFIG_UKL_NAME)) {
+		pr_debug("In PID %d and current->ukl_thread is %d\nGoing to create UKL here.\n",
+				current->pid, is_ukl_thread());
+		enter_ukl_kernel();
+	}
+}
+#else
+static void check_ukl_exec(const char *name)
+{
+	(void)name;
+}
+#endif
+
 static int do_execveat_common(int fd, struct filename *filename,
 			      struct user_arg_ptr argv,
 			      struct user_arg_ptr envp,
@@ -1899,6 +1915,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	if (IS_ERR(filename))
 		return PTR_ERR(filename);
 
+	check_ukl_exec(filename->name);
+
 	/*
 	 * We move the actual failure in case of RLIMIT_NPROC excess from
 	 * set*uid() to execve() because too many poorly written programs
@@ -1985,6 +2003,8 @@ int kernel_execve(const char *kernel_filename,
 	if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
 		return -EINVAL;
 
+	check_ukl_exec(kernel_filename);
+
 	filename = getname_kernel(kernel_filename);
 	if (IS_ERR(filename))
 		return PTR_ERR(filename);
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread
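
The exec hook in the patch above compares the name passed to exec against
CONFIG_UKL_NAME with strcmp(), so only the exact configured string starts
the UKL process. A minimal user-space sketch of that comparison follows;
the macro UKL_NAME and the function names are hypothetical stand-ins, and
"/UKL" is assumed only because it is the Kconfig default in this series.

```c
#include <assert.h>
#include <string.h>

/* Assumption: stands in for CONFIG_UKL_NAME ("/UKL" is the Kconfig default). */
#define UKL_NAME "/UKL"

/* Hypothetical model of the test in check_ukl_exec(): exact string match. */
int names_ukl_target(const char *name)
{
	return strcmp(name, UKL_NAME) == 0;
}

/* Usage: equivalent paths spelled differently do not match. */
void names_ukl_target_examples(void)
{
	assert(names_ukl_target("/UKL"));
	assert(!names_ukl_target("./UKL"));	/* relative spelling */
	assert(!names_ukl_target("//UKL"));	/* same file, different string */
	assert(!names_ukl_target("/UKL "));	/* trailing space */
}
```

Because the match is on the string rather than on the resolved file, a
launcher would have to pass the configured name verbatim.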

* [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (8 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 09/10] exec: Give userspace a method for starting UKL process Ali Raza
@ 2022-10-03 22:21 ` Ali Raza
  2022-10-04  2:11   ` Bagas Sanjaya
  2022-10-06 21:27 ` [RFC UKL 00/10] Unikernel Linux (UKL) H. Peter Anvin
  10 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-03 22:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso,
	Ali Raza

Add the Kconfig file that will enable building UKL. Documentation
introduces the technical details for how UKL works and the motivations
behind why it is useful. Sample provides a simple program that still uses
the standard system call interface, but does not require a modified C
library.

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>

Co-developed-by: Eric B Munson <munsoner@bu.edu>
Signed-off-by: Eric B Munson <munsoner@bu.edu>
Co-developed-by: Ali Raza <aliraza@bu.edu>
Signed-off-by: Ali Raza <aliraza@bu.edu>
---
 Documentation/index.rst   |   1 +
 Documentation/ukl/ukl.rst | 104 ++++++++++++++++++++++++++++++++++++++
 Kconfig                   |   2 +
 kernel/Kconfig.ukl        |  41 +++++++++++++++
 samples/ukl/Makefile      |  16 ++++++
 samples/ukl/README        |  17 +++++++
 samples/ukl/syscall.S     |  28 ++++++++++
 samples/ukl/tcp_server.c  |  99 ++++++++++++++++++++++++++++++++++++
 8 files changed, 308 insertions(+)
 create mode 100644 Documentation/ukl/ukl.rst
 create mode 100644 kernel/Kconfig.ukl
 create mode 100644 samples/ukl/Makefile
 create mode 100644 samples/ukl/README
 create mode 100644 samples/ukl/syscall.S
 create mode 100644 samples/ukl/tcp_server.c

diff --git a/Documentation/index.rst b/Documentation/index.rst
index 4737c18c97ff..42f8cb7d4cae 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -167,6 +167,7 @@ to ReStructured Text format, or are simply too old.
 
    tools/index
    staging/index
+   ukl/ukl.rst
 
 
 Translations
diff --git a/Documentation/ukl/ukl.rst b/Documentation/ukl/ukl.rst
new file mode 100644
index 000000000000..a07ebb51169e
--- /dev/null
+++ b/Documentation/ukl/ukl.rst
@@ -0,0 +1,104 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Unikernel Linux (UKL)
+=====================
+
+Unikernel Linux (UKL) is a research project aimed at integrating
+application specific optimizations to the Linux kernel. This RFC aims to
+introduce this research to the community. Any feedback regarding the idea,
+goals, implementation and research is highly appreciated.
+
+Unikernels are specialized operating systems where an application is linked
+directly with the kernel and runs in supervisor mode. This allows the
+developers to implement application specific optimizations to the kernel,
+which can be directly invoked by the application (without going through the
+syscall path). An application can control scheduling and resource
+management and directly access the hardware. Application and the kernel can
+be co-optimized, e.g., through LTO, PGO, etc. All of these optimizations,
+and others, provide applications with huge performance benefits over
+general purpose operating systems.
+
+Linux is the de-facto operating system of today. Applications depend on its
+battle tested code base, large developer community, support for legacy
+code, a huge ecosystem of tools and utilities, and a wide range of
+compatible hardware and device drivers. Linux also allows some degree of
+application specific optimizations through build time config options,
+runtime configuration, and recently through eBPF. But still, there is a
+need for even more fine-grained application specific optimizations, and
+some developers resort to kernel bypass techniques.
+
+Unikernel Linux (UKL) aims to get the best of both worlds by bringing
+application specific optimizations to the Linux ecosystem. This way,
+unmodified applications can keep getting the benefits of Linux while taking
+advantage of the unikernel-style optimizations. Optionally, applications
+can be modified to invoke deeper optimizations.
+
+There are two steps to unikernel-izing Linux, i.e., first, equip Linux with
+a unikernel model, and second, actually use that model to implement
+application specific optimizations. This patch focuses on the first part.
+Through this patch, unmodified applications can be built as Linux
+unikernels, albeit with only modest performance advantages. Like
+unikernels, UKL would allow an application to be statically linked into the
+kernel and executed in supervisor mode. However, UKL preserves most of the
+invariants and design of Linux, including a separate page-able application
+portion of the address space and a pinned kernel portion, the ability to
+run multiple processes, and distinct execution modes for application and
+kernel code. Kernel execution mode and application execution mode are
+different, e.g., the application execution mode allows application threads
+to be scheduled, handle signals, etc., which do not apply to kernel
+threads. Application built as a Linux unikernel will have its text and data
+loaded with the kernel at boot time, while the rest of the address space
+would remain unchanged. These applications invoke the system call
+functionality through a function call into the kernel system call entry
+point instead of through the syscall assembly instruction. UKL would
+support a normal userspace so the UKL application can be started, managed,
+profiled, etc., using normal command line utilities.
+
+Once Linux has a unikernel model, different application specific
+optimizations are possible. We have tried a few, e.g., fast system call
+transitions, shared stacks to allow LTO, invoking kernel functions
+directly, etc. We have seen huge performance benefits, details of which are
+not relevant to this patch and can be found in our paper.
+(https://arxiv.org/pdf/2206.00789.pdf)
+
+UKL differs significantly from previous projects, e.g., UML, KML and LKL.
+User Mode Linux (UML) is a virtual machine monitor implemented on the syscall
+interface, a very different goal from UKL. Kernel Mode Linux (KML) allows
+applications to run in kernel mode and replaces syscalls with function
+calls. While KML stops there, UKL goes further. UKL links applications and
+kernel together, which allows further optimizations, e.g., fast system call
+transitions, shared stacks to allow LTO, invoking kernel functions directly,
+etc. Details can be found in the paper linked above. Linux Kernel Library
+(LKL) harvests arch-independent code from Linux and takes it to userspace as a
+library to be linked with applications. A host needs to provide
+arch-dependent functionality. This model is very different from UKL. A
+detailed discussion of related work is present in the paper linked above.
+
+See samples/ukl for a simple TCP echo server example which can be built as
+a normal user space application and also as a UKL application. In the Linux
+config options, a path to the compiled and partially linked application
+binary can be specified. A kernel built with UKL enabled will search this
+location for the binary and link it with the kernel. Applications and
+required libraries need to be compiled with the -mno-red-zone and
+-mcmodel=kernel flags: kernel mode execution can trample on application red
+zones, and an application needs the kernel memory model in order to link
+with the kernel and be loaded in the high end of the address space.
+Examples of other applications like Redis and Memcached, along with glibc,
+libgcc, etc., can be found at https://github.com/unikernelLinux/ukl
+
+List of authors and contributors:
+=================================
+
+Ali Raza - aliraza@bu.edu
+Thomas Unger - tommyu@bu.edu
+Matthew Boyd - mboydmcse@gmail.com
+Eric Munson - munsoner@bu.edu
+Parul Sohal - psohal@bu.edu
+Ulrich Drepper - drepper@redhat.com
+Richard Jones - rjones@redhat.com
+Daniel Bristot de Oliveira - bristot@kernel.org
+Larry Woodman - lwoodman@redhat.com
+Renato Mancuso - rmancuso@bu.edu
+Jonathan Appavoo - jappavoo@bu.edu
+Orran Krieger - okrieg@bu.edu
+
diff --git a/Kconfig b/Kconfig
index 745bc773f567..2a4594ae472c 100644
--- a/Kconfig
+++ b/Kconfig
@@ -29,4 +29,6 @@ source "lib/Kconfig"
 
 source "lib/Kconfig.debug"
 
+source "kernel/Kconfig.ukl"
+
 source "Documentation/Kconfig"
diff --git a/kernel/Kconfig.ukl b/kernel/Kconfig.ukl
new file mode 100644
index 000000000000..c2c5e1003605
--- /dev/null
+++ b/kernel/Kconfig.ukl
@@ -0,0 +1,41 @@
+menuconfig UNIKERNEL_LINUX
+	bool "Unikernel Linux"
+	depends on X86_64 && !RANDOMIZE_BASE && !PAGE_TABLE_ISOLATION
+	help
+	    Unikernel Linux allows for a single, privileged process to be
+	    linked with the kernel binary and be executed in place of or
+	    alongside a more traditional user space.
+
+	    If you don't know what this is, say N.
+
+config UKL_TLS
+	bool "Enable TLS for UKL application"
+	depends on UNIKERNEL_LINUX
+	default y
+	help
+	    Not all applications will make use of thread local storage,
+	    but we need to account for it in the linker script if used.
+	    For the application in samples/ this should be disabled, but
+	    if you are working with glibc this should be 'Y'.
+
+	    If unsure say 'Y' here
+
+config UKL_NAME
+	string "UKL Exec target"
+	depends on UNIKERNEL_LINUX
+	default "/UKL"
+	help
+	    We need a way to trigger the start of the UKL application,
+	    either by the kernel in place of init or by userspace when setup
+	    is finished. The value given here is compared against the
+	    filename passed to exec and if they match UKL is started.
+	    For a more 'traditional' unikernel model, the value set here
+	    should be given to the init= boot parameter.
+
+config UKL_ARCHIVE_PATH
+	string "Path to static application archive"
+	depends on UNIKERNEL_LINUX
+	default "../UKL.a"
+	help
+	    Where the linker should look for the statically linked application
+	    and dependency archive.
diff --git a/samples/ukl/Makefile b/samples/ukl/Makefile
new file mode 100644
index 000000000000..93beb7750d4b
--- /dev/null
+++ b/samples/ukl/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -I usr/include -fno-PIC -mno-red-zone -mcmodel=kernel
+
+UKL.a: tcp_server.o syscall.o userspace
+	$(AR) cr UKL.a tcp_server.o syscall.o
+	objcopy --prefix-symbols=ukl_ UKL.a
+
+tcp_server.o: tcp_server.c
+syscall.o: syscall.S
+
+userspace:
+	gcc -o tcp_server tcp_server.c
+
+clean:
+	rm -f UKL.a tcp_server.o syscall.o tcp_server
diff --git a/samples/ukl/README b/samples/ukl/README
new file mode 100644
index 000000000000..fbb771da033a
--- /dev/null
+++ b/samples/ukl/README
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+UKL test program
+================
+
+tcp_server.c is an epoll-based TCP echo server written in C which uses port
+no. 5555 by default. syscall.S translates the syscall() function to a call
+instruction in assembly. Normally, C libraries provide a syscall() function
+that translates into the syscall assembly instruction. Run `make` and it
+will create a UKL.a and a tcp_server. UKL.a can then be copied to where the
+UKL Linux build expects it to be present. This can be changed through the
+Linux config options (by running `make menuconfig` etc.). The resulting
+Linux kernel can be run, and once the userspace comes up, the echo server
+can be started by running the UKL exec command, again chosen through the
+Linux config options. tcp_server is a userspace binary of the same echo
+server which can be run normally. This is meant to show that UKL can run
+code which can also be run as a userspace binary without modification.
diff --git a/samples/ukl/syscall.S b/samples/ukl/syscall.S
new file mode 100644
index 000000000000..95d1c177fb05
--- /dev/null
+++ b/samples/ukl/syscall.S
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+	.global _start
+_start:
+	jmp main
+
+	.global syscall
+
+/* Usage: long syscall (syscall_number, arg1, arg2, arg3, arg4, arg5, arg6)
+   We need to do some arg shifting, the syscall_number will be in
+   rax.  */
+
+	.text
+syscall:
+	movq %rdi, %rax		/* Syscall number -> rax.  */
+	movq %rsi, %rdi		/* shift arg1 - arg5.  */
+	movq %rdx, %rsi
+	movq %rcx, %rdx
+	movq %r8, %r10
+	movq %r9, %r8
+	movq 8(%rsp),%r9	/* arg6 is on the stack.  */
+	call entry_SYSCALL_64	/* Do the system call.  */
+	cmpq $-4095, %rax	/* Check %rax for error.  */
+	jae loop	/* Jump to error handler if error.  */
+	ret			/* Return to caller.  */
+
+loop:
+	jmp loop
diff --git a/samples/ukl/tcp_server.c b/samples/ukl/tcp_server.c
new file mode 100644
index 000000000000..abf1a0e2bb79
--- /dev/null
+++ b/samples/ukl/tcp_server.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <sys/epoll.h>
+#include <arpa/inet.h>
+#include <netinet/tcp.h>
+
+#define BACKLOG 512
+#define MAX_EVENTS 128
+#define MAX_MESSAGE_LEN 2048
+
+void error(char *msg);
+extern long syscall(long number, ...);
+
+int main(void)
+{
+	// some variables we need
+	struct sockaddr_in server_addr, client_addr;
+	socklen_t client_len = sizeof(client_addr);
+	int bytes_received;
+	char buffer[MAX_MESSAGE_LEN];
+	int on;
+	int result;
+	int sock_listen_fd, newsockfd;
+
+	// setup socket
+	sock_listen_fd = syscall(41, AF_INET, SOCK_STREAM, 0);
+	if (sock_listen_fd < 0)
+		error("Error creating socket..\n");
+
+	server_addr.sin_family = AF_INET;
+	server_addr.sin_port = 45845; // htons(5555), precomputed for little-endian
+	server_addr.sin_addr.s_addr = INADDR_ANY;
+
+	// set TCP NODELAY
+	on = 1;
+	result = syscall(54, sock_listen_fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
+	if (result < 0)
+		error("Can't set TCP_NODELAY to on");
+
+	// bind socket and listen for connections
+	if (syscall(49, sock_listen_fd, (struct sockaddr *)&server_addr, sizeof(server_addr)) < 0)
+		error("Error binding socket..\n");
+
+	if (syscall(50, sock_listen_fd, BACKLOG) < 0)
+		error("Error listening..\n");
+
+	struct epoll_event ev, events[MAX_EVENTS];
+	int new_events, sock_conn_fd, epollfd;
+
+	epollfd = syscall(213, MAX_EVENTS);
+	if (epollfd < 0)
+		error("Error creating epoll..\n");
+
+	ev.events = EPOLLIN;
+	ev.data.fd = sock_listen_fd;
+
+	if (syscall(233, epollfd, EPOLL_CTL_ADD, sock_listen_fd, &ev) == -1)
+		error("Error adding new listening socket to epoll..\n");
+
+	while (1) {
+		new_events = syscall(232, epollfd, events, MAX_EVENTS, -1);
+
+		if (new_events == -1)
+			error("Error in epoll_wait..\n");
+
+		for (int i = 0; i < new_events; ++i) {
+			if (events[i].data.fd == sock_listen_fd) {
+				sock_conn_fd = syscall(288, sock_listen_fd,
+						(struct sockaddr *)&client_addr,
+						&client_len, SOCK_NONBLOCK);
+				if (sock_conn_fd == -1)
+					error("Error accepting new connection..\n");
+
+				ev.events = EPOLLIN | EPOLLET;
+				ev.data.fd = sock_conn_fd;
+				if (syscall(233, epollfd, EPOLL_CTL_ADD, sock_conn_fd, &ev) == -1)
+					error("Error adding new event to epoll..\n");
+			} else {
+				newsockfd = events[i].data.fd;
+				bytes_received = syscall(45, newsockfd, buffer, MAX_MESSAGE_LEN,
+						0, NULL, NULL);
+				if (bytes_received <= 0) {
+					syscall(233, epollfd, EPOLL_CTL_DEL, newsockfd, NULL);
+					syscall(48, newsockfd, SHUT_RDWR);
+				} else {
+					syscall(44, newsockfd, buffer, bytes_received, 0, NULL, 0);
+				}
+			}
+		}
+	}
+}
+
+void error(char *msg)
+{
+	syscall(1, 1, msg, 15);
+	syscall(60, 1);
+}
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 26+ messages in thread
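
Two numeric details of the sample above are easy to miss: the README says
port 5555 while tcp_server.c stores 45845 in sin_port, and syscall.S treats
a return value as an error via "cmpq $-4095, %rax; jae". A minimal sketch of
both, assuming a little-endian host (true on x86-64); the helper names
swap16, is_syscall_error and sample_number_examples are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value, as htons() does on little-endian. */
uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Same unsigned comparison as "cmpq $-4095, %rax; jae": true exactly for
 * return values in the range [-4095, -1].
 */
int is_syscall_error(long ret)
{
	return (unsigned long)ret >= (unsigned long)-4095L;
}

void sample_number_examples(void)
{
	assert(swap16(5555) == 45845);	/* 0x15B3 -> 0xB315 */
	assert(is_syscall_error(-1));
	assert(is_syscall_error(-4095));
	assert(!is_syscall_error(-4096));
	assert(!is_syscall_error(0));
	assert(!is_syscall_error(3));	/* e.g. a valid file descriptor */
}
```

So 45845 is port 5555 in network byte order, and any syscall return outside
[-4095, -1] is passed back to the caller as a normal result.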

* Re: [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL
  2022-10-03 22:21 ` [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL Ali Raza
@ 2022-10-04  2:11   ` Bagas Sanjaya
  2022-10-06 21:28     ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Bagas Sanjaya @ 2022-10-04  2:11 UTC (permalink / raw)
  To: Ali Raza, linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso

On 10/4/22 05:21, Ali Raza wrote:
> Add the Kconfig file that will enable building UKL. Documentation
> introduces the technical details for how UKL works and the motivations
> behind why it is useful. Sample provides a simple program that still uses
> the standard system call interface, but does not require a modified C
> library.
> 
<snipped>
>  Documentation/index.rst   |   1 +
>  Documentation/ukl/ukl.rst | 104 ++++++++++++++++++++++++++++++++++++++
>  Kconfig                   |   2 +
>  kernel/Kconfig.ukl        |  41 +++++++++++++++
>  samples/ukl/Makefile      |  16 ++++++
>  samples/ukl/README        |  17 +++++++
>  samples/ukl/syscall.S     |  28 ++++++++++
>  samples/ukl/tcp_server.c  |  99 ++++++++++++++++++++++++++++++++++++
>  8 files changed, 308 insertions(+)
>  create mode 100644 Documentation/ukl/ukl.rst
>  create mode 100644 kernel/Kconfig.ukl
>  create mode 100644 samples/ukl/Makefile
>  create mode 100644 samples/ukl/README
>  create mode 100644 samples/ukl/syscall.S
>  create mode 100644 samples/ukl/tcp_server.c

Shouldn't the documentation be split into its own patch?

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs
  2022-10-03 22:21 ` [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs Ali Raza
@ 2022-10-04 17:30   ` Andy Lutomirski
  2022-10-06 21:00     ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2022-10-04 17:30 UTC (permalink / raw)
  To: Ali Raza, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso

On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> The kernel normally skips loading this segment as it is not included in
> standard builds. However, when linked with an application in the Unikernel
> configuration the segment will be present. Load PT_TLS when configured as a
> unikernel.
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Masahiro Yamada <masahiroy@kernel.org>
> Cc: Michal Marek <michal.lkml@markovi.net>
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Signed-off-by: Ali Raza <aliraza@bu.edu>
> ---
>  arch/x86/boot/compressed/misc.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index cf690d8712f4..0d07b5661c9c 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -310,6 +310,9 @@ static void parse_elf(void *output)
>  		phdr = &phdrs[i];
> 
>  		switch (phdr->p_type) {
> +#ifdef CONFIG_UNIKERNEL_LINUX
> +		case PT_TLS:
> +#endif

Can you explain why exactly a Linux boot image would have a TLS segment?  What does it do?

>  		case PT_LOAD:
>  #ifdef CONFIG_X86_64
>  			if ((phdr->p_align % 0x200000) != 0)
> -- 
> 2.21.3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame
  2022-10-03 22:21 ` [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame Ali Raza
@ 2022-10-04 17:34   ` Andy Lutomirski
  2022-10-06 21:20     ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2022-10-04 17:34 UTC (permalink / raw)
  To: Ali Raza, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso



On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> For a UKL thread, returning to a signal handler is not done with iret or
> sysret.  This means we need to adjust the way the return stack frame is
> handled for these threads.  When constructing the signal frame, we leave
> the previous frame in place because we will return to it from the signal
> handler.  We also leave space for pushing eflags and the return address.
> UKL threads will only use the __KERNEL_DS value in the ss register and 0xC3
> in the cs register.

This is unclear.  Are you talking about returning from the kernel fault code *to* the signal handler or are you talking about returning *from* the user signal handler to the user code that was running when the signal happened?

In any case, I don't see what this has to do with iret or sysret.  Surely UKL can use a sigreturn() just like regular Linux.

The part where a UKL thread has permission to return to a CPL0 context should be a separate patch.

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 09/10] exec: Give userspace a method for starting UKL process
  2022-10-03 22:21 ` [RFC UKL 09/10] exec: Give userspace a method for starting UKL process Ali Raza
@ 2022-10-04 17:35   ` Andy Lutomirski
  2022-10-06 21:25     ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2022-10-04 17:35 UTC (permalink / raw)
  To: Ali Raza, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso

On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> From: Eric B Munson <munsoner@bu.edu>
>
> The UKL process might depend on setup that is to be done by user space
> prior to its initialization.  We need a way to let userspace signal that it
> is ready for the UKL process to run. We will have set up a special name for
> this process in the kernel config and if this name is passed to exec that
> will start the UKL process. This way, if user space setup is required we
> can be sure that the process doesn't run until explicitly started.

This is just bizarre IMO.  Why is there one single UKL process?

How about having a way to start a UKL process and then, if desired, start *another* UKL process?  (And obviously there would be a security mode in which only a UKL process that is actually part of the kernel image can run or that only a UKL process with a hash that's part of the kernel image can run.)

--Andy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware
  2022-10-03 22:21 ` [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware Ali Raza
@ 2022-10-04 17:36   ` Andy Lutomirski
  2022-10-06 21:16     ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2022-10-04 17:36 UTC (permalink / raw)
  To: Ali Raza, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso



On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> When configured for UKL, access_ok needs to account for the unified address
> space that is used by the kernel and the process being run. To do this,
> they need to check the task struct field added earlier to determine where
> the execution that is making the check is running. For a zero value, the
> normal boundary definitions apply, but non-zero value indicates a UKL
> thread and a shared address space should be assumed.

I think this is just wrong.  Why should a UKL process be able to read() to kernel (high-half) memory?

set_fs() is gone.  Please keep it gone.

>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Masahiro Yamada <masahiroy@kernel.org>
> Cc: Michal Marek <michal.lkml@markovi.net>
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Josh Poimboeuf <jpoimboe@kernel.org>
>
> Signed-off-by: Ali Raza <aliraza@bu.edu>
> ---
>  arch/x86/include/asm/uaccess.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 913e593a3b45..adef521b2e59 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -37,11 +37,19 @@ static inline bool pagefault_disabled(void);
>   * Return: true (nonzero) if the memory block may be valid, false (zero)
>   * if it is definitely invalid.
>   */
> +#ifdef CONFIG_UNIKERNEL_LINUX
> +#define access_ok(addr, size)					\
> +({									\
> +	WARN_ON_IN_IRQ();						\
> +	(is_ukl_thread() ? 1 : likely(__access_ok(addr, size)));	\
> +})
> +#else
>  #define access_ok(addr, size)					\
>  ({									\
>  	WARN_ON_IN_IRQ();						\
>  	likely(__access_ok(addr, size));				\
>  })
> +#endif
> 
>  #include <asm-generic/access_ok.h>
> 
> -- 
> 2.21.3

^ permalink raw reply	[flat|nested] 26+ messages in thread
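
The effect of the access_ok() change discussed above can be shown with a
small user-space model. This is only a sketch: MODEL_TASK_SIZE_MAX assumes
the 4-level paging user limit of (1 << 47) minus one 4 KiB page, the range
test is written in the style of the generic __access_ok(), and every
model_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: user address limit with 4-level paging and 4 KiB pages. */
#define MODEL_TASK_SIZE_MAX ((UINT64_C(1) << 47) - 4096)

/* Range check: the whole block [addr, addr + size) must lie below the limit. */
bool model_range_ok(uint64_t addr, uint64_t size)
{
	return size <= MODEL_TASK_SIZE_MAX &&
	       addr <= MODEL_TASK_SIZE_MAX - size;
}

/* Model of the patched macro: a UKL thread skips the range check entirely. */
bool model_access_ok(uint64_t addr, uint64_t size, bool is_ukl_thread)
{
	return is_ukl_thread ? true : model_range_ok(addr, size);
}

void model_access_ok_examples(void)
{
	const uint64_t user_addr = UINT64_C(0x00007f0000000000);
	const uint64_t kernel_addr = UINT64_C(0xffffffff81000000);

	assert(model_access_ok(user_addr, 4096, false));
	assert(!model_access_ok(kernel_addr, 4096, false));
	/* With the bypass, a high-half address is accepted. */
	assert(model_access_ok(kernel_addr, 4096, true));
}
```

In this model a UKL thread passes the check for any address, including
high-half kernel addresses, which is the behavior the reply objects to.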

* Re: [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls
  2022-10-03 22:21 ` [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls Ali Raza
@ 2022-10-04 17:43   ` Andy Lutomirski
  2022-10-06 21:12     ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Andy Lutomirski @ 2022-10-04 17:43 UTC (permalink / raw)
  To: Ali Raza, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso, Daniel Bristot de Oliveira



On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> If a UKL application makes a system call, it won't go through with the
> syscall assembly instruction. Instead, the application will use the call
> instruction to go to the kernel entry point. Instead of adding checks to
> the normal entry_SYSCALL_64 to see if we came here from a UKL task or a
> normal application task, we create a totally new entry point called
> ukl_entry_SYSCALL_64. This allows the normal entry point to be unchanged
> and simplifies the UKL specific code as well.
>
> ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64 except that it has to
> populate %rcx with return address manually (syscall instruction does that
> automatically for normal application tasks). This allows the pt_regs to be
> correct. Also, we have to push the flags onto the user stack, because on
> the return path, we first switch to user stack, then pop the flags and then
> return. Popping the flags would restart interrupts, so we don't want to be
> stuck on kernel stack when an interrupt hits. All this can be done with an
> iret instruction, but a call/iret pair performs way slower than a call/ret
> pair.
>
> Also, on the entry path, we make sure the context flag i.e., in_user is set
> to 1 to indicate we are now in kernel context so any new interrupts don't
> have to go through kernel entry code again. This is normally done with the
> CS value on stack, but in UKL case that will always be a kernel value. On
> the way back, the in_user is switched back to 2 to indicate that now
> application context is being entered. All non-UKL tasks have the in_user
> value set to 0.


>
> The UKL application uses a slightly different value for CS, instead of
> 0x33, we use 0xC3. As most of the tests compare only the least significant
> nibble, they behave as expected. The C value in the second nibble allows us
> to distinguish between user space and UKL application code.

My intuition would be to try this the other way around.  Use an actual honest CS (specifically _KERNEL_CS) for pt_regs->cs.  Translate at the user ABI boundary instead.  After all, a UKL task is essentially just a kernel thread that happens to have a pt_regs area.


>
> Rest of the code makes sure the above mentioned in_user context tracking is
> done for all entry and exit cases i.e., for interrupts, exceptions etc.  If
> it's a UKL task, if in_user value is 2, we treat it as an application task,
> and if it is 1, we treat it as coming from kernel context. We skip these
> checks if in_user is 0.

By "context tracking" are you referring to RCU?  Since a UKL task is essentially a kernel thread, what "entry" is there other than setting up pt_regs?

>
> swapgs_restore_regs_and_return_to_usermode changes also make sure that
> in_user is correct and then we iret back.
>
> Double fault handling is special case. Normally, if a user stack suffers a
> page fault, hardware switches to a kernel stack and pushes a frame onto the
> kernel stack. This switch only happens if the execution was in user
> privilege level when the page fault occurred. For UKL, execution is always
> in kernel level, so when the user stack suffers a page fault, no switch to
> a pinned kernel stack happens, and hardware tries to push state on the
> already faulting user stack. This generates a double fault. So we handle
> this case in the double fault handler by assuming any double fault is
> actually a user stack page fault. This can also be fixed by making all page
> faults go through a pinned stack using the IST mechanism. We have tried and
> tested that, but in the interest of touching as little code as possible, we
> chose this option instead.

Eww.  I guess this is a real problem, but eww.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs
  2022-10-04 17:30   ` Andy Lutomirski
@ 2022-10-06 21:00     ` Ali Raza
  0 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-06 21:00 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso

On 10/4/22 13:30, Andy Lutomirski wrote:
> On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
>> The kernel normally skips loading this segment as it is not included in
>> standard builds. However, when linked with an application in the Unikernel
>> configuration the segment will be present. Load PT_TLS when configured as a
>> unikernel.
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Masahiro Yamada <masahiroy@kernel.org>
>> Cc: Michal Marek <michal.lkml@markovi.net>
>> Cc: Nick Desaulniers <ndesaulniers@google.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Josh Poimboeuf <jpoimboe@kernel.org>
>>
>> Signed-off-by: Ali Raza <aliraza@bu.edu>
>> ---
>>  arch/x86/boot/compressed/misc.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
>> index cf690d8712f4..0d07b5661c9c 100644
>> --- a/arch/x86/boot/compressed/misc.c
>> +++ b/arch/x86/boot/compressed/misc.c
>> @@ -310,6 +310,9 @@ static void parse_elf(void *output)
>>  		phdr = &phdrs[i];
>>
>>  		switch (phdr->p_type) {
>> +#ifdef CONFIG_UNIKERNEL_LINUX
>> +		case PT_TLS:
>> +#endif
> 
> Can you explain why exactly a Linux boot image would have a TLS segment?  What does it do?

Thank you for taking the time to review the patch. 

A UKL boot image will have a TLS segment if the application has one, or is
linked with glibc, and the resulting binary is then linked with the
kernel. This allows applications that depend on TLS to function
without modification in the UKL setting.

That is why the first patch in this series adds a TLS section to the
kernel linker script. Also, if you use an application binary that does
not have a TLS section (like the one given with this patchset in
samples/ukl), you can turn it off through the CONFIG_UKL_TLS option.
The TLS section would then have size zero, and this code would
effectively not load anything.
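
To illustrate the idea outside the boot code, here is a minimal user-space
sketch, not the patch itself. It walks an array of ELF program headers and
treats PT_TLS like PT_LOAD only when a unikernel switch is on. The function
name and the boolean parameter are hypothetical stand-ins for the
CONFIG_UNIKERNEL_LINUX build-time option.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the program headers whose bytes would be
 * copied into place. "unikernel" stands in for CONFIG_UNIKERNEL_LINUX.
 */
static size_t sketch_count_loaded(const Elf64_Phdr *phdrs, size_t n,
				  bool unikernel)
{
	size_t loaded = 0;

	for (size_t i = 0; i < n; i++) {
		const Elf64_Phdr *phdr = &phdrs[i];
		bool wanted = phdr->p_type == PT_LOAD ||
			      (unikernel && phdr->p_type == PT_TLS);

		/* A zero-sized segment is accepted but contributes no bytes. */
		if (wanted && phdr->p_filesz > 0)
			loaded++;
	}
	return loaded;
}
```

With the switch off, a PT_TLS header is ignored as in a standard build. With
it on, an empty TLS segment still results in nothing being loaded.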

> 
>>  		case PT_LOAD:
>>  #ifdef CONFIG_X86_64
>>  			if ((phdr->p_align % 0x200000) != 0)
>> -- 
>> 2.21.3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls
  2022-10-04 17:43   ` Andy Lutomirski
@ 2022-10-06 21:12     ` Ali Raza
  0 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-06 21:12 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso, Daniel Bristot de Oliveira

On 10/4/22 13:43, Andy Lutomirski wrote:
> 
> 
> On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
>> If a UKL application makes a system call, it won't go through with the
>> syscall assembly instruction. Instead, the application will use the call
>> instruction to go to the kernel entry point. Instead of adding checks to
>> the normal entry_SYSCALL_64 to see if we came here from a UKL task or a
>> normal application task, we create a totally new entry point called
>> ukl_entry_SYSCALL_64. This allows the normal entry point to be unchanged
>> and simplifies the UKL specific code as well.
>>
>> ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64 except that it has to
>> populate %rcx with return address manually (syscall instruction does that
>> automatically for normal application tasks). This allows the pt_regs to be
>> correct. Also, we have to push the flags onto the user stack, because on
>> the return path, we first switch to user stack, then pop the flags and then
>> return. Popping the flags would restart interrupts, so we don't want to be
>> stuck on kernel stack when an interrupt hits. All this can be done with an
>> iret instruction, but a call/iret pair performs way slower than a call/ret
>> pair.
>>
>> Also, on the entry path, we make sure the context flag i.e., in_user is set
>> to 1 to indicate we are now in kernel context so any new interrupts don't
>> have to go through kernel entry code again. This is normally done with the
>> CS value on stack, but in UKL case that will always be a kernel value. On
>> the way back, the in_user is switched back to 2 to indicate that now
>> application context is being entered. All non-UKL tasks have the in_user
>> value set to 0.
> 
> 
>>
>> The UKL application uses a slightly different value for CS, instead of
>> 0x33, we use 0xC3. As most of the tests compare only the least significant
>> nibble, they behave as expected. The C value in the second nibble allows us
>> to distinguish between user space and UKL application code.
> 
> My intuition would be to try this the other way around.  Use an actual honest CS (specifically _KERNEL_CS) for pt_regs->cs.  Translate at the user ABI boundary instead.  After all, a UKL task is essentially just a kernel thread that happens to have a pt_regs area.

Yes I agree, we can use _KERNEL_CS for UKL threads and then
differentiate between kernel and UKL threads based on a call to
is_ukl_thread. Thank you for pointing that out.
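
As a rough user-space illustration (not kernel code), the sketch below
compares the two schemes. It assumes the selector values 0x33 and 0xC3 from
the patch description and 0x10 for the x86-64 kernel code selector. All
names are hypothetical. With an honest kernel CS, the low-nibble test reports
"kernel" for a UKL thread, so a separate predicate such as is_ukl_thread has
to carry the distinction.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_USER_CS   0x33u	/* value named in the patch description */
#define SKETCH_UKL_CS    0xC3u	/* value named in the patch description */
#define SKETCH_KERNEL_CS 0x10u	/* assumed x86-64 kernel code selector */

/* Models a test that compares only the least significant nibble of CS. */
static bool sketch_cs_looks_like_user(uint16_t cs)
{
	return (cs & 0x0Fu) == 0x03u;
}

/* Models the RFC's marker: 0xC in the second nibble means UKL code. */
static bool sketch_cs_has_ukl_marker(uint16_t cs)
{
	return (cs & 0xF0u) == 0xC0u;
}
```

Here 0x33 and 0xC3 both pass the nibble test, while 0x10 passes neither
check, which is why the translation would move to the user ABI boundary.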

> 
> 
>>
>> Rest of the code makes sure the above mentioned in_user context tracking is
>> done for all entry and exit cases i.e., for interrupts, exceptions etc.  If
>> it's a UKL task, if in_user value is 2, we treat it as an application task,
>> and if it is 1, we treat it as coming from kernel context. We skip these
>> checks if in_user is 0.
> 
> By "context tracking" are you referring to RCU?  Since a UKL task is essentially a kernel thread, what "entry" is there other than setting up pt_regs?

Yes, a UKL thread is a kernel thread in that it always executes in
kernel mode. But it also differs from a kernel thread in that it
executes application code as well. Application code requires scheduling,
signal handling, etc. to work. RCU work needs to be done as well. So the
entry from application code, be it for system calls (without the syscall
instruction), exceptions, interrupts, etc., would involve RCU context
tracking. And exit for all these paths would include everything
syscall_exit_to_user_mode does. A UKL thread interrupted while running
kernel code will be dealt with like a normal kernel thread.

Put differently, UKL is decoupling user code from user mode, and kernel
code from kernel mode. The user/kernel code is tracked through the
in_user flag in task_struct, while UKL always remains in kernel mode.
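
A small model of that flag may help. This is a user-space sketch under the
values given in the commit message (0 for non-UKL tasks, 1 for kernel
context, 2 for application context). The struct and function names are
hypothetical and do not appear in the patch.

```c
#include <stdbool.h>

enum sketch_in_user {
	SKETCH_NON_UKL = 0,	/* ordinary task: checks are skipped */
	SKETCH_UKL_KERNEL = 1,	/* UKL task running kernel code */
	SKETCH_UKL_APP = 2,	/* UKL task running application code */
};

struct sketch_task {
	int in_user;
};

/* Entry (syscall-as-call, interrupt, exception): app context becomes kernel. */
static bool sketch_enter_kernel(struct sketch_task *t)
{
	if (t->in_user != SKETCH_UKL_APP)
		return false;	/* already in kernel context, or not a UKL task */
	t->in_user = SKETCH_UKL_KERNEL;
	return true;		/* caller would do the entry work */
}

/* Return path: kernel context becomes app context again. */
static void sketch_exit_to_app(struct sketch_task *t)
{
	if (t->in_user == SKETCH_UKL_KERNEL)
		t->in_user = SKETCH_UKL_APP;
}
```

An interrupt that arrives while the flag is 1 returns false from the entry
helper, which corresponds to treating the thread like a normal kernel thread.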

> 
>>
>> swapgs_restore_regs_and_return_to_usermode changes also make sure that
>> in_user is correct and then we iret back.
>>
>> Double fault handling is special case. Normally, if a user stack suffers a
>> page fault, hardware switches to a kernel stack and pushes a frame onto the
>> kernel stack. This switch only happens if the execution was in user
>> privilege level when the page fault occurred. For UKL, execution is always
>> in kernel level, so when the user stack suffers a page fault, no switch to
>> a pinned kernel stack happens, and hardware tries to push state on the
>> already faulting user stack. This generates a double fault. So we handle
>> this case in the double fault handler by assuming any double fault is
>> actually a user stack page fault. This can also be fixed by making all page
>> faults go through a pinned stack using the IST mechanism. We have tried and
>> tested that, but in the interest of touching as little code as possible, we
>> chose this option instead.
> 
> Eww.  I guess this is a real problem, but eww.

Yes, I agree.

What might make it less eww would be using the IST mechanism. That would
include setting up a separate stack for all page faults so that we are
guaranteed a fresh stack by hardware every time a page fault occurs.
That would modify the normal path for non-UKL page faults as well, and
also touch more code (IDT setup, some boot code, etc.). But we
have implemented and tested it on our end, and would be happy to share
that code as well.

> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware
  2022-10-04 17:36   ` Andy Lutomirski
@ 2022-10-06 21:16     ` Ali Raza
  0 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-06 21:16 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso

On 10/4/22 13:36, Andy Lutomirski wrote:
> 
> 
> On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
>> When configured for UKL, access_ok needs to account for the unified address
>> space that is used by the kernel and the process being run. To do this,
>> they need to check the task struct field added earlier to determine where
>> the execution that is making the check is running. For a zero value, the
>> normal boundary definitions apply, but non-zero value indicates a UKL
>> thread and a shared address space should be assumed.
> 
> I think this is just wrong.  Why should a UKL process be able to read() to kernel (high-half) memory?
> 
> set_fs() is gone.  Please keep it gone.

UKL needs access to kernel memory because the UKL application is linked
with the kernel, so its data lives along with kernel data in the kernel
half of memory. So anything that checks whether a user
pointer indeed lives in the user part of memory would fail. For example,
anything which invokes copy_to_user or copy_from_user would involve a
call to access_ok. This would fail because the UKL user pointer will
have a kernel address.
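
To show the failure concretely, here is a user-space model of the check, not
the kernel macro. The limit is an assumed boundary for a 47-bit user address
space, and the function names are hypothetical. The second function mirrors
the shape of the proposed macro: a UKL thread bypasses the range check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed top of the user half for a 47-bit user address space. */
#define SKETCH_TASK_SIZE_MAX 0x00007ffffffff000ull

/* Range check in the style of a generic access_ok: [addr, addr+size) <= limit. */
static bool sketch_range_ok(uint64_t addr, uint64_t size)
{
	return size <= SKETCH_TASK_SIZE_MAX &&
	       addr <= SKETCH_TASK_SIZE_MAX - size;
}

/* Shape of the proposed change: UKL threads always pass. */
static bool sketch_access_ok(bool is_ukl_thread, uint64_t addr, uint64_t size)
{
	return is_ukl_thread ? true : sketch_range_ok(addr, size);
}
```

A buffer at a high-half address such as 0xffffffff81000000 fails the plain
range check, so copy_to_user would refuse it unless the UKL case is handled.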

> 
>>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Masahiro Yamada <masahiroy@kernel.org>
>> Cc: Michal Marek <michal.lkml@markovi.net>
>> Cc: Nick Desaulniers <ndesaulniers@google.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Juri Lelli <juri.lelli@redhat.com>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Ben Segall <bsegall@google.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
>> Cc: Valentin Schneider <vschneid@redhat.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Josh Poimboeuf <jpoimboe@kernel.org>
>>
>> Signed-off-by: Ali Raza <aliraza@bu.edu>
>> ---
>>  arch/x86/include/asm/uaccess.h | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
>> index 913e593a3b45..adef521b2e59 100644
>> --- a/arch/x86/include/asm/uaccess.h
>> +++ b/arch/x86/include/asm/uaccess.h
>> @@ -37,11 +37,19 @@ static inline bool pagefault_disabled(void);
>>   * Return: true (nonzero) if the memory block may be valid, false (zero)
>>   * if it is definitely invalid.
>>   */
>> +#ifdef CONFIG_UNIKERNEL_LINUX
>> +#define access_ok(addr, size)					\
>> +({									\
>> +	WARN_ON_IN_IRQ();						\
>> +	(is_ukl_thread() ? 1 : likely(__access_ok(addr, size)));	\
>> +})
>> +#else
>>  #define access_ok(addr, size)					\
>>  ({									\
>>  	WARN_ON_IN_IRQ();						\
>>  	likely(__access_ok(addr, size));				\
>>  })
>> +#endif
>>
>>  #include <asm-generic/access_ok.h>
>>
>> -- 
>> 2.21.3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame
  2022-10-04 17:34   ` Andy Lutomirski
@ 2022-10-06 21:20     ` Ali Raza
  0 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-06 21:20 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso

On 10/4/22 13:34, Andy Lutomirski wrote:
> 
> 
> On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
>> For a UKL thread, returning to a signal handler is not done with iret or
>> sysret.  This means we need to adjust the way the return stack frame is
>> handled for these threads.  When constructing the signal frame, we leave
>> the previous frame in place because we will return to it from the signal
>> handler.  We also leave space for pushing eflags and the return address.
>> UKL threads will only use the __KERNEL_DS value in the ss register and 0xC3
>> in the cs register.
> 
> This is unclear.  Are you talking about returning from the kernel fault code *to* the signal handler or are you talking about returning *from* the user signal handler to the user code that was running when the signal happened?
> 
> In any case, I don't see what this has to do with iret or sysret.  Surely UKL can use a sigreturn() just like regular Linux.
> 
> The part where a UKL thread has permission to return to a CPL0 context should be a separate patch.
> 
> --Andy

Yes, the commit message should have been clearer. 

The changes in __setup_rt_frame make sure that, for a UKL thread,
the new frame keeps the UKL-specific regs->cs and regs->ss values
and does not have them overwritten with __USER_CS and __USER_DS. This helps
create the correct iret frame in the interrupt return case where an
iret is used.

After the signal handler is invoked, user code calls sigreturn() as it
normally would. Once inside the rt_sigreturn() system call, the UKL case is
handled a little differently than normal. This is because UKL invokes
system calls as function calls, so the user stack gets a return address.
Also, UKL stores eflags on the user stack. This is used on return from
system calls in UKL, where we first switch to the user stack, then
restore flags through popfq. This restarts the interrupts so it is
important to have already switched to user stack from kernel stack. Once
flags are restored, we do a ret instead of iret.

So, in rt_sigreturn() system call, we calculate the correct UKL regs->sp
by allowing space for the flags and return address on stack. Second, in
restore_sigcontext(), we again make sure that regs->cs and regs->ss are
only updated to user values for non UKL case.

Since this patch involves both the signal handling and sigreturn cases,
yes, it can be broken into two patches.




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 09/10] exec: Give userspace a method for starting UKL process
  2022-10-04 17:35   ` Andy Lutomirski
@ 2022-10-06 21:25     ` Ali Raza
  0 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-06 21:25 UTC (permalink / raw)
  To: Andy Lutomirski, Linux Kernel Mailing List
  Cc: Jonathan Corbet, masahiroy, michal.lkml, Nick Desaulniers,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Eric W. Biederman, Kees Cook,
	Peter Zijlstra (Intel),
	Al Viro, Arnd Bergmann, juri.lelli, vincent.guittot,
	dietmar.eggemann, Steven Rostedt, Ben Segall, mgorman, bristot,
	vschneid, Paolo Bonzini, jpoimboe, linux-doc, linux-kbuild,
	linux-mm, linux-fsdevel, linux-arch, the arch/x86 maintainers,
	rjones, munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg,
	rmancuso

On 10/4/22 13:35, Andy Lutomirski wrote:
> On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
>> From: Eric B Munson <munsoner@bu.edu>
>>
>> The UKL process might depend on setup that is to be done by user space
>> prior to its initialization.  We need a way to let userspace signal that it
>> is ready for the UKL process to run. We will have setup a special name for
>> this process in the kernel config and if this name is passed to exec that
>> will start the UKL process. This way, if user space setup is required we
>> can be sure that the process doesn't run until explicitly started.
> 
> This is just bizarre IMO.  Why is there one single UKL process?
> 
> How about having a way to start a UKL process and then, if desired, start *another* UKL process?  (And obviously there would be a security mode in which only a UKL process that is actually part of the kernel image can run or that only a UKL process with a hash that's part of the kernel image can run.)
> 
> --Andy

Again, the commit message could have been worded better.

There can be two cases here. The first is where a UKL process forks or a new
UKL process is run once the first finishes. In this case, there is a single UKL
application being run multiple times. The second case is where two
different UKL applications (both linked with the kernel) are run in
different processes, concurrently or one after the other. Let's look at
both of these cases.

For case 1, there is no restriction on how many UKL processes can run.
UKL allows forking, so there can be multiple processes but they will
have to share the text and data which is loaded along with the kernel
text and data. In the future, one can borrow ideas from how glibc
handles TLS i.e., where each UKL process would copy data into its user
half of memory. But we have not designed or implemented that yet. We
have tested applications that fork/clone. We have not tested running the
same UKL process again after an earlier UKL process exits, but there is
nothing stopping that from working.

For case 2, we have not yet implemented that. But for discussion's sake,
we can have two or more mutually trusting applications, all linked with
the kernel. And if you do /UKL1 or /UKL2 (or some proper names), you
should be able to run them concurrently. Again, although much of the
plumbing for this is there, we haven't implemented it fully yet.
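
For discussion only, the name-based start could be extended to several
applications along these lines. This is a hypothetical user-space sketch:
the table, the names /UKL1 and /UKL2, and the helper are placeholders, since
the series only configures a single special name and case 2 is not
implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder names; the series sets one special name in the kernel config. */
static const char *const sketch_ukl_names[] = { "/UKL1", "/UKL2" };

/* Returns true if the path passed to exec names a linked-in UKL application. */
static bool sketch_is_ukl_exec(const char *filename)
{
	size_t n = sizeof(sketch_ukl_names) / sizeof(sketch_ukl_names[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(filename, sketch_ukl_names[i]) == 0)
			return true;
	}
	return false;
}
```

Any other path would fall through to the normal exec handling, so user space
setup can run first and start a UKL application explicitly when ready.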

Thank you again for the detailed feedback.

--Ali

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC UKL 00/10] Unikernel Linux (UKL)
  2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
                   ` (9 preceding siblings ...)
  2022-10-03 22:21 ` [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL Ali Raza
@ 2022-10-06 21:27 ` H. Peter Anvin
  10 siblings, 0 replies; 26+ messages in thread
From: H. Peter Anvin @ 2022-10-06 21:27 UTC (permalink / raw)
  To: Ali Raza, linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso

On October 3, 2022 3:21:23 PM PDT, Ali Raza <aliraza@bu.edu> wrote:
>Unikernel Linux (UKL) is a research project aimed at integrating
>application specific optimizations to the Linux kernel. This RFC aims to
>introduce this research to the community. Any feedback regarding the idea,
>goals, implementation and research is highly appreciated.
>
>Unikernels are specialized operating systems where an application is linked
>directly with the kernel and runs in supervisor mode. This allows the
>developers to implement application specific optimizations to the kernel,
>which can be directly invoked by the application (without going through the
>syscall path). An application can control scheduling and resource
>management and directly access the hardware. Application and the kernel can
>be co-optimized, e.g., through LTO, PGO, etc. All of these optimizations,
>and others, provide applications with huge performance benefits over
>general purpose operating systems.
>
>Linux is the de-facto operating system of today. Applications depend on its
>battle tested code base, large developer community, support for legacy
>code, a huge ecosystem of tools and utilities, and a wide range of
>compatible hardware and device drivers. Linux also allows some degree of
>application specific optimizations through build time config options,
>runtime configuration, and recently through eBPF. But still, there is a
>need for even more fine-grained application specific optimizations, and
>some developers resort to kernel bypass techniques.
>
>Unikernel Linux (UKL) aims to get the best of both worlds by bringing
>application specific optimizations to the Linux ecosystem. This way,
>unmodified applications can keep getting the benefits of Linux while taking
>advantage of the unikernel-style optimizations. Optionally, applications
>can be modified to invoke deeper optimizations.
>
>There are two steps to unikernel-izing Linux, i.e., first, equip Linux with
>a unikernel model, and second, actually use that model to implement
>application specific optimizations. This patch focuses on the first part.
>Through this patch, unmodified applications can be built as Linux
>unikernels, albeit with only modest performance advantages. Like
>unikernels, UKL would allow an application to be statically linked into the
>kernel and executed in supervisor mode. However, UKL preserves most of the
>invariants and design of Linux, including a separate page-able application
>portion of the address space and a pinned kernel portion, the ability to
>run multiple processes, and distinct execution modes for application and
>kernel code. Kernel execution mode and application execution mode are
>different, e.g., the application execution mode allows application threads
>to be scheduled, handle signals, etc., which do not apply to kernel
>threads. Application built as a Linux unikernel will have its text and data
>loaded with the kernel at boot time, while the rest of the address space
>would remain unchanged. These applications invoke the system call
>functionality through a function call into the kernel system call entry
>point instead of through the syscall assembly instruction. UKL would
>support a normal userspace so the UKL application can be started, managed,
>profiled, etc., using normal command line utilities.
>
>Once Linux has a unikernel model, different application specific
>optimizations are possible. We have tried a few, e.g., fast system call
>transitions, shared stacks to allow LTO, invoking kernel functions
>directly, etc. We have seen huge performance benefits, details of which are
>not relevant to this patch and can be found in our paper.
>(https://arxiv.org/pdf/2206.00789.pdf)
>
>UKL differs significantly from previous projects, e.g., UML, KML and LKL.
>User Mode Linux (UML) is a virtual machine monitor implemented on syscall
>interface, a very different goal from UKL. Kernel Mode Linux (KML) allows
>applications to run in kernel mode and replaces syscalls with function
>calls. While KML stops there, UKL goes further. UKL links applications and
>kernel together which allows further optimizations e.g., fast system call
>transitions, shared stacks to allow LTO, invoking kernel functions directly
>etc. Details can be found in the paper linked above. Linux Kernel Library
>(LKL) harvests arch independent code from Linux, takes it to userspace as a
>library to be linked with applications. A host needs to provide arch
>dependent functionality. This model is very different from UKL. A detailed
>discussion of related work is present in the paper linked above.
>
>See samples/ukl for a simple TCP echo server example which can be built as
>a normal user space application and also as a UKL application. In the Linux
>config options, a path to the compiled and partially linked application
>binary can be specified. Kernel built with UKL enabled will search this
>location for the binary and link with the kernel. Applications and required
>libraries need to be compiled with -mno-red-zone -mcmodel=kernel flags
>because kernel mode execution can trample on application red zones and in
>order to link with the kernel and be loaded in the high end of the address
>space, application should have the correct memory model. Examples of other
>applications like Redis, Memcached etc along with glibc and libgcc etc.,
>can be found at https://github.com/unikernelLinux/ukl
>
>List of authors and contributors:
>=================================
>
>Ali Raza - aliraza@bu.edu
>Thomas Unger - tommyu@bu.edu
>Matthew Boyd - mboydmcse@gmail.com
>Eric Munson - munsoner@bu.edu
>Parul Sohal - psohal@bu.edu
>Ulrich Drepper - drepper@redhat.com
>Richard W.M. Jones - rjones@redhat.com
>Daniel Bristot de Oliveira - bristot@kernel.org
>Larry Woodman - lwoodman@redhat.com
>Renato Mancuso - rmancuso@bu.edu
>Jonathan Appavoo - jappavoo@bu.edu
>Orran Krieger - okrieg@bu.edu
>
>Ali Raza (9):
>  kbuild: Add sections and symbols to linker script for UKL support
>  x86/boot: Load the PT_TLS segment for Unikernel configs
>  sched: Add task_struct tracking of kernel or application execution
>  x86/entry: Create alternate entry path for system calls
>  x86/uaccess: Make access_ok UKL aware
>  x86/fault: Skip checking kernel mode access to user address space for
>    UKL
>  x86/signal: Adjust signal handler register values and return frame
>  exec: Make exec path for starting UKL application
>  Kconfig: Add config option for enabling and sample for testing UKL
>
>Eric B Munson (1):
>  exec: Give userspace a method for starting UKL process
>
> Documentation/index.rst           |   1 +
> Documentation/ukl/ukl.rst         | 104 +++++++++++++++++++++++
> Kconfig                           |   2 +
> Makefile                          |   4 +
> arch/x86/boot/compressed/misc.c   |   3 +
> arch/x86/entry/entry_64.S         | 133 ++++++++++++++++++++++++++++++
> arch/x86/include/asm/elf.h        |   9 +-
> arch/x86/include/asm/uaccess.h    |   8 ++
> arch/x86/kernel/process.c         |  13 +++
> arch/x86/kernel/process_64.c      |  49 ++++++++---
> arch/x86/kernel/signal.c          |  22 +++--
> arch/x86/kernel/vmlinux.lds.S     |  98 ++++++++++++++++++++++
> arch/x86/mm/fault.c               |   7 +-
> fs/binfmt_elf.c                   |  28 +++++++
> fs/exec.c                         |  75 +++++++++++++----
> include/asm-generic/sections.h    |   4 +
> include/asm-generic/vmlinux.lds.h |  32 ++++++-
> include/linux/sched.h             |  26 ++++++
> kernel/Kconfig.ukl                |  41 +++++++++
> samples/ukl/Makefile              |  16 ++++
> samples/ukl/README                |  17 ++++
> samples/ukl/syscall.S             |  28 +++++++
> samples/ukl/tcp_server.c          |  99 ++++++++++++++++++++++
> scripts/mod/modpost.c             |   4 +
> 24 files changed, 785 insertions(+), 38 deletions(-)
> create mode 100644 Documentation/ukl/ukl.rst
> create mode 100644 kernel/Kconfig.ukl
> create mode 100644 samples/ukl/Makefile
> create mode 100644 samples/ukl/README
> create mode 100644 samples/ukl/syscall.S
> create mode 100644 samples/ukl/tcp_server.c
>
>
>base-commit: 4fe89d07dcc2804c8b562f6c7896a45643d34b2f

This is basically taking Linux and turning it into a whole new operating system, while expecting the Linux kernel community to carry the support burden thereof.

We have seen this before, notably with Xen. It is *expensive* and *painful* for the maintenance of the mainstream kernel.

Linux already has a notion of "kernel mode applications": they are called kernel modules and kernel threads. It seems to me that you are trying to introduce a user space compatibility layer into the kernel, with the only benefit being avoiding the syscall overhead. That overhead is bigger than we would like, which is why we are changing the x86 hardware architecture to improve it.

In my opinion, this would require *enormous* justification to put it into mainline.





* Re: [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL
  2022-10-04  2:11   ` Bagas Sanjaya
@ 2022-10-06 21:28     ` Ali Raza
  2022-10-07 10:21       ` Masahiro Yamada
  0 siblings, 1 reply; 26+ messages in thread
From: Ali Raza @ 2022-10-06 21:28 UTC (permalink / raw)
  To: Bagas Sanjaya, linux-kernel
  Cc: corbet, masahiroy, michal.lkml, ndesaulniers, tglx, mingo, bp,
	dave.hansen, hpa, luto, ebiederm, keescook, peterz, viro, arnd,
	juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, pbonzini, jpoimboe, linux-doc,
	linux-kbuild, linux-mm, linux-fsdevel, linux-arch, x86, rjones,
	munsoner, tommyu, drepper, lwoodman, mboydmcse, okrieg, rmancuso

On 10/3/22 22:11, Bagas Sanjaya wrote:
> On 10/4/22 05:21, Ali Raza wrote:
>> Add the KConfig file that will enable building UKL. Documentation
>> introduces the technical details for how UKL works and the motivations
>> behind why it is useful. Sample provides a simple program that still uses
>> the standard system call interface, but does not require a modified C
>> library.
>>
> <snipped>
>>  Documentation/index.rst   |   1 +
>>  Documentation/ukl/ukl.rst | 104 ++++++++++++++++++++++++++++++++++++++
>>  Kconfig                   |   2 +
>>  kernel/Kconfig.ukl        |  41 +++++++++++++++
>>  samples/ukl/Makefile      |  16 ++++++
>>  samples/ukl/README        |  17 +++++++
>>  samples/ukl/syscall.S     |  28 ++++++++++
>>  samples/ukl/tcp_server.c  |  99 ++++++++++++++++++++++++++++++++++++
>>  8 files changed, 308 insertions(+)
>>  create mode 100644 Documentation/ukl/ukl.rst
>>  create mode 100644 kernel/Kconfig.ukl
>>  create mode 100644 samples/ukl/Makefile
>>  create mode 100644 samples/ukl/README
>>  create mode 100644 samples/ukl/syscall.S
>>  create mode 100644 samples/ukl/tcp_server.c
> 
> Shouldn't the documentation be split into its own patch?
> 
Thanks for pointing that out.

--Ali



* Re: [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL
  2022-10-06 21:28     ` Ali Raza
@ 2022-10-07 10:21       ` Masahiro Yamada
  2022-10-13 17:08         ` Ali Raza
  0 siblings, 1 reply; 26+ messages in thread
From: Masahiro Yamada @ 2022-10-07 10:21 UTC (permalink / raw)
  To: Ali Raza
  Cc: Bagas Sanjaya, linux-kernel, corbet, michal.lkml, ndesaulniers,
	tglx, mingo, bp, dave.hansen, hpa, luto, ebiederm, keescook,
	peterz, viro, arnd, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	pbonzini, jpoimboe, linux-doc, linux-kbuild, linux-mm,
	linux-fsdevel, linux-arch, x86, rjones, munsoner, tommyu,
	drepper, lwoodman, mboydmcse, okrieg, rmancuso

On Fri, Oct 7, 2022 at 6:29 AM Ali Raza <aliraza@bu.edu> wrote:
>
> On 10/3/22 22:11, Bagas Sanjaya wrote:
> > On 10/4/22 05:21, Ali Raza wrote:
> >> Add the KConfig file that will enable building UKL. Documentation
> >> introduces the technical details for how UKL works and the motivations
> >> behind why it is useful. Sample provides a simple program that still uses
> >> the standard system call interface, but does not require a modified C
> >> library.
> >>
> > <snipped>
> >>  Documentation/index.rst   |   1 +
> >>  Documentation/ukl/ukl.rst | 104 ++++++++++++++++++++++++++++++++++++++
> >>  Kconfig                   |   2 +
> >>  kernel/Kconfig.ukl        |  41 +++++++++++++++
> >>  samples/ukl/Makefile      |  16 ++++++
> >>  samples/ukl/README        |  17 +++++++
> >>  samples/ukl/syscall.S     |  28 ++++++++++
> >>  samples/ukl/tcp_server.c  |  99 ++++++++++++++++++++++++++++++++++++
> >>  8 files changed, 308 insertions(+)
> >>  create mode 100644 Documentation/ukl/ukl.rst
> >>  create mode 100644 kernel/Kconfig.ukl
> >>  create mode 100644 samples/ukl/Makefile
> >>  create mode 100644 samples/ukl/README
> >>  create mode 100644 samples/ukl/syscall.S
> >>  create mode 100644 samples/ukl/tcp_server.c
> >
> > Shouldn't the documentation be split into its own patch?
> >
> Thanks for pointing that out.
>
> --Ali
>


The commit subject "Kconfig:" is used for changes
under scripts/kconfig/.

Please use something else.


-- 
Best Regards
Masahiro Yamada


* Re: [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL
  2022-10-07 10:21       ` Masahiro Yamada
@ 2022-10-13 17:08         ` Ali Raza
  0 siblings, 0 replies; 26+ messages in thread
From: Ali Raza @ 2022-10-13 17:08 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Bagas Sanjaya, linux-kernel, corbet, michal.lkml, ndesaulniers,
	tglx, mingo, bp, dave.hansen, hpa, luto, ebiederm, keescook,
	peterz, viro, arnd, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid,
	pbonzini, jpoimboe, linux-doc, linux-kbuild, linux-mm,
	linux-fsdevel, linux-arch, x86, rjones, munsoner, tommyu,
	drepper, lwoodman, mboydmcse, okrieg, rmancuso

On 10/7/22 06:21, Masahiro Yamada wrote:
> On Fri, Oct 7, 2022 at 6:29 AM Ali Raza <aliraza@bu.edu> wrote:
>>
>> On 10/3/22 22:11, Bagas Sanjaya wrote:
>>> On 10/4/22 05:21, Ali Raza wrote:
>>>> Add the KConfig file that will enable building UKL. Documentation
>>>> introduces the technical details for how UKL works and the motivations
>>>> behind why it is useful. Sample provides a simple program that still uses
>>>> the standard system call interface, but does not require a modified C
>>>> library.
>>>>
>>> <snipped>
>>>>  Documentation/index.rst   |   1 +
>>>>  Documentation/ukl/ukl.rst | 104 ++++++++++++++++++++++++++++++++++++++
>>>>  Kconfig                   |   2 +
>>>>  kernel/Kconfig.ukl        |  41 +++++++++++++++
>>>>  samples/ukl/Makefile      |  16 ++++++
>>>>  samples/ukl/README        |  17 +++++++
>>>>  samples/ukl/syscall.S     |  28 ++++++++++
>>>>  samples/ukl/tcp_server.c  |  99 ++++++++++++++++++++++++++++++++++++
>>>>  8 files changed, 308 insertions(+)
>>>>  create mode 100644 Documentation/ukl/ukl.rst
>>>>  create mode 100644 kernel/Kconfig.ukl
>>>>  create mode 100644 samples/ukl/Makefile
>>>>  create mode 100644 samples/ukl/README
>>>>  create mode 100644 samples/ukl/syscall.S
>>>>  create mode 100644 samples/ukl/tcp_server.c
>>>
>>> Shouldn't the documentation be split into its own patch?
>>>
>> Thanks for pointing that out.
>>
>> --Ali
>>
> 
> 
> The commit subject "Kconfig:" is used for changes
> under scripts/kconfig/.
> 
> Please use something else.
> 
> 
Will do, thank you!

--Ali


end of thread, other threads:[~2022-10-13 17:09 UTC | newest]

Thread overview: 26+ messages
2022-10-03 22:21 [RFC UKL 00/10] Unikernel Linux (UKL) Ali Raza
2022-10-03 22:21 ` [RFC UKL 01/10] kbuild: Add sections and symbols to linker script for UKL support Ali Raza
2022-10-03 22:21 ` [RFC UKL 02/10] x86/boot: Load the PT_TLS segment for Unikernel configs Ali Raza
2022-10-04 17:30   ` Andy Lutomirski
2022-10-06 21:00     ` Ali Raza
2022-10-03 22:21 ` [RFC UKL 03/10] sched: Add task_struct tracking of kernel or application execution Ali Raza
2022-10-03 22:21 ` [RFC UKL 04/10] x86/entry: Create alternate entry path for system calls Ali Raza
2022-10-04 17:43   ` Andy Lutomirski
2022-10-06 21:12     ` Ali Raza
2022-10-03 22:21 ` [RFC UKL 05/10] x86/uaccess: Make access_ok UKL aware Ali Raza
2022-10-04 17:36   ` Andy Lutomirski
2022-10-06 21:16     ` Ali Raza
2022-10-03 22:21 ` [RFC UKL 06/10] x86/fault: Skip checking kernel mode access to user address space for UKL Ali Raza
2022-10-03 22:21 ` [RFC UKL 07/10] x86/signal: Adjust signal handler register values and return frame Ali Raza
2022-10-04 17:34   ` Andy Lutomirski
2022-10-06 21:20     ` Ali Raza
2022-10-03 22:21 ` [RFC UKL 08/10] exec: Make exec path for starting UKL application Ali Raza
2022-10-03 22:21 ` [RFC UKL 09/10] exec: Give userspace a method for starting UKL process Ali Raza
2022-10-04 17:35   ` Andy Lutomirski
2022-10-06 21:25     ` Ali Raza
2022-10-03 22:21 ` [RFC UKL 10/10] Kconfig: Add config option for enabling and sample for testing UKL Ali Raza
2022-10-04  2:11   ` Bagas Sanjaya
2022-10-06 21:28     ` Ali Raza
2022-10-07 10:21       ` Masahiro Yamada
2022-10-13 17:08         ` Ali Raza
2022-10-06 21:27 ` [RFC UKL 00/10] Unikernel Linux (UKL) H. Peter Anvin
