* [PATCH 00/12] Rework stub syscall and page table handling
@ 2024-04-18  9:23 benjamin
  2024-04-18  9:23 ` [PATCH 01/12] um: Remove stub-data.h include from common-offsets.h benjamin
                   ` (12 more replies)
  0 siblings, 13 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

This patchset reworks the stub syscall handling and also redoes how
page table updates are tracked and synchronized. Some of this
originated in the SECCOMP patchset, but it became clear that these
refactorings make sense independently, as they result in considerably
fewer page faults.

As an example, these changes bring the runtime of one of the hostapd
hwsim test modules from 121 seconds down to 103 seconds. Actual results
will depend heavily on how page-fault-heavy the workload is.

The reduction in page faults comes from proactively syncing PTEs that
are set (and were previously unset), and from delaying synchronization
so that multiple updates can be done in one step rather than requiring
multiple task switches.
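
As an illustration of the delayed synchronization, callers end up
queueing updates and flushing them once. A minimal sketch, using the
map()/unmap() and syscall_stub_flush() helpers as they look at the end
of this series (mm_idp, addr, prot, fd and offset are placeholders):

	/* Queue several address space changes; nothing runs yet. */
	unmap(mm_idp, addr, PAGE_SIZE);
	map(mm_idp, addr, PAGE_SIZE, prot, fd, offset);

	/* One flush executes all queued syscalls in a single stub run. */
	err = syscall_stub_flush(mm_idp);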

This refactoring also fixes various odd corner cases in how UML was
handling memory and cloning of MMs. As part of this work, support for
LDTs has been dropped. My expectation is that this is not a problem as
it should only be needed by legacy applications.

Benjamin Berg (12):
  um: Remove stub-data.h include from common-offsets.h
  um: Create signal stack memory assignment in stub_data
  um: Add generic stub_syscall6 function
  um: Rework syscall handling
  um: compress memory related stub syscalls while adding them
  um: remove LDT support
  um: remove copy_context_skas0
  um: Delay flushing syscalls until the thread is restarted
  um: Do not flush MM in flush_thread
  um: remove force_flush_all from fork_handler
  um: simplify and consolidate TLB updates
  um: refactor TLB update handling

 arch/um/include/asm/mmu.h               |  10 +-
 arch/um/include/asm/mmu_context.h       |   2 -
 arch/um/include/asm/pgtable.h           |  32 ++
 arch/um/include/asm/tlbflush.h          |  46 +-
 arch/um/include/shared/as-layout.h      |   2 +-
 arch/um/include/shared/common-offsets.h |   5 -
 arch/um/include/shared/os.h             |  27 +-
 arch/um/include/shared/skas/mm_id.h     |   2 +-
 arch/um/include/shared/skas/skas.h      |   2 +
 arch/um/include/shared/skas/stub-data.h |  35 +-
 arch/um/include/shared/user.h           |   8 +
 arch/um/kernel/exec.c                   |   9 -
 arch/um/kernel/process.c                |   2 -
 arch/um/kernel/skas/Makefile            |   9 +-
 arch/um/kernel/skas/clone.c             |  48 ---
 arch/um/kernel/skas/mmu.c               |  33 +-
 arch/um/kernel/skas/process.c           |  18 +
 arch/um/kernel/skas/stub.c              |  69 +++
 arch/um/kernel/tlb.c                    | 551 ++++--------------------
 arch/um/kernel/trap.c                   |  15 +-
 arch/um/os-Linux/skas/mem.c             | 248 ++++++-----
 arch/um/os-Linux/skas/process.c         | 121 +-----
 arch/um/os-Linux/start_up.c             |   1 +
 arch/x86/um/Makefile                    |   4 +-
 arch/x86/um/asm/mm_context.h            |  70 ---
 arch/x86/um/ldt.c                       | 380 ----------------
 arch/x86/um/shared/sysdep/stub.h        |   1 +
 arch/x86/um/shared/sysdep/stub_32.h     |  29 +-
 arch/x86/um/shared/sysdep/stub_64.h     |  23 +-
 arch/x86/um/stub_32.S                   |  56 ---
 arch/x86/um/stub_64.S                   |  50 ---
 arch/x86/um/tls_32.c                    |   1 +
 32 files changed, 543 insertions(+), 1366 deletions(-)
 delete mode 100644 arch/um/kernel/skas/clone.c
 create mode 100644 arch/um/kernel/skas/stub.c
 delete mode 100644 arch/x86/um/asm/mm_context.h
 delete mode 100644 arch/x86/um/ldt.c
 delete mode 100644 arch/x86/um/stub_32.S
 delete mode 100644 arch/x86/um/stub_64.S

-- 
2.44.0




* [PATCH 01/12] um: Remove stub-data.h include from common-offsets.h
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 02/12] um: Create signal stack memory assignment in stub_data benjamin
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin@sipsolutions.net>

Further commits will require values from common-offsets.h inside
stub-data.h. Resolve the possible circular dependency by simply using
offsetof() inside stub_32.h and stub_64.h.
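
For illustration, this replaces generated constants such as
UML_STUB_FIELD_FD with plain offsetof() expressions. A standalone
sketch of the idea, with the struct reduced to the relevant fields:

	#include <stddef.h>

	struct stub_data {
		unsigned long offset;
		int fd;
	};

	/* An integer constant expression; no generated header needed. */
	enum { STUB_FD_OFFSET = offsetof(struct stub_data, fd) };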

Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
---
 arch/um/include/shared/common-offsets.h | 5 -----
 arch/x86/um/shared/sysdep/stub_32.h     | 7 ++++---
 arch/x86/um/shared/sysdep/stub_64.h     | 7 ++++---
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/um/include/shared/common-offsets.h b/arch/um/include/shared/common-offsets.h
index 96195483fbd0..579ed946a3a9 100644
--- a/arch/um/include/shared/common-offsets.h
+++ b/arch/um/include/shared/common-offsets.h
@@ -1,6 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /* for use by sys-$SUBARCH/kernel-offsets.c */
-#include <stub-data.h>
 
 DEFINE(KERNEL_MADV_REMOVE, MADV_REMOVE);
 
@@ -30,7 +29,3 @@ DEFINE(UML_CONFIG_64BIT, CONFIG_64BIT);
 DEFINE(UML_CONFIG_UML_TIME_TRAVEL_SUPPORT, CONFIG_UML_TIME_TRAVEL_SUPPORT);
 #endif
 
-/* for stub */
-DEFINE(UML_STUB_FIELD_OFFSET, offsetof(struct stub_data, offset));
-DEFINE(UML_STUB_FIELD_CHILD_ERR, offsetof(struct stub_data, child_err));
-DEFINE(UML_STUB_FIELD_FD, offsetof(struct stub_data, fd));
diff --git a/arch/x86/um/shared/sysdep/stub_32.h b/arch/x86/um/shared/sysdep/stub_32.h
index ea8b5a2d67af..2748b7ee031a 100644
--- a/arch/x86/um/shared/sysdep/stub_32.h
+++ b/arch/x86/um/shared/sysdep/stub_32.h
@@ -6,6 +6,7 @@
 #ifndef __SYSDEP_STUB_H
 #define __SYSDEP_STUB_H
 
+#include <stddef.h>
 #include <asm/ptrace.h>
 #include <generated/asm-offsets.h>
 
@@ -98,9 +99,9 @@ static __always_inline void remap_stack_and_trap(void)
 		: :
 		"g" (~(STUB_DATA_PAGES * UM_KERN_PAGE_SIZE - 1)),
 		"g" (STUB_MMAP_NR),
-		"g" (UML_STUB_FIELD_FD),
-		"g" (UML_STUB_FIELD_OFFSET),
-		"g" (UML_STUB_FIELD_CHILD_ERR),
+		"g" (offsetof(struct stub_data, fd)),
+		"g" (offsetof(struct stub_data, offset)),
+		"g" (offsetof(struct stub_data, child_err)),
 		"c" (STUB_DATA_PAGES * UM_KERN_PAGE_SIZE),
 		"d" (PROT_READ | PROT_WRITE),
 		"S" (MAP_FIXED | MAP_SHARED)
diff --git a/arch/x86/um/shared/sysdep/stub_64.h b/arch/x86/um/shared/sysdep/stub_64.h
index b24168ef0ac4..50c5e0529dfb 100644
--- a/arch/x86/um/shared/sysdep/stub_64.h
+++ b/arch/x86/um/shared/sysdep/stub_64.h
@@ -6,6 +6,7 @@
 #ifndef __SYSDEP_STUB_H
 #define __SYSDEP_STUB_H
 
+#include <stddef.h>
 #include <sysdep/ptrace_user.h>
 #include <generated/asm-offsets.h>
 #include <linux/stddef.h>
@@ -101,9 +102,9 @@ static __always_inline void remap_stack_and_trap(void)
 		"g" (STUB_MMAP_NR),
 		"g" (~(STUB_DATA_PAGES * UM_KERN_PAGE_SIZE - 1)),
 		"g" (MAP_FIXED | MAP_SHARED),
-		"g" (UML_STUB_FIELD_FD),
-		"g" (UML_STUB_FIELD_OFFSET),
-		"g" (UML_STUB_FIELD_CHILD_ERR),
+		"g" (offsetof(struct stub_data, fd)),
+		"g" (offsetof(struct stub_data, offset)),
+		"g" (offsetof(struct stub_data, child_err)),
 		"S" (STUB_DATA_PAGES * UM_KERN_PAGE_SIZE),
 		"d" (PROT_READ | PROT_WRITE)
 		:
-- 
2.44.0




* [PATCH 02/12] um: Create signal stack memory assignment in stub_data
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
  2024-04-18  9:23 ` [PATCH 01/12] um: Remove stub-data.h include from common-offsets.h benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 03/12] um: Add generic stub_syscall6 function benjamin
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin@sipsolutions.net>

When we switch to using seccomp, we need both the signal stack and
other data (i.e. syscall information) to coexist in the stub data. To
facilitate this, start by defining separate memory areas for the stack
and the syscall data.

This moves the signal stack onto a new page, as a single page is not
sufficient to hold both the signal stack and the syscall information.

Only change the signal stack setup for now, as the syscall code will be
reworked later.
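
The resulting two-page layout (the static_assert added to mmu.c below
enforces the total size) is roughly:

	page 0: the offset/fd/parent_err/child_err header fields,
	        followed by syscall_data[] filling the rest of the page
	page 1: sigstack[], a full page-aligned page of its own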

Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
---
 arch/um/include/shared/as-layout.h      |  2 +-
 arch/um/include/shared/skas/stub-data.h |  9 +++++++++
 arch/um/kernel/skas/clone.c             |  6 ++++--
 arch/um/kernel/skas/mmu.c               |  4 ++++
 arch/um/os-Linux/skas/process.c         | 11 ++++++-----
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/um/include/shared/as-layout.h b/arch/um/include/shared/as-layout.h
index 9ec3015bc5e2..4ef98c0339fa 100644
--- a/arch/um/include/shared/as-layout.h
+++ b/arch/um/include/shared/as-layout.h
@@ -23,7 +23,7 @@
 #define STUB_START stub_start
 #define STUB_CODE STUB_START
 #define STUB_DATA (STUB_CODE + UM_KERN_PAGE_SIZE)
-#define STUB_DATA_PAGES 1 /* must be a power of two */
+#define STUB_DATA_PAGES 2 /* must be a power of two */
 #define STUB_END (STUB_DATA + STUB_DATA_PAGES * UM_KERN_PAGE_SIZE)
 
 #ifndef __ASSEMBLY__
diff --git a/arch/um/include/shared/skas/stub-data.h b/arch/um/include/shared/skas/stub-data.h
index 5e3ade3fb38b..779d2a3bac5d 100644
--- a/arch/um/include/shared/skas/stub-data.h
+++ b/arch/um/include/shared/skas/stub-data.h
@@ -8,10 +8,19 @@
 #ifndef __STUB_DATA_H
 #define __STUB_DATA_H
 
+#include <linux/compiler_types.h>
+#include <as-layout.h>
+
 struct stub_data {
 	unsigned long offset;
 	int fd;
 	long parent_err, child_err;
+
+	/* 128 leaves enough room for additional fields in the struct */
+	unsigned char syscall_data[UM_KERN_PAGE_SIZE - 128] __aligned(16);
+
+	/* Stack for our signal handlers and for calling into . */
+	unsigned char sigstack[UM_KERN_PAGE_SIZE] __aligned(UM_KERN_PAGE_SIZE);
 };
 
 #endif
diff --git a/arch/um/kernel/skas/clone.c b/arch/um/kernel/skas/clone.c
index 62435187dda4..906f7454887c 100644
--- a/arch/um/kernel/skas/clone.c
+++ b/arch/um/kernel/skas/clone.c
@@ -27,9 +27,11 @@ stub_clone_handler(void)
 	struct stub_data *data = get_stub_data();
 	long err;
 
+	/* syscall data as a temporary stack area (bottom half). */
 	err = stub_syscall2(__NR_clone, CLONE_PARENT | CLONE_FILES | SIGCHLD,
-			    (unsigned long)data +
-				STUB_DATA_PAGES * UM_KERN_PAGE_SIZE / 2);
+			    (unsigned long) data->syscall_data +
+					    sizeof(data->syscall_data) / 2 -
+					    sizeof(void *));
 	if (err) {
 		data->parent_err = err;
 		goto done;
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 656fe16c9b63..b6b9679e6e11 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -13,6 +13,10 @@
 #include <as-layout.h>
 #include <os.h>
 #include <skas.h>
+#include <stub-data.h>
+
+/* Ensure the stub_data struct covers the allocated area */
+static_assert(sizeof(struct stub_data) == STUB_DATA_PAGES * UM_KERN_PAGE_SIZE);
 
 int init_new_context(struct task_struct *task, struct mm_struct *mm)
 {
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 1f5c3f2523d1..72114720be10 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -470,11 +470,12 @@ static int __init init_thread_regs(void)
 	thread_regs[REGS_IP_INDEX] = STUB_CODE +
 				(unsigned long) stub_clone_handler -
 				(unsigned long) __syscall_stub_start;
-	thread_regs[REGS_SP_INDEX] = STUB_DATA + STUB_DATA_PAGES * UM_KERN_PAGE_SIZE -
-		sizeof(void *);
-#ifdef __SIGNAL_FRAMESIZE
-	thread_regs[REGS_SP_INDEX] -= __SIGNAL_FRAMESIZE;
-#endif
+
+	/* syscall data as a temporary stack area (top half). */
+	thread_regs[REGS_SP_INDEX] = STUB_DATA +
+				     offsetof(struct stub_data, syscall_data) +
+				     sizeof(((struct stub_data *) 0)->syscall_data) -
+				     sizeof(void *);
 	return 0;
 }
 
-- 
2.44.0




* [PATCH 03/12] um: Add generic stub_syscall6 function
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
  2024-04-18  9:23 ` [PATCH 01/12] um: Remove stub-data.h include from common-offsets.h benjamin
  2024-04-18  9:23 ` [PATCH 02/12] um: Create signal stack memory assignment in stub_data benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 04/12] um: Rework syscall handling benjamin
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin@sipsolutions.net>

This function will be used by the new syscall handling code.
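
For illustration, the stub code added later in this series invokes it
along these lines (a sketch; addr, length, prot, fd and offset are
placeholders):

	res = stub_syscall6(STUB_MMAP_NR, addr, length, prot,
			    MAP_SHARED | MAP_FIXED, fd, offset);
	/* anything other than the requested address is an error */
	if (res != addr)
		err = res;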

Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
---
 arch/x86/um/shared/sysdep/stub_32.h | 22 ++++++++++++++++++++++
 arch/x86/um/shared/sysdep/stub_64.h | 16 ++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/arch/x86/um/shared/sysdep/stub_32.h b/arch/x86/um/shared/sysdep/stub_32.h
index 2748b7ee031a..ab08a69fb57f 100644
--- a/arch/x86/um/shared/sysdep/stub_32.h
+++ b/arch/x86/um/shared/sysdep/stub_32.h
@@ -80,6 +80,28 @@ static __always_inline long stub_syscall5(long syscall, long arg1, long arg2,
 	return ret;
 }
 
+static __always_inline long stub_syscall6(long syscall, long arg1, long arg2,
+					  long arg3, long arg4, long arg5,
+					  long arg6)
+{
+	struct syscall_args {
+		int ebx, ebp;
+	} args = { arg1, arg6 };
+	long ret;
+
+	__asm__ volatile ("pushl %%ebp;"
+			"movl 0x4(%%ebx),%%ebp;"
+			"movl (%%ebx),%%ebx;"
+			"int $0x80;"
+			"popl %%ebp"
+			: "=a" (ret)
+			: "0" (syscall), "b" (&args),
+			"c" (arg2), "d" (arg3), "S" (arg4), "D" (arg5)
+			: "memory");
+
+	return ret;
+}
+
 static __always_inline void trap_myself(void)
 {
 	__asm("int3");
diff --git a/arch/x86/um/shared/sysdep/stub_64.h b/arch/x86/um/shared/sysdep/stub_64.h
index 50c5e0529dfb..d27b34d75d70 100644
--- a/arch/x86/um/shared/sysdep/stub_64.h
+++ b/arch/x86/um/shared/sysdep/stub_64.h
@@ -80,6 +80,22 @@ static __always_inline long stub_syscall5(long syscall, long arg1, long arg2,
 	return ret;
 }
 
+static __always_inline long stub_syscall6(long syscall, long arg1, long arg2,
+					  long arg3, long arg4, long arg5,
+					  long arg6)
+{
+	long ret;
+
+	__asm__ volatile ("movq %5,%%r10 ; movq %6,%%r8 ; movq %7,%%r9 ; "
+		__syscall
+		: "=a" (ret)
+		: "0" (syscall), "D" (arg1), "S" (arg2), "d" (arg3),
+		  "g" (arg4), "g" (arg5), "g" (arg6)
+		: __syscall_clobber, "r10", "r8", "r9");
+
+	return ret;
+}
+
 static __always_inline void trap_myself(void)
 {
 	__asm("int3");
-- 
2.44.0




* [PATCH 04/12] um: Rework syscall handling
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (2 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 03/12] um: Add generic stub_syscall6 function benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 05/12] um: compress memory related stub syscalls while adding them benjamin
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin@sipsolutions.net>

Rework syscall handling to be platform independent. Also create a clean
split between queueing of syscalls and flushing them out, removing the
need to keep state in the code that triggers the syscalls.

The code adds syscall_data_len to the global mm_id structure. This
will later allow the surrounding code to track whether syscalls still
need to run and whether errors occurred.
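
In sketch form, queueing follows this pattern (mirroring the new map()
helper below; error handling elided):

	struct stub_syscall *sc;

	sc = syscall_stub_alloc(mm_idp);	/* reserve a queue entry */
	sc->syscall = STUB_SYSCALL_MMAP;	/* describe the request */
	sc->mem.addr = virt;
	sc->mem.length = len;

	/* later: run everything queued so far in one stub invocation */
	err = syscall_stub_flush(mm_idp);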

Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
---
 arch/um/include/shared/os.h             |  22 +--
 arch/um/include/shared/skas/mm_id.h     |   1 +
 arch/um/include/shared/skas/stub-data.h |  35 +++-
 arch/um/include/shared/user.h           |   8 +
 arch/um/kernel/exec.c                   |  10 +-
 arch/um/kernel/skas/Makefile            |   7 +-
 arch/um/kernel/skas/clone.c             |   2 +-
 arch/um/kernel/skas/stub.c              |  80 +++++++++
 arch/um/kernel/tlb.c                    |  42 +++--
 arch/um/os-Linux/skas/mem.c             | 212 +++++++++++-------------
 arch/um/os-Linux/skas/process.c         |   4 +-
 arch/x86/um/Makefile                    |   2 +-
 arch/x86/um/ldt.c                       |  41 ++---
 arch/x86/um/shared/sysdep/stub.h        |   1 +
 arch/x86/um/stub_32.S                   |  56 -------
 arch/x86/um/stub_64.S                   |  50 ------
 16 files changed, 275 insertions(+), 298 deletions(-)
 create mode 100644 arch/um/kernel/skas/stub.c
 delete mode 100644 arch/x86/um/stub_32.S
 delete mode 100644 arch/x86/um/stub_64.S

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index aff8906304ea..16d726f3df84 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -268,19 +268,15 @@ extern long long os_persistent_clock_emulation(void);
 extern long long os_nsecs(void);
 
 /* skas/mem.c */
-extern long run_syscall_stub(struct mm_id * mm_idp,
-			     int syscall, unsigned long *args, long expected,
-			     void **addr, int done);
-extern long syscall_stub_data(struct mm_id * mm_idp,
-			      unsigned long *data, int data_count,
-			      void **addr, void **stub_addr);
-extern int map(struct mm_id * mm_idp, unsigned long virt,
-	       unsigned long len, int prot, int phys_fd,
-	       unsigned long long offset, int done, void **data);
-extern int unmap(struct mm_id * mm_idp, unsigned long addr, unsigned long len,
-		 int done, void **data);
-extern int protect(struct mm_id * mm_idp, unsigned long addr,
-		   unsigned long len, unsigned int prot, int done, void **data);
+int syscall_stub_flush(struct mm_id *mm_idp);
+struct stub_syscall *syscall_stub_alloc(struct mm_id *mm_idp);
+
+void map(struct mm_id *mm_idp, unsigned long virt,
+	 unsigned long len, int prot, int phys_fd,
+	 unsigned long long offset);
+void unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len);
+void protect(struct mm_id *mm_idp, unsigned long addr,
+	     unsigned long len, unsigned int prot);
 
 /* skas/process.c */
 extern int is_skas_winch(int pid, int fd, void *data);
diff --git a/arch/um/include/shared/skas/mm_id.h b/arch/um/include/shared/skas/mm_id.h
index e82e203f5f41..bcb951719b51 100644
--- a/arch/um/include/shared/skas/mm_id.h
+++ b/arch/um/include/shared/skas/mm_id.h
@@ -13,6 +13,7 @@ struct mm_id {
 	} u;
 	unsigned long stack;
 	int kill;
+	int syscall_data_len;
 };
 
 #endif
diff --git a/arch/um/include/shared/skas/stub-data.h b/arch/um/include/shared/skas/stub-data.h
index 779d2a3bac5d..6ffa9fb5218c 100644
--- a/arch/um/include/shared/skas/stub-data.h
+++ b/arch/um/include/shared/skas/stub-data.h
@@ -10,14 +10,45 @@
 
 #include <linux/compiler_types.h>
 #include <as-layout.h>
+#include <sysdep/tls.h>
+
+#define STUB_NEXT_SYSCALL(s) \
+	((struct stub_syscall *) (((unsigned long) s) + (s)->cmd_len))
+
+enum stub_syscall_type {
+	STUB_SYSCALL_UNSET = 0,
+	STUB_SYSCALL_MMAP,
+	STUB_SYSCALL_MUNMAP,
+	STUB_SYSCALL_MPROTECT,
+	STUB_SYSCALL_LDT,
+};
+
+struct stub_syscall {
+	union {
+		struct {
+			unsigned long addr;
+			unsigned long length;
+			unsigned long offset;
+			int fd;
+			int prot;
+		} mem;
+		struct {
+			user_desc_t desc;
+			int func;
+		} ldt;
+	};
+
+	enum stub_syscall_type syscall;
+};
 
 struct stub_data {
 	unsigned long offset;
 	int fd;
-	long parent_err, child_err;
+	long err, child_err;
 
+	int syscall_data_len;
 	/* 128 leaves enough room for additional fields in the struct */
-	unsigned char syscall_data[UM_KERN_PAGE_SIZE - 128] __aligned(16);
+	struct stub_syscall syscall_data[(UM_KERN_PAGE_SIZE - 128) / sizeof(struct stub_syscall)] __aligned(16);
 
 	/* Stack for our signal handlers and for calling into . */
 	unsigned char sigstack[UM_KERN_PAGE_SIZE] __aligned(UM_KERN_PAGE_SIZE);
diff --git a/arch/um/include/shared/user.h b/arch/um/include/shared/user.h
index 326e52450e41..bbab79c0c074 100644
--- a/arch/um/include/shared/user.h
+++ b/arch/um/include/shared/user.h
@@ -42,11 +42,19 @@ extern void panic(const char *fmt, ...)
 #define printk(...) _printk(__VA_ARGS__)
 extern int _printk(const char *fmt, ...)
 	__attribute__ ((format (printf, 1, 2)));
+extern void print_hex_dump(const char *level, const char *prefix_str,
+			   int prefix_type, int rowsize, int groupsize,
+			   const void *buf, size_t len, _Bool ascii);
 #else
 static inline int printk(const char *fmt, ...)
 {
 	return 0;
 }
+static inline void print_hex_dump(const char *level, const char *prefix_str,
+				  int prefix_type, int rowsize, int groupsize,
+				  const void *buf, size_t len, _Bool ascii)
+{
+}
 #endif
 
 extern int in_aton(char *str);
diff --git a/arch/um/kernel/exec.c b/arch/um/kernel/exec.c
index 827a0d3fa589..5c8836b012e9 100644
--- a/arch/um/kernel/exec.c
+++ b/arch/um/kernel/exec.c
@@ -22,15 +22,11 @@
 
 void flush_thread(void)
 {
-	void *data = NULL;
-	int ret;
-
 	arch_flush_thread(&current->thread.arch);
 
-	ret = unmap(&current->mm->context.id, 0, TASK_SIZE, 1, &data);
-	if (ret) {
-		printk(KERN_ERR "%s - clearing address space failed, err = %d\n",
-		       __func__, ret);
+	unmap(&current->mm->context.id, 0, TASK_SIZE);
+	if (syscall_stub_flush(&current->mm->context.id) < 0) {
+		printk(KERN_ERR "%s - clearing address space failed", __func__);
 		force_sig(SIGKILL);
 	}
 	get_safe_registers(current_pt_regs()->regs.gp,
diff --git a/arch/um/kernel/skas/Makefile b/arch/um/kernel/skas/Makefile
index f93972a25765..dd8bc2167e36 100644
--- a/arch/um/kernel/skas/Makefile
+++ b/arch/um/kernel/skas/Makefile
@@ -3,14 +3,15 @@
 # Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
 #
 
-obj-y := clone.o mmu.o process.o syscall.o uaccess.o
+obj-y := clone.o stub.o mmu.o process.o syscall.o uaccess.o
 
-# clone.o is in the stub, so it can't be built with profiling
+# clone.o and stub.o are in the stub, so they can't be built with profiling
 # GCC hardened also auto-enables -fpic, but we need %ebx so it can't work ->
 # disable it
 
 CFLAGS_clone.o := $(CFLAGS_NO_HARDENING)
-UNPROFILE_OBJS := clone.o
+CFLAGS_stub.o := $(CFLAGS_NO_HARDENING)
+UNPROFILE_OBJS := clone.o stub.o
 
 KCOV_INSTRUMENT := n
 
diff --git a/arch/um/kernel/skas/clone.c b/arch/um/kernel/skas/clone.c
index 906f7454887c..b59fa43d68ce 100644
--- a/arch/um/kernel/skas/clone.c
+++ b/arch/um/kernel/skas/clone.c
@@ -33,7 +33,7 @@ stub_clone_handler(void)
 					    sizeof(data->syscall_data) / 2 -
 					    sizeof(void *));
 	if (err) {
-		data->parent_err = err;
+		data->err = err;
 		goto done;
 	}
 
diff --git a/arch/um/kernel/skas/stub.c b/arch/um/kernel/skas/stub.c
new file mode 100644
index 000000000000..d17536612aa8
--- /dev/null
+++ b/arch/um/kernel/skas/stub.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2021 Benjamin Berg <benjamin@sipsolutions.net>
+ */
+
+#include <sysdep/stub.h>
+
+static __always_inline int syscall_handler(struct stub_data *d)
+{
+	int i;
+	unsigned long res;
+
+	for (i = 0; i < d->syscall_data_len; i++) {
+		struct stub_syscall *sc = &d->syscall_data[i];
+
+		switch (sc->syscall) {
+		case STUB_SYSCALL_MMAP:
+			res = stub_syscall6(STUB_MMAP_NR,
+					    sc->mem.addr, sc->mem.length,
+					    sc->mem.prot,
+					    MAP_SHARED | MAP_FIXED,
+					    sc->mem.fd, sc->mem.offset);
+			if (res != sc->mem.addr) {
+				d->err = res;
+				d->syscall_data_len = i;
+				return -1;
+			}
+			break;
+		case STUB_SYSCALL_MUNMAP:
+			res = stub_syscall2(__NR_munmap,
+					    sc->mem.addr, sc->mem.length);
+			if (res) {
+				d->err = res;
+				d->syscall_data_len = i;
+				return -1;
+			}
+			break;
+		case STUB_SYSCALL_MPROTECT:
+			res = stub_syscall3(__NR_mprotect,
+					    sc->mem.addr, sc->mem.length,
+					    sc->mem.prot);
+			if (res) {
+				d->err = res;
+				d->syscall_data_len = i;
+				return -1;
+			}
+			break;
+		case STUB_SYSCALL_LDT:
+			res = stub_syscall3(__NR_modify_ldt, sc->ldt.func,
+					    (unsigned long) &sc->ldt.desc,
+					    sizeof(sc->ldt.desc));
+			/* We only write, so the expected result is zero */
+			if (res) {
+				d->err = res;
+				d->syscall_data_len = i;
+				return -1;
+			}
+			break;
+		default:
+			d->err = -95; /* EOPNOTSUPP */
+			d->syscall_data_len = i;
+			return -1;
+		}
+	}
+
+	d->err = 0;
+	d->syscall_data_len = 0;
+
+	return 0;
+}
+
+void __section(".__syscall_stub")
+stub_syscall_handler(void)
+{
+	struct stub_data *d = get_stub_data();
+
+	syscall_handler(d);
+
+	trap_myself();
+}
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 7d050ab0f78a..2a7b66b6e367 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -70,21 +70,19 @@ static int do_ops(struct host_vm_change *hvc, int end,
 		switch (op->type) {
 		case MMAP:
 			if (hvc->userspace)
-				ret = map(&hvc->mm->context.id, op->u.mmap.addr,
-					  op->u.mmap.len, op->u.mmap.prot,
-					  op->u.mmap.fd,
-					  op->u.mmap.offset, finished,
-					  &hvc->data);
+				map(&hvc->mm->context.id, op->u.mmap.addr,
+				    op->u.mmap.len, op->u.mmap.prot,
+				    op->u.mmap.fd,
+				    op->u.mmap.offset);
 			else
 				map_memory(op->u.mmap.addr, op->u.mmap.offset,
 					   op->u.mmap.len, 1, 1, 1);
 			break;
 		case MUNMAP:
 			if (hvc->userspace)
-				ret = unmap(&hvc->mm->context.id,
-					    op->u.munmap.addr,
-					    op->u.munmap.len, finished,
-					    &hvc->data);
+				unmap(&hvc->mm->context.id,
+				      op->u.munmap.addr,
+				      op->u.munmap.len);
 			else
 				ret = os_unmap_memory(
 					(void *) op->u.munmap.addr,
@@ -93,11 +91,10 @@ static int do_ops(struct host_vm_change *hvc, int end,
 			break;
 		case MPROTECT:
 			if (hvc->userspace)
-				ret = protect(&hvc->mm->context.id,
-					      op->u.mprotect.addr,
-					      op->u.mprotect.len,
-					      op->u.mprotect.prot,
-					      finished, &hvc->data);
+				protect(&hvc->mm->context.id,
+					op->u.mprotect.addr,
+					op->u.mprotect.len,
+					op->u.mprotect.prot);
 			else
 				ret = os_protect_memory(
 					(void *) op->u.mprotect.addr,
@@ -112,6 +109,9 @@ static int do_ops(struct host_vm_change *hvc, int end,
 		}
 	}
 
+	if (hvc->userspace && finished)
+		ret = syscall_stub_flush(&hvc->mm->context.id);
+
 	if (ret == -ENOMEM)
 		report_enomem();
 
@@ -460,7 +460,6 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
 	pmd_t *pmd;
 	pte_t *pte;
 	struct mm_struct *mm = vma->vm_mm;
-	void *flush = NULL;
 	int r, w, x, prot, err = 0;
 	struct mm_id *mm_id;
 
@@ -503,14 +502,13 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
 			int fd;
 
 			fd = phys_mapping(pte_val(*pte) & PAGE_MASK, &offset);
-			err = map(mm_id, address, PAGE_SIZE, prot, fd, offset,
-				  1, &flush);
-		}
-		else err = unmap(mm_id, address, PAGE_SIZE, 1, &flush);
-	}
-	else if (pte_newprot(*pte))
-		err = protect(mm_id, address, PAGE_SIZE, prot, 1, &flush);
+			map(mm_id, address, PAGE_SIZE, prot, fd, offset);
+		} else
+			unmap(mm_id, address, PAGE_SIZE);
+	} else if (pte_newprot(*pte))
+		protect(mm_id, address, PAGE_SIZE, prot);
 
+	err = syscall_stub_flush(mm_id);
 	if (err) {
 		if (err == -ENOMEM)
 			report_enomem();
diff --git a/arch/um/os-Linux/skas/mem.c b/arch/um/os-Linux/skas/mem.c
index 953fb10f3f93..03c8cde0b89b 100644
--- a/arch/um/os-Linux/skas/mem.c
+++ b/arch/um/os-Linux/skas/mem.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
+ * Copyright (C) 2021 Benjamin Berg <benjamin@sipsolutions.net>
  * Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
  */
 
@@ -18,11 +19,11 @@
 #include <sysdep/ptrace.h>
 #include <sysdep/stub.h>
 
-extern char batch_syscall_stub[], __syscall_stub_start[];
+extern char __syscall_stub_start[];
 
 extern void wait_stub_done(int pid);
 
-static inline unsigned long *check_init_stack(struct mm_id * mm_idp,
+static inline unsigned long *check_init_stack(struct mm_id *mm_idp,
 					      unsigned long *stack)
 {
 	if (stack == NULL) {
@@ -37,22 +38,24 @@ static unsigned long syscall_regs[MAX_REG_NR];
 static int __init init_syscall_regs(void)
 {
 	get_safe_registers(syscall_regs, NULL);
+
 	syscall_regs[REGS_IP_INDEX] = STUB_CODE +
-		((unsigned long) batch_syscall_stub -
+		((unsigned long) stub_syscall_handler -
 		 (unsigned long) __syscall_stub_start);
-	syscall_regs[REGS_SP_INDEX] = STUB_DATA;
+	syscall_regs[REGS_SP_INDEX] = STUB_DATA +
+		offsetof(struct stub_data, sigstack) +
+		sizeof(((struct stub_data *) 0)->sigstack) -
+		sizeof(void *);
 
 	return 0;
 }
 
 __initcall(init_syscall_regs);
 
-static inline long do_syscall_stub(struct mm_id * mm_idp, void **addr)
+static inline long do_syscall_stub(struct mm_id *mm_idp)
 {
+	struct stub_data *proc_data = (void *)mm_idp->stack;
 	int n, i;
-	long ret, offset;
-	unsigned long * data;
-	unsigned long * syscall;
 	int err, pid = mm_idp->u.pid;
 
 	n = ptrace_setregs(pid, syscall_regs);
@@ -64,6 +67,9 @@ static inline long do_syscall_stub(struct mm_id * mm_idp, void **addr)
 		      __func__, -n);
 	}
 
+	/* Inform process how much we have filled in. */
+	proc_data->syscall_data_len = mm_idp->syscall_data_len;
+
 	err = ptrace(PTRACE_CONT, pid, 0, 0);
 	if (err)
 		panic("Failed to continue stub, pid = %d, errno = %d\n", pid,
@@ -72,135 +78,113 @@ static inline long do_syscall_stub(struct mm_id * mm_idp, void **addr)
 	wait_stub_done(pid);
 
 	/*
-	 * When the stub stops, we find the following values on the
-	 * beginning of the stack:
-	 * (long )return_value
-	 * (long )offset to failed sycall-data (0, if no error)
+	 * proc_data->err will be non-zero if there was an (unexpected) error.
+	 * In that case, syscall_data_len points to the last executed syscall,
+	 * otherwise it will be zero (but we do not need to rely on that).
 	 */
-	ret = *((unsigned long *) mm_idp->stack);
-	offset = *((unsigned long *) mm_idp->stack + 1);
-	if (offset) {
-		data = (unsigned long *)(mm_idp->stack + offset - STUB_DATA);
-		printk(UM_KERN_ERR "%s : ret = %ld, offset = %ld, data = %p\n",
-		       __func__, ret, offset, data);
-		syscall = (unsigned long *)((unsigned long)data + data[0]);
-		printk(UM_KERN_ERR "%s: syscall %ld failed, return value = 0x%lx, expected return value = 0x%lx\n",
-		       __func__, syscall[0], ret, syscall[7]);
-		printk(UM_KERN_ERR "    syscall parameters: 0x%lx 0x%lx 0x%lx 0x%lx 0x%lx 0x%lx\n",
-		       syscall[1], syscall[2], syscall[3],
-		       syscall[4], syscall[5], syscall[6]);
-		for (n = 1; n < data[0]/sizeof(long); n++) {
-			if (n == 1)
-				printk(UM_KERN_ERR "    additional syscall data:");
-			if (n % 4 == 1)
-				printk("\n" UM_KERN_ERR "      ");
-			printk("  0x%lx", data[n]);
-		}
-		if (n > 1)
-			printk("\n");
+	if (proc_data->err < 0) {
+		struct stub_syscall *sc;
+
+		if (proc_data->syscall_data_len < 0 ||
+		    proc_data->syscall_data_len >= ARRAY_SIZE(proc_data->syscall_data))
+			panic("Syscall data was corrupted by stub (len is: %d, expected maximum: %d)!",
+			      proc_data->syscall_data_len,
+			      mm_idp->syscall_data_len);
+
+		sc = &proc_data->syscall_data[proc_data->syscall_data_len];
+
+		printk(UM_KERN_ERR "%s : length = %d, last offset = %d",
+		       __func__, mm_idp->syscall_data_len,
+		       proc_data->syscall_data_len);
+		printk(UM_KERN_ERR "%s : stub syscall type %d failed, return value = 0x%lx\n",
+		       __func__, sc->syscall, proc_data->err);
+
+		print_hex_dump(UM_KERN_ERR,
+			       "    syscall data: ", 0,
+			       16, 4, sc, sizeof(*sc), 0);
+
+		mm_idp->syscall_data_len = proc_data->err;
+	} else {
+		mm_idp->syscall_data_len = 0;
 	}
-	else ret = 0;
-
-	*addr = check_init_stack(mm_idp, NULL);
 
-	return ret;
+	return mm_idp->syscall_data_len;
 }
 
-long run_syscall_stub(struct mm_id * mm_idp, int syscall,
-		      unsigned long *args, long expected, void **addr,
-		      int done)
+int syscall_stub_flush(struct mm_id *mm_idp)
 {
-	unsigned long *stack = check_init_stack(mm_idp, *addr);
-
-	*stack += sizeof(long);
-	stack += *stack / sizeof(long);
-
-	*stack++ = syscall;
-	*stack++ = args[0];
-	*stack++ = args[1];
-	*stack++ = args[2];
-	*stack++ = args[3];
-	*stack++ = args[4];
-	*stack++ = args[5];
-	*stack++ = expected;
-	*stack = 0;
-
-	if (!done && ((((unsigned long) stack) & ~UM_KERN_PAGE_MASK) <
-		     UM_KERN_PAGE_SIZE - 10 * sizeof(long))) {
-		*addr = stack;
+	int res;
+
+	if (mm_idp->syscall_data_len == 0)
 		return 0;
+
+	/* If an error happened already, report it and reset the state. */
+	if (mm_idp->syscall_data_len < 0) {
+		res = mm_idp->syscall_data_len;
+		mm_idp->syscall_data_len = 0;
+		return res;
 	}
 
-	return do_syscall_stub(mm_idp, addr);
+	res = do_syscall_stub(mm_idp);
+	mm_idp->syscall_data_len = 0;
+
+	return res;
 }
 
-long syscall_stub_data(struct mm_id * mm_idp,
-		       unsigned long *data, int data_count,
-		       void **addr, void **stub_addr)
+struct stub_syscall *syscall_stub_alloc(struct mm_id *mm_idp)
 {
-	unsigned long *stack;
-	int ret = 0;
-
-	/*
-	 * If *addr still is uninitialized, it *must* contain NULL.
-	 * Thus in this case do_syscall_stub correctly won't be called.
-	 */
-	if ((((unsigned long) *addr) & ~UM_KERN_PAGE_MASK) >=
-	   UM_KERN_PAGE_SIZE - (10 + data_count) * sizeof(long)) {
-		ret = do_syscall_stub(mm_idp, addr);
-		/* in case of error, don't overwrite data on stack */
-		if (ret)
-			return ret;
+	struct stub_syscall *sc;
+	struct stub_data *proc_data = (struct stub_data *) mm_idp->stack;
+
+	if (mm_idp->syscall_data_len > 0 &&
+	    mm_idp->syscall_data_len == ARRAY_SIZE(proc_data->syscall_data))
+		do_syscall_stub(mm_idp);
+
+	if (mm_idp->syscall_data_len < 0) {
+		/* Return dummy to retain error state. */
+		sc = &proc_data->syscall_data[0];
+	} else {
+		sc = &proc_data->syscall_data[mm_idp->syscall_data_len];
+		mm_idp->syscall_data_len += 1;
 	}
+	memset(sc, 0, sizeof(*sc));
 
-	stack = check_init_stack(mm_idp, *addr);
-	*addr = stack;
-
-	*stack = data_count * sizeof(long);
-
-	memcpy(stack + 1, data, data_count * sizeof(long));
-
-	*stub_addr = (void *)(((unsigned long)(stack + 1) &
-			       ~UM_KERN_PAGE_MASK) + STUB_DATA);
-
-	return 0;
+	return sc;
 }
 
-int map(struct mm_id * mm_idp, unsigned long virt, unsigned long len, int prot,
-	int phys_fd, unsigned long long offset, int done, void **data)
-{
-	int ret;
-	unsigned long args[] = { virt, len, prot,
-				 MAP_SHARED | MAP_FIXED, phys_fd,
-				 MMAP_OFFSET(offset) };
 
-	ret = run_syscall_stub(mm_idp, STUB_MMAP_NR, args, virt,
-			       data, done);
-
-	return ret;
+void map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
+	int phys_fd, unsigned long long offset)
+{
+	struct stub_syscall *sc;
+
+	sc = syscall_stub_alloc(mm_idp);
+	sc->syscall = STUB_SYSCALL_MMAP;
+	sc->mem.addr = virt;
+	sc->mem.length = len;
+	sc->mem.prot = prot;
+	sc->mem.fd = phys_fd;
+	sc->mem.offset = MMAP_OFFSET(offset);
 }
 
-int unmap(struct mm_id * mm_idp, unsigned long addr, unsigned long len,
-	  int done, void **data)
+void unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
 {
-	int ret;
-	unsigned long args[] = { (unsigned long) addr, len, 0, 0, 0,
-				 0 };
+	struct stub_syscall *sc;
 
-	ret = run_syscall_stub(mm_idp, __NR_munmap, args, 0,
-			       data, done);
-
-	return ret;
+	sc = syscall_stub_alloc(mm_idp);
+	sc->syscall = STUB_SYSCALL_MUNMAP;
+	sc->mem.addr = addr;
+	sc->mem.length = len;
 }
 
-int protect(struct mm_id * mm_idp, unsigned long addr, unsigned long len,
-	    unsigned int prot, int done, void **data)
+void protect(struct mm_id *mm_idp, unsigned long addr, unsigned long len,
+	    unsigned int prot)
 {
-	int ret;
-	unsigned long args[] = { addr, len, prot, 0, 0, 0 };
-
-	ret = run_syscall_stub(mm_idp, __NR_mprotect, args, 0,
-			       data, done);
+	struct stub_syscall *sc;
 
-	return ret;
+	sc = syscall_stub_alloc(mm_idp);
+	sc->syscall = STUB_SYSCALL_MPROTECT;
+	sc->mem.addr = addr;
+	sc->mem.length = len;
+	sc->mem.prot = prot;
 }
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 72114720be10..2bf03115817c 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -497,7 +497,7 @@ int copy_context_skas0(unsigned long new_stack, int pid)
 	*data = ((struct stub_data) {
 		.offset	= MMAP_OFFSET(new_offset),
 		.fd     = new_fd,
-		.parent_err = -ESRCH,
+		.err    = -ESRCH,
 		.child_err = 0,
 	});
 
@@ -534,7 +534,7 @@ int copy_context_skas0(unsigned long new_stack, int pid)
 
 	wait_stub_done(pid);
 
-	pid = data->parent_err;
+	pid = data->err;
 	if (pid < 0) {
 		printk(UM_KERN_ERR "%s - stub-parent reports error %d\n",
 		      __func__, -pid);
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 8bc72a51b257..4559d2a693c4 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -11,7 +11,7 @@ endif
 
 obj-y = bugs_$(BITS).o delay.o fault.o ldt.o \
 	ptrace_$(BITS).o ptrace_user.o setjmp_$(BITS).o signal.o \
-	stub_$(BITS).o stub_segv.o \
+	stub_segv.o \
 	sys_call_table_$(BITS).o sysrq_$(BITS).o tls_$(BITS).o \
 	mem_$(BITS).o subarch.o os-Linux/
 
diff --git a/arch/x86/um/ldt.c b/arch/x86/um/ldt.c
index 255a44dd415a..c99fc23b1290 100644
--- a/arch/x86/um/ldt.c
+++ b/arch/x86/um/ldt.c
@@ -12,33 +12,22 @@
 #include <os.h>
 #include <skas.h>
 #include <sysdep/tls.h>
+#include <stub-data.h>
 
 static inline int modify_ldt (int func, void *ptr, unsigned long bytecount)
 {
 	return syscall(__NR_modify_ldt, func, ptr, bytecount);
 }
 
-static long write_ldt_entry(struct mm_id *mm_idp, int func,
-		     struct user_desc *desc, void **addr, int done)
+static void write_ldt_entry(struct mm_id *mm_idp, int func,
+		     struct user_desc *desc)
 {
-	long res;
-	void *stub_addr;
-
-	BUILD_BUG_ON(sizeof(*desc) % sizeof(long));
-
-	res = syscall_stub_data(mm_idp, (unsigned long *)desc,
-				sizeof(*desc) / sizeof(long),
-				addr, &stub_addr);
-	if (!res) {
-		unsigned long args[] = { func,
-					 (unsigned long)stub_addr,
-					 sizeof(*desc),
-					 0, 0, 0 };
-		res = run_syscall_stub(mm_idp, __NR_modify_ldt, args,
-				       0, addr, done);
-	}
+	struct stub_syscall *sc;
 
-	return res;
+	sc = syscall_stub_alloc(mm_idp);
+	sc->syscall = STUB_SYSCALL_LDT;
+	sc->ldt.func = func;
+	memcpy(&sc->ldt.desc, desc, sizeof(*desc));
 }
 
 /*
@@ -127,7 +116,6 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int func)
 	int i, err;
 	struct user_desc ldt_info;
 	struct ldt_entry entry0, *ldt_p;
-	void *addr = NULL;
 
 	err = -EINVAL;
 	if (bytecount != sizeof(ldt_info))
@@ -148,7 +136,8 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int func)
 
 	mutex_lock(&ldt->lock);
 
-	err = write_ldt_entry(mm_idp, func, &ldt_info, &addr, 1);
+	write_ldt_entry(mm_idp, func, &ldt_info);
+	err = syscall_stub_flush(mm_idp);
 	if (err)
 		goto out_unlock;
 
@@ -166,7 +155,8 @@ static int write_ldt(void __user * ptr, unsigned long bytecount, int func)
 				err = -ENOMEM;
 				/* Undo the change in host */
 				memset(&ldt_info, 0, sizeof(ldt_info));
-				write_ldt_entry(mm_idp, 1, &ldt_info, &addr, 1);
+				write_ldt_entry(mm_idp, 1, &ldt_info);
+				syscall_stub_flush(mm_idp);
 				goto out_unlock;
 			}
 			if (i == 0) {
@@ -303,7 +293,6 @@ long init_new_ldt(struct mm_context *new_mm, struct mm_context *from_mm)
 	short * num_p;
 	int i;
 	long page, err=0;
-	void *addr = NULL;
 
 
 	mutex_init(&new_mm->arch.ldt.lock);
@@ -318,11 +307,9 @@ long init_new_ldt(struct mm_context *new_mm, struct mm_context *from_mm)
 		ldt_get_host_info();
 		for (num_p=host_ldt_entries; *num_p != -1; num_p++) {
 			desc.entry_number = *num_p;
-			err = write_ldt_entry(&new_mm->id, 1, &desc,
-					      &addr, *(num_p + 1) == -1);
-			if (err)
-				break;
+			write_ldt_entry(&new_mm->id, 1, &desc);
 		}
+		err = syscall_stub_flush(&new_mm->id);
 		new_mm->arch.ldt.entry_count = 0;
 
 		goto out;
diff --git a/arch/x86/um/shared/sysdep/stub.h b/arch/x86/um/shared/sysdep/stub.h
index ce0ca46ad383..579681d12158 100644
--- a/arch/x86/um/shared/sysdep/stub.h
+++ b/arch/x86/um/shared/sysdep/stub.h
@@ -12,4 +12,5 @@
 #endif
 
 extern void stub_segv_handler(int, siginfo_t *, void *);
+extern void stub_syscall_handler(void);
 extern void stub_clone_handler(void);
diff --git a/arch/x86/um/stub_32.S b/arch/x86/um/stub_32.S
deleted file mode 100644
index 8291899e6aaf..000000000000
--- a/arch/x86/um/stub_32.S
+++ /dev/null
@@ -1,56 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#include <as-layout.h>
-
-.section .__syscall_stub, "ax"
-
-	.globl batch_syscall_stub
-batch_syscall_stub:
-	/* %esp comes in as "top of page" */
-	mov %esp, %ecx
-	/* %esp has pointer to first operation */
-	add $8, %esp
-again:
-	/* load length of additional data */
-	mov	0x0(%esp), %eax
-
-	/* if(length == 0) : end of list */
-	/* write possible 0 to header */
-	mov	%eax, 0x4(%ecx)
-	cmpl	$0, %eax
-	jz	done
-
-	/* save current pointer */
-	mov	%esp, 0x4(%ecx)
-
-	/* skip additional data */
-	add	%eax, %esp
-
-	/* load syscall-# */
-	pop	%eax
-
-	/* load syscall params */
-	pop	%ebx
-	pop	%ecx
-	pop	%edx
-	pop	%esi
- 	pop	%edi
-	pop	%ebp
-
-	/* execute syscall */
-	int	$0x80
-
-	/* restore top of page pointer in %ecx */
-	mov	%esp, %ecx
-	andl	$(~UM_KERN_PAGE_SIZE) + 1, %ecx
-
-	/* check return value */
-	pop	%ebx
-	cmp	%ebx, %eax
-	je	again
-
-done:
-	/* save return value */
-	mov	%eax, (%ecx)
-
-	/* stop */
-	int3
diff --git a/arch/x86/um/stub_64.S b/arch/x86/um/stub_64.S
deleted file mode 100644
index f3404640197a..000000000000
--- a/arch/x86/um/stub_64.S
+++ /dev/null
@@ -1,50 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#include <as-layout.h>
-
-.section .__syscall_stub, "ax"
-	.globl batch_syscall_stub
-batch_syscall_stub:
-	/* %rsp has the pointer to first operation */
-	mov	%rsp, %rbx
-	add	$0x10, %rsp
-again:
-	/* load length of additional data */
-	mov	0x0(%rsp), %rax
-
-	/* if(length == 0) : end of list */
-	/* write possible 0 to header */
-	mov	%rax, 8(%rbx)
-	cmp	$0, %rax
-	jz	done
-
-	/* save current pointer */
-	mov	%rsp, 8(%rbx)
-
-	/* skip additional data */
-	add	%rax, %rsp
-
-	/* load syscall-# */
-	pop	%rax
-
-	/* load syscall params */
-	pop	%rdi
-	pop	%rsi
-	pop	%rdx
-	pop	%r10
- 	pop	%r8
-	pop	%r9
-
-	/* execute syscall */
-	syscall
-
-	/* check return value */
-	pop	%rcx
-	cmp	%rcx, %rax
-	je	again
-
-done:
-	/* save return value */
-	mov	%rax, (%rbx)
-
-	/* stop */
-	int3
-- 
2.44.0




* [PATCH 05/12] um: compress memory related stub syscalls while adding them
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (3 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 04/12] um: Rework syscall handling benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 06/12] um: remove LDT support benjamin
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

To reduce the number of syscalls that the stub has to perform, merge
two consecutive syscalls of the same type if the second is just a
continuation of the first.
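
As a sketch, two adjacent calls such as

	unmap(mm_idp, addr, PAGE_SIZE);
	unmap(mm_idp, addr + PAGE_SIZE, PAGE_SIZE);

now collapse into a single queued munmap of 2 * PAGE_SIZE: the second
call finds the previous entry satisfying sc->mem.addr + sc->mem.length
== addr + PAGE_SIZE and extends sc->mem.length instead of allocating a
new slot.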

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/os-Linux/skas/mem.c | 39 +++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/arch/um/os-Linux/skas/mem.c b/arch/um/os-Linux/skas/mem.c
index 03c8cde0b89b..e3128e080365 100644
--- a/arch/um/os-Linux/skas/mem.c
+++ b/arch/um/os-Linux/skas/mem.c
@@ -152,12 +152,37 @@ struct stub_syscall *syscall_stub_alloc(struct mm_id *mm_idp)
 	return sc;
 }
 
+static struct stub_syscall *syscall_stub_get_previous(struct mm_id *mm_idp,
+						      int syscall_type,
+						      unsigned long virt)
+{
+	if (mm_idp->syscall_data_len > 0) {
+		struct stub_data *proc_data = (void *) mm_idp->stack;
+		struct stub_syscall *sc;
+
+		sc = &proc_data->syscall_data[mm_idp->syscall_data_len - 1];
+
+		if (sc->syscall == syscall_type &&
+		    sc->mem.addr + sc->mem.length == virt)
+			return sc;
+	}
+
+	return NULL;
+}
 
 void map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
 	int phys_fd, unsigned long long offset)
 {
 	struct stub_syscall *sc;
 
+	/* Compress with previous syscall if that is possible */
+	sc = syscall_stub_get_previous(mm_idp, STUB_SYSCALL_MMAP, virt);
+	if (sc && sc->mem.prot == prot && sc->mem.fd == phys_fd &&
+	    sc->mem.offset == MMAP_OFFSET(offset - sc->mem.length)) {
+		sc->mem.length += len;
+		return;
+	}
+
 	sc = syscall_stub_alloc(mm_idp);
 	sc->syscall = STUB_SYSCALL_MMAP;
 	sc->mem.addr = virt;
@@ -171,6 +196,13 @@ void unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
 {
 	struct stub_syscall *sc;
 
+	/* Compress with previous syscall if that is possible */
+	sc = syscall_stub_get_previous(mm_idp, STUB_SYSCALL_MUNMAP, addr);
+	if (sc) {
+		sc->mem.length += len;
+		return;
+	}
+
 	sc = syscall_stub_alloc(mm_idp);
 	sc->syscall = STUB_SYSCALL_MUNMAP;
 	sc->mem.addr = addr;
@@ -182,6 +214,13 @@ void protect(struct mm_id *mm_idp, unsigned long addr, unsigned long len,
 {
 	struct stub_syscall *sc;
 
+	/* Compress with previous syscall if that is possible */
+	sc = syscall_stub_get_previous(mm_idp, STUB_SYSCALL_MPROTECT, addr);
+	if (sc && sc->mem.prot == prot) {
+		sc->mem.length += len;
+		return;
+	}
+
 	sc = syscall_stub_alloc(mm_idp);
 	sc->syscall = STUB_SYSCALL_MPROTECT;
 	sc->mem.addr = addr;
-- 
2.44.0




* [PATCH 06/12] um: remove LDT support
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (4 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 05/12] um: compress memory related stub syscalls while adding them benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 07/12] um: remove copy_context_skas0 benjamin
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

The current LDT code has a few issues which mean it should be redone
differently once we always start with a fresh MM, even when cloning.

In a new and better world, the kernel would just ensure its own LDT is
clear at startup. At that point, all that is needed is a simple function
to populate the LDT from another MM in arch_dup_mmap combined with some
tracking of the installed LDT entries for each MM.
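
A rough sketch of that hypothetical direction (not part of this
series; the ldt_entry_count/ldt_entries tracking shown here is made up
for illustration):

	void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
	{
		int i;

		/* replay the parent's tracked LDT entries into the fresh MM */
		for (i = 0; i < oldmm->context.ldt_entry_count; i++)
			write_ldt_entry(&mm->context.id, 1,
					&oldmm->context.ldt_entries[i]);

		syscall_stub_flush(&mm->context.id);
	}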

Note that the old implementation was even incorrect with regard to
reading, as it copied out the LDT entries in the internal format rather
than converting them to the userspace structure.

Removal should be fine as the LDT is not used for thread-local storage
anymore.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/include/asm/mmu.h               |   6 -
 arch/um/include/shared/skas/stub-data.h |  21 +-
 arch/um/kernel/skas/mmu.c               |   8 -
 arch/um/kernel/skas/stub.c              |  11 -
 arch/um/os-Linux/start_up.c             |   1 +
 arch/x86/um/Makefile                    |   2 +-
 arch/x86/um/asm/mm_context.h            |  70 -----
 arch/x86/um/ldt.c                       | 367 ------------------------
 arch/x86/um/tls_32.c                    |   1 +
 9 files changed, 10 insertions(+), 477 deletions(-)
 delete mode 100644 arch/x86/um/asm/mm_context.h
 delete mode 100644 arch/x86/um/ldt.c

diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index a7555e43ed14..37eb6e89e79a 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -7,17 +7,11 @@
 #define __ARCH_UM_MMU_H
 
 #include <mm_id.h>
-#include <asm/mm_context.h>
 
 typedef struct mm_context {
 	struct mm_id id;
-	struct uml_arch_mm_context arch;
 } mm_context_t;
 
 extern void __switch_mm(struct mm_id * mm_idp);
 
-/* Avoid tangled inclusion with asm/ldt.h */
-extern long init_new_ldt(struct mm_context *to_mm, struct mm_context *from_mm);
-extern void free_ldt(struct mm_context *mm);
-
 #endif
diff --git a/arch/um/include/shared/skas/stub-data.h b/arch/um/include/shared/skas/stub-data.h
index 6ffa9fb5218c..6b8caf6b8283 100644
--- a/arch/um/include/shared/skas/stub-data.h
+++ b/arch/um/include/shared/skas/stub-data.h
@@ -20,23 +20,16 @@ enum stub_syscall_type {
 	STUB_SYSCALL_MMAP,
 	STUB_SYSCALL_MUNMAP,
 	STUB_SYSCALL_MPROTECT,
-	STUB_SYSCALL_LDT,
 };
 
 struct stub_syscall {
-	union {
-		struct {
-			unsigned long addr;
-			unsigned long length;
-			unsigned long offset;
-			int fd;
-			int prot;
-		} mem;
-		struct {
-			user_desc_t desc;
-			int func;
-		} ldt;
-	};
+	struct {
+		unsigned long addr;
+		unsigned long length;
+		unsigned long offset;
+		int fd;
+		int prot;
+	} mem;
 
 	enum stub_syscall_type syscall;
 };
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index b6b9679e6e11..d6183cfd51fe 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -45,13 +45,6 @@ int init_new_context(struct task_struct *task, struct mm_struct *mm)
 		goto out_free;
 	}
 
-	ret = init_new_ldt(to_mm, from_mm);
-	if (ret < 0) {
-		printk(KERN_ERR "init_new_context_skas - init_ldt"
-		       " failed, errno = %d\n", ret);
-		goto out_free;
-	}
-
 	return 0;
 
  out_free:
@@ -79,5 +72,4 @@ void destroy_context(struct mm_struct *mm)
 	os_kill_ptraced_process(mmu->id.u.pid, 1);
 
 	free_pages(mmu->id.stack, ilog2(STUB_DATA_PAGES));
-	free_ldt(mmu);
 }
diff --git a/arch/um/kernel/skas/stub.c b/arch/um/kernel/skas/stub.c
index d17536612aa8..144fb133294e 100644
--- a/arch/um/kernel/skas/stub.c
+++ b/arch/um/kernel/skas/stub.c
@@ -45,17 +45,6 @@ static __always_inline int syscall_handler(struct stub_data *d)
 				return -1;
 			}
 			break;
-		case STUB_SYSCALL_LDT:
-			res = stub_syscall3(__NR_modify_ldt, sc->ldt.func,
-					    (unsigned long) &sc->ldt.desc,
-					    sizeof(sc->ldt.desc));
-			/* We only write, so the expected result is zero */
-			if (res) {
-				d->err = res;
-				d->syscall_data_len = i;
-				return -1;
-			}
-			break;
 		default:
 			d->err = -95; /* EOPNOTSUPP */
 			d->syscall_data_len = i;
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 8b0e98ab842c..1baabcc9bbd5 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -17,6 +17,7 @@
 #include <sys/wait.h>
 #include <sys/time.h>
 #include <sys/resource.h>
+#include <asm/ldt.h>
 #include <asm/unistd.h>
 #include <init.h>
 #include <os.h>
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 4559d2a693c4..17b85209c43d 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -9,7 +9,7 @@ else
 	BITS := 64
 endif
 
-obj-y = bugs_$(BITS).o delay.o fault.o ldt.o \
+obj-y = bugs_$(BITS).o delay.o fault.o \
 	ptrace_$(BITS).o ptrace_user.o setjmp_$(BITS).o signal.o \
 	stub_segv.o \
 	sys_call_table_$(BITS).o sysrq_$(BITS).o tls_$(BITS).o \
diff --git a/arch/x86/um/asm/mm_context.h b/arch/x86/um/asm/mm_context.h
deleted file mode 100644
index dc32dc023c2f..000000000000
--- a/arch/x86/um/asm/mm_context.h
+++ /dev/null
@@ -1,70 +0,0 @@
-/*
- * Copyright (C) 2004 Fujitsu Siemens Computers GmbH
- * Licensed under the GPL
- *
- * Author: Bodo Stroesser <bstroesser@fujitsu-siemens.com>
- */
-
-#ifndef __ASM_LDT_H
-#define __ASM_LDT_H
-
-#include <linux/mutex.h>
-#include <asm/ldt.h>
-
-#define LDT_PAGES_MAX \
-	((LDT_ENTRIES * LDT_ENTRY_SIZE)/PAGE_SIZE)
-#define LDT_ENTRIES_PER_PAGE \
-	(PAGE_SIZE/LDT_ENTRY_SIZE)
-#define LDT_DIRECT_ENTRIES \
-	((LDT_PAGES_MAX*sizeof(void *))/LDT_ENTRY_SIZE)
-
-struct ldt_entry {
-	__u32 a;
-	__u32 b;
-};
-
-typedef struct uml_ldt {
-	int entry_count;
-	struct mutex lock;
-	union {
-		struct ldt_entry * pages[LDT_PAGES_MAX];
-		struct ldt_entry entries[LDT_DIRECT_ENTRIES];
-	} u;
-} uml_ldt_t;
-
-#define LDT_entry_a(info) \
-	((((info)->base_addr & 0x0000ffff) << 16) | ((info)->limit & 0x0ffff))
-
-#define LDT_entry_b(info) \
-	(((info)->base_addr & 0xff000000) | \
-	(((info)->base_addr & 0x00ff0000) >> 16) | \
-	((info)->limit & 0xf0000) | \
-	(((info)->read_exec_only ^ 1) << 9) | \
-	((info)->contents << 10) | \
-	(((info)->seg_not_present ^ 1) << 15) | \
-	((info)->seg_32bit << 22) | \
-	((info)->limit_in_pages << 23) | \
-	((info)->useable << 20) | \
-	0x7000)
-
-#define _LDT_empty(info) (\
-	(info)->base_addr	== 0	&& \
-	(info)->limit		== 0	&& \
-	(info)->contents	== 0	&& \
-	(info)->read_exec_only	== 1	&& \
-	(info)->seg_32bit	== 0	&& \
-	(info)->limit_in_pages	== 0	&& \
-	(info)->seg_not_present	== 1	&& \
-	(info)->useable		== 0	)
-
-#ifdef CONFIG_X86_64
-#define LDT_empty(info) (_LDT_empty(info) && ((info)->lm == 0))
-#else
-#define LDT_empty(info) (_LDT_empty(info))
-#endif
-
-struct uml_arch_mm_context {
-	uml_ldt_t ldt;
-};
-
-#endif
diff --git a/arch/x86/um/ldt.c b/arch/x86/um/ldt.c
deleted file mode 100644
index c99fc23b1290..000000000000
--- a/arch/x86/um/ldt.c
+++ /dev/null
@@ -1,367 +0,0 @@
-/*
- * Copyright (C) 2001 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
- * Licensed under the GPL
- */
-
-#include <linux/mm.h>
-#include <linux/sched.h>
-#include <linux/slab.h>
-#include <linux/syscalls.h>
-#include <linux/uaccess.h>
-#include <asm/unistd.h>
-#include <os.h>
-#include <skas.h>
-#include <sysdep/tls.h>
-#include <stub-data.h>
-
-static inline int modify_ldt (int func, void *ptr, unsigned long bytecount)
-{
-	return syscall(__NR_modify_ldt, func, ptr, bytecount);
-}
-
-static void write_ldt_entry(struct mm_id *mm_idp, int func,
-		     struct user_desc *desc)
-{
-	struct stub_syscall *sc;
-
-	sc = syscall_stub_alloc(mm_idp);
-	sc->syscall = STUB_SYSCALL_LDT;
-	sc->ldt.func = func;
-	memcpy(&sc->ldt.desc, desc, sizeof(*desc));
-}
-
-/*
- * In skas mode, we hold our own ldt data in UML.
- * Thus, the code implementing sys_modify_ldt_skas
- * is very similar to (and mostly stolen from) sys_modify_ldt
- * for arch/i386/kernel/ldt.c
- * The routines copied and modified in part are:
- * - read_ldt
- * - read_default_ldt
- * - write_ldt
- * - sys_modify_ldt_skas
- */
-
-static int read_ldt(void __user * ptr, unsigned long bytecount)
-{
-	int i, err = 0;
-	unsigned long size;
-	uml_ldt_t *ldt = &current->mm->context.arch.ldt;
-
-	if (!ldt->entry_count)
-		goto out;
-	if (bytecount > LDT_ENTRY_SIZE*LDT_ENTRIES)
-		bytecount = LDT_ENTRY_SIZE*LDT_ENTRIES;
-	err = bytecount;
-
-	mutex_lock(&ldt->lock);
-	if (ldt->entry_count <= LDT_DIRECT_ENTRIES) {
-		size = LDT_ENTRY_SIZE*LDT_DIRECT_ENTRIES;
-		if (size > bytecount)
-			size = bytecount;
-		if (copy_to_user(ptr, ldt->u.entries, size))
-			err = -EFAULT;
-		bytecount -= size;
-		ptr += size;
-	}
-	else {
-		for (i=0; i<ldt->entry_count/LDT_ENTRIES_PER_PAGE && bytecount;
-		     i++) {
-			size = PAGE_SIZE;
-			if (size > bytecount)
-				size = bytecount;
-			if (copy_to_user(ptr, ldt->u.pages[i], size)) {
-				err = -EFAULT;
-				break;
-			}
-			bytecount -= size;
-			ptr += size;
-		}
-	}
-	mutex_unlock(&ldt->lock);
-
-	if (bytecount == 0 || err == -EFAULT)
-		goto out;
-
-	if (clear_user(ptr, bytecount))
-		err = -EFAULT;
-
-out:
-	return err;
-}
-
-static int read_default_ldt(void __user * ptr, unsigned long bytecount)
-{
-	int err;
-
-	if (bytecount > 5*LDT_ENTRY_SIZE)
-		bytecount = 5*LDT_ENTRY_SIZE;
-
-	err = bytecount;
-	/*
-	 * UML doesn't support lcall7 and lcall27.
-	 * So, we don't really have a default ldt, but emulate
-	 * an empty ldt of common host default ldt size.
-	 */
-	if (clear_user(ptr, bytecount))
-		err = -EFAULT;
-
-	return err;
-}
-
-static int write_ldt(void __user * ptr, unsigned long bytecount, int func)
-{
-	uml_ldt_t *ldt = &current->mm->context.arch.ldt;
-	struct mm_id * mm_idp = &current->mm->context.id;
-	int i, err;
-	struct user_desc ldt_info;
-	struct ldt_entry entry0, *ldt_p;
-
-	err = -EINVAL;
-	if (bytecount != sizeof(ldt_info))
-		goto out;
-	err = -EFAULT;
-	if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info)))
-		goto out;
-
-	err = -EINVAL;
-	if (ldt_info.entry_number >= LDT_ENTRIES)
-		goto out;
-	if (ldt_info.contents == 3) {
-		if (func == 1)
-			goto out;
-		if (ldt_info.seg_not_present == 0)
-			goto out;
-	}
-
-	mutex_lock(&ldt->lock);
-
-	write_ldt_entry(mm_idp, func, &ldt_info);
-	err = syscall_stub_flush(mm_idp);
-	if (err)
-		goto out_unlock;
-
-	if (ldt_info.entry_number >= ldt->entry_count &&
-	    ldt_info.entry_number >= LDT_DIRECT_ENTRIES) {
-		for (i=ldt->entry_count/LDT_ENTRIES_PER_PAGE;
-		     i*LDT_ENTRIES_PER_PAGE <= ldt_info.entry_number;
-		     i++) {
-			if (i == 0)
-				memcpy(&entry0, ldt->u.entries,
-				       sizeof(entry0));
-			ldt->u.pages[i] = (struct ldt_entry *)
-				__get_free_page(GFP_KERNEL|__GFP_ZERO);
-			if (!ldt->u.pages[i]) {
-				err = -ENOMEM;
-				/* Undo the change in host */
-				memset(&ldt_info, 0, sizeof(ldt_info));
-				write_ldt_entry(mm_idp, 1, &ldt_info);
-				err = syscall_stub_flush(mm_idp);
-				goto out_unlock;
-			}
-			if (i == 0) {
-				memcpy(ldt->u.pages[0], &entry0,
-				       sizeof(entry0));
-				memcpy(ldt->u.pages[0]+1, ldt->u.entries+1,
-				       sizeof(entry0)*(LDT_DIRECT_ENTRIES-1));
-			}
-			ldt->entry_count = (i + 1) * LDT_ENTRIES_PER_PAGE;
-		}
-	}
-	if (ldt->entry_count <= ldt_info.entry_number)
-		ldt->entry_count = ldt_info.entry_number + 1;
-
-	if (ldt->entry_count <= LDT_DIRECT_ENTRIES)
-		ldt_p = ldt->u.entries + ldt_info.entry_number;
-	else
-		ldt_p = ldt->u.pages[ldt_info.entry_number/LDT_ENTRIES_PER_PAGE] +
-			ldt_info.entry_number%LDT_ENTRIES_PER_PAGE;
-
-	if (ldt_info.base_addr == 0 && ldt_info.limit == 0 &&
-	   (func == 1 || LDT_empty(&ldt_info))) {
-		ldt_p->a = 0;
-		ldt_p->b = 0;
-	}
-	else{
-		if (func == 1)
-			ldt_info.useable = 0;
-		ldt_p->a = LDT_entry_a(&ldt_info);
-		ldt_p->b = LDT_entry_b(&ldt_info);
-	}
-	err = 0;
-
-out_unlock:
-	mutex_unlock(&ldt->lock);
-out:
-	return err;
-}
-
-static long do_modify_ldt_skas(int func, void __user *ptr,
-			       unsigned long bytecount)
-{
-	int ret = -ENOSYS;
-
-	switch (func) {
-		case 0:
-			ret = read_ldt(ptr, bytecount);
-			break;
-		case 1:
-		case 0x11:
-			ret = write_ldt(ptr, bytecount, func);
-			break;
-		case 2:
-			ret = read_default_ldt(ptr, bytecount);
-			break;
-	}
-	return ret;
-}
-
-static DEFINE_SPINLOCK(host_ldt_lock);
-static short dummy_list[9] = {0, -1};
-static short * host_ldt_entries = NULL;
-
-static void ldt_get_host_info(void)
-{
-	long ret;
-	struct ldt_entry * ldt;
-	short *tmp;
-	int i, size, k, order;
-
-	spin_lock(&host_ldt_lock);
-
-	if (host_ldt_entries != NULL) {
-		spin_unlock(&host_ldt_lock);
-		return;
-	}
-	host_ldt_entries = dummy_list+1;
-
-	spin_unlock(&host_ldt_lock);
-
-	for (i = LDT_PAGES_MAX-1, order=0; i; i>>=1, order++)
-		;
-
-	ldt = (struct ldt_entry *)
-	      __get_free_pages(GFP_KERNEL|__GFP_ZERO, order);
-	if (ldt == NULL) {
-		printk(KERN_ERR "ldt_get_host_info: couldn't allocate buffer "
-		       "for host ldt\n");
-		return;
-	}
-
-	ret = modify_ldt(0, ldt, (1<<order)*PAGE_SIZE);
-	if (ret < 0) {
-		printk(KERN_ERR "ldt_get_host_info: couldn't read host ldt\n");
-		goto out_free;
-	}
-	if (ret == 0) {
-		/* default_ldt is active, simply write an empty entry 0 */
-		host_ldt_entries = dummy_list;
-		goto out_free;
-	}
-
-	for (i=0, size=0; i<ret/LDT_ENTRY_SIZE; i++) {
-		if (ldt[i].a != 0 || ldt[i].b != 0)
-			size++;
-	}
-
-	if (size < ARRAY_SIZE(dummy_list))
-		host_ldt_entries = dummy_list;
-	else {
-		size = (size + 1) * sizeof(dummy_list[0]);
-		tmp = kmalloc(size, GFP_KERNEL);
-		if (tmp == NULL) {
-			printk(KERN_ERR "ldt_get_host_info: couldn't allocate "
-			       "host ldt list\n");
-			goto out_free;
-		}
-		host_ldt_entries = tmp;
-	}
-
-	for (i=0, k=0; i<ret/LDT_ENTRY_SIZE; i++) {
-		if (ldt[i].a != 0 || ldt[i].b != 0)
-			host_ldt_entries[k++] = i;
-	}
-	host_ldt_entries[k] = -1;
-
-out_free:
-	free_pages((unsigned long)ldt, order);
-}
-
-long init_new_ldt(struct mm_context *new_mm, struct mm_context *from_mm)
-{
-	struct user_desc desc;
-	short * num_p;
-	int i;
-	long page, err=0;
-
-
-	mutex_init(&new_mm->arch.ldt.lock);
-
-	if (!from_mm) {
-		memset(&desc, 0, sizeof(desc));
-		/*
-		 * Now we try to retrieve info about the ldt, we
-		 * inherited from the host. All ldt-entries found
-		 * will be reset in the following loop
-		 */
-		ldt_get_host_info();
-		for (num_p=host_ldt_entries; *num_p != -1; num_p++) {
-			desc.entry_number = *num_p;
-			write_ldt_entry(&new_mm->id, 1, &desc);
-		}
-		err = syscall_stub_flush(&new_mm->id);
-		new_mm->arch.ldt.entry_count = 0;
-
-		goto out;
-	}
-
-	/*
-	 * Our local LDT is used to supply the data for
-	 * modify_ldt(READLDT), if PTRACE_LDT isn't available,
-	 * i.e., we have to use the stub for modify_ldt, which
-	 * can't handle the big read buffer of up to 64kB.
-	 */
-	mutex_lock(&from_mm->arch.ldt.lock);
-	if (from_mm->arch.ldt.entry_count <= LDT_DIRECT_ENTRIES)
-		memcpy(new_mm->arch.ldt.u.entries, from_mm->arch.ldt.u.entries,
-		       sizeof(new_mm->arch.ldt.u.entries));
-	else {
-		i = from_mm->arch.ldt.entry_count / LDT_ENTRIES_PER_PAGE;
-		while (i-->0) {
-			page = __get_free_page(GFP_KERNEL|__GFP_ZERO);
-			if (!page) {
-				err = -ENOMEM;
-				break;
-			}
-			new_mm->arch.ldt.u.pages[i] =
-				(struct ldt_entry *) page;
-			memcpy(new_mm->arch.ldt.u.pages[i],
-			       from_mm->arch.ldt.u.pages[i], PAGE_SIZE);
-		}
-	}
-	new_mm->arch.ldt.entry_count = from_mm->arch.ldt.entry_count;
-	mutex_unlock(&from_mm->arch.ldt.lock);
-
-    out:
-	return err;
-}
-
-
-void free_ldt(struct mm_context *mm)
-{
-	int i;
-
-	if (mm->arch.ldt.entry_count > LDT_DIRECT_ENTRIES) {
-		i = mm->arch.ldt.entry_count / LDT_ENTRIES_PER_PAGE;
-		while (i-- > 0)
-			free_page((long) mm->arch.ldt.u.pages[i]);
-	}
-	mm->arch.ldt.entry_count = 0;
-}
-
-SYSCALL_DEFINE3(modify_ldt, int , func , void __user * , ptr ,
-		unsigned long , bytecount)
-{
-	/* See non-um modify_ldt() for why we do this cast */
-	return (unsigned int)do_modify_ldt_skas(func, ptr, bytecount);
-}
diff --git a/arch/x86/um/tls_32.c b/arch/x86/um/tls_32.c
index 66162eafd8e8..8d6a6656a5cf 100644
--- a/arch/x86/um/tls_32.c
+++ b/arch/x86/um/tls_32.c
@@ -11,6 +11,7 @@
 #include <os.h>
 #include <skas.h>
 #include <sysdep/tls.h>
+#include <asm/desc.h>
 
 /*
  * If needed we can detect when it's uninitialized.
-- 
2.44.0

* [PATCH 07/12] um: remove copy_context_skas0
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (5 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 06/12] um: remove LDT support benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 08/12] um: Delay flushing syscalls until the thread is restarted benjamin
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

The kernel flushes the affected memory ranges for CoW anyway and does not
assume that the userspace process has anything set up already. So, start
with a fresh stub process for each new mm context.
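
For orientation, a condensed sketch of the flow that remains in
init_new_context() after this change (simplified from the diff below,
error handling omitted):

	stack = __get_free_pages(GFP_KERNEL | __GFP_ZERO, ilog2(STUB_DATA_PAGES));
	new_id->stack = stack;                  /* stub data pages */
	new_id->u.pid = start_userspace(stack); /* always a fresh stub process */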

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/include/shared/os.h     |   1 -
 arch/um/kernel/skas/Makefile    |   8 +--
 arch/um/kernel/skas/clone.c     |  50 ---------------
 arch/um/kernel/skas/mmu.c       |  20 +++---
 arch/um/os-Linux/skas/process.c | 108 --------------------------------
 5 files changed, 10 insertions(+), 177 deletions(-)
 delete mode 100644 arch/um/kernel/skas/clone.c

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 16d726f3df84..dc341ed4724e 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -281,7 +281,6 @@ void protect(struct mm_id *mm_idp, unsigned long addr,
 /* skas/process.c */
 extern int is_skas_winch(int pid, int fd, void *data);
 extern int start_userspace(unsigned long stub_stack);
-extern int copy_context_skas0(unsigned long stack, int pid);
 extern void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs);
 extern void new_thread(void *stack, jmp_buf *buf, void (*handler)(void));
 extern void switch_threads(jmp_buf *me, jmp_buf *you);
diff --git a/arch/um/kernel/skas/Makefile b/arch/um/kernel/skas/Makefile
index dd8bc2167e36..6f86d53e3d69 100644
--- a/arch/um/kernel/skas/Makefile
+++ b/arch/um/kernel/skas/Makefile
@@ -3,16 +3,14 @@
 # Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
 #
 
-obj-y := clone.o stub.o mmu.o process.o syscall.o uaccess.o
+obj-y := stub.o mmu.o process.o syscall.o uaccess.o
 
-# clone.o and stub.o are in the stub, so it can't be built with profiling
+# stub.o is in the stub, so it can't be built with profiling
 # GCC hardened also auto-enables -fpic, but we need %ebx so it can't work ->
 # disable it
 
-CFLAGS_clone.o := $(CFLAGS_NO_HARDENING)
 CFLAGS_stub.o := $(CFLAGS_NO_HARDENING)
-UNPROFILE_OBJS := clone.o stub.o
-
+UNPROFILE_OBJS := stub.o
 KCOV_INSTRUMENT := n
 
 include $(srctree)/arch/um/scripts/Makefile.rules
diff --git a/arch/um/kernel/skas/clone.c b/arch/um/kernel/skas/clone.c
deleted file mode 100644
index b59fa43d68ce..000000000000
--- a/arch/um/kernel/skas/clone.c
+++ /dev/null
@@ -1,50 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (C) 2015 Thomas Meyer (thomas@m3y3r.de)
- * Copyright (C) 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
- */
-
-#include <signal.h>
-#include <sched.h>
-#include <asm/unistd.h>
-#include <sys/time.h>
-#include <as-layout.h>
-#include <ptrace_user.h>
-#include <stub-data.h>
-#include <sysdep/stub.h>
-
-/*
- * This is in a separate file because it needs to be compiled with any
- * extraneous gcc flags (-pg, -fprofile-arcs, -ftest-coverage) disabled
- *
- * Use UM_KERN_PAGE_SIZE instead of PAGE_SIZE because that calls getpagesize
- * on some systems.
- */
-
-void __attribute__ ((__section__ (".__syscall_stub")))
-stub_clone_handler(void)
-{
-	struct stub_data *data = get_stub_data();
-	long err;
-
-	/* syscall data as a temporary stack area (bottom half). */
-	err = stub_syscall2(__NR_clone, CLONE_PARENT | CLONE_FILES | SIGCHLD,
-			    (unsigned long) data->syscall_data +
-					    sizeof(data->syscall_data) / 2 -
-					    sizeof(void *));
-	if (err) {
-		data->err = err;
-		goto done;
-	}
-
-	err = stub_syscall4(__NR_ptrace, PTRACE_TRACEME, 0, 0, 0);
-	if (err) {
-		data->child_err = err;
-		goto done;
-	}
-
-	remap_stack_and_trap();
-
- done:
-	trap_myself();
-}
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index d6183cfd51fe..76c0c7d600a8 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -20,8 +20,7 @@ static_assert(sizeof(struct stub_data) == STUB_DATA_PAGES * UM_KERN_PAGE_SIZE);
 
 int init_new_context(struct task_struct *task, struct mm_struct *mm)
 {
- 	struct mm_context *from_mm = NULL;
-	struct mm_context *to_mm = &mm->context;
+	struct mm_id *new_id = &mm->context.id;
 	unsigned long stack = 0;
 	int ret = -ENOMEM;
 
@@ -29,27 +28,22 @@ int init_new_context(struct task_struct *task, struct mm_struct *mm)
 	if (stack == 0)
 		goto out;
 
-	to_mm->id.stack = stack;
-	if (current->mm != NULL && current->mm != &init_mm)
-		from_mm = &current->mm->context;
+	new_id->stack = stack;
 
 	block_signals_trace();
-	if (from_mm)
-		to_mm->id.u.pid = copy_context_skas0(stack,
-						     from_mm->id.u.pid);
-	else to_mm->id.u.pid = start_userspace(stack);
+	new_id->u.pid = start_userspace(stack);
 	unblock_signals_trace();
 
-	if (to_mm->id.u.pid < 0) {
-		ret = to_mm->id.u.pid;
+	if (new_id->u.pid < 0) {
+		ret = new_id->u.pid;
 		goto out_free;
 	}
 
 	return 0;
 
  out_free:
-	if (to_mm->id.stack != 0)
-		free_pages(to_mm->id.stack, ilog2(STUB_DATA_PAGES));
+	if (new_id->stack != 0)
+		free_pages(new_id->stack, ilog2(STUB_DATA_PAGES));
  out:
 	return ret;
 }
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 2bf03115817c..70c43e4a9cf0 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -460,114 +460,6 @@ void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 	}
 }
 
-static unsigned long thread_regs[MAX_REG_NR];
-static unsigned long thread_fp_regs[FP_SIZE];
-
-static int __init init_thread_regs(void)
-{
-	get_safe_registers(thread_regs, thread_fp_regs);
-	/* Set parent's instruction pointer to start of clone-stub */
-	thread_regs[REGS_IP_INDEX] = STUB_CODE +
-				(unsigned long) stub_clone_handler -
-				(unsigned long) __syscall_stub_start;
-
-	/* syscall data as a temporary stack area (top half). */
-	thread_regs[REGS_SP_INDEX] = STUB_DATA +
-				     offsetof(struct stub_data, syscall_data) +
-				     sizeof(((struct stub_data *) 0)->syscall_data) -
-				     sizeof(void *);
-	return 0;
-}
-
-__initcall(init_thread_regs);
-
-int copy_context_skas0(unsigned long new_stack, int pid)
-{
-	int err;
-	unsigned long current_stack = current_stub_stack();
-	struct stub_data *data = (struct stub_data *) current_stack;
-	struct stub_data *child_data = (struct stub_data *) new_stack;
-	unsigned long long new_offset;
-	int new_fd = phys_mapping(uml_to_phys((void *)new_stack), &new_offset);
-
-	/*
-	 * prepare offset and fd of child's stack as argument for parent's
-	 * and child's mmap2 calls
-	 */
-	*data = ((struct stub_data) {
-		.offset	= MMAP_OFFSET(new_offset),
-		.fd     = new_fd,
-		.err    = -ESRCH,
-		.child_err = 0,
-	});
-
-	*child_data = ((struct stub_data) {
-		.child_err = -ESRCH,
-	});
-
-	err = ptrace_setregs(pid, thread_regs);
-	if (err < 0) {
-		err = -errno;
-		printk(UM_KERN_ERR "%s : PTRACE_SETREGS failed, pid = %d, errno = %d\n",
-		      __func__, pid, -err);
-		return err;
-	}
-
-	err = put_fp_registers(pid, thread_fp_regs);
-	if (err < 0) {
-		printk(UM_KERN_ERR "%s : put_fp_registers failed, pid = %d, err = %d\n",
-		       __func__, pid, err);
-		return err;
-	}
-
-	/*
-	 * Wait, until parent has finished its work: read child's pid from
-	 * parent's stack, and check, if bad result.
-	 */
-	err = ptrace(PTRACE_CONT, pid, 0, 0);
-	if (err) {
-		err = -errno;
-		printk(UM_KERN_ERR "Failed to continue new process, pid = %d, errno = %d\n",
-		       pid, errno);
-		return err;
-	}
-
-	wait_stub_done(pid);
-
-	pid = data->err;
-	if (pid < 0) {
-		printk(UM_KERN_ERR "%s - stub-parent reports error %d\n",
-		      __func__, -pid);
-		return pid;
-	}
-
-	/*
-	 * Wait, until child has finished too: read child's result from
-	 * child's stack and check it.
-	 */
-	wait_stub_done(pid);
-	if (child_data->child_err != STUB_DATA) {
-		printk(UM_KERN_ERR "%s - stub-child %d reports error %ld\n",
-		       __func__, pid, data->child_err);
-		err = data->child_err;
-		goto out_kill;
-	}
-
-	if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
-		   (void *)PTRACE_O_TRACESYSGOOD) < 0) {
-		err = -errno;
-		printk(UM_KERN_ERR "%s : PTRACE_SETOPTIONS failed, errno = %d\n",
-		       __func__, errno);
-		goto out_kill;
-	}
-
-	return pid;
-
- out_kill:
-	os_kill_ptraced_process(pid, 1);
-	return err;
-}
-
 void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
 {
 	(*buf)[0].JB_IP = (unsigned long) handler;
-- 
2.44.0

* [PATCH 08/12] um: Delay flushing syscalls until the thread is restarted
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (6 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 07/12] um: remove copy_context_skas0 benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 09/12] um: Do not flush MM in flush_thread benjamin
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin@sipsolutions.net>

Running the syscalls is expensive due to the required context switches,
so we should do so as late as possible in case more syscalls need to be
queued later on. This will also benefit a later move to a SECCOMP-enabled
userspace, as in that case the need for extra context switches is removed
entirely.
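
As an illustrative sketch of the resulting pattern (abridged; the real
call sites are the userspace() loop and syscall_stub_flush() in the diff
below, and the variable names here are placeholders):

	/* Page table updates only queue stub syscalls, no context switch: */
	unmap(mm_id, addr, PAGE_SIZE);
	map(mm_id, addr, PAGE_SIZE, prot, fd, offset);

	/* One flush right before the thread is restarted runs the batch: */
	err = syscall_stub_flush(mm_id);
	if (err == -ENOMEM)
		report_enomem();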

Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
---
 arch/um/include/asm/tlbflush.h      |  2 ++
 arch/um/include/shared/os.h         |  4 +++
 arch/um/include/shared/skas/mm_id.h |  1 -
 arch/um/include/shared/skas/skas.h  |  1 +
 arch/um/kernel/skas/process.c       |  8 +++++
 arch/um/kernel/tlb.c                | 21 ++------------
 arch/um/os-Linux/skas/mem.c         | 45 +++++++++++++++++------------
 arch/um/os-Linux/skas/process.c     | 12 ++++++--
 8 files changed, 52 insertions(+), 42 deletions(-)

diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
index a5bda890390d..d7cf82023b74 100644
--- a/arch/um/include/asm/tlbflush.h
+++ b/arch/um/include/asm/tlbflush.h
@@ -28,4 +28,6 @@ extern void flush_tlb_kernel_vm(void);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 extern void __flush_tlb_one(unsigned long addr);
 
+void report_enomem(void);
+
 #endif
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index dc341ed4724e..ecc1273fd230 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -191,6 +191,9 @@ extern void get_host_cpu_features(
 /* mem.c */
 extern int create_mem_file(unsigned long long len);
 
+/* tlb.c */
+extern void report_enomem(void);
+
 /* process.c */
 extern unsigned long os_process_pc(int pid);
 extern int os_process_parent(int pid);
@@ -270,6 +273,7 @@ extern long long os_nsecs(void);
 /* skas/mem.c */
 int syscall_stub_flush(struct mm_id *mm_idp);
 struct stub_syscall *syscall_stub_alloc(struct mm_id *mm_idp);
+void syscall_stub_dump_error(struct mm_id *mm_idp);
 
 void map(struct mm_id *mm_idp, unsigned long virt,
 	 unsigned long len, int prot, int phys_fd,
diff --git a/arch/um/include/shared/skas/mm_id.h b/arch/um/include/shared/skas/mm_id.h
index bcb951719b51..10edf40f60e9 100644
--- a/arch/um/include/shared/skas/mm_id.h
+++ b/arch/um/include/shared/skas/mm_id.h
@@ -12,7 +12,6 @@ struct mm_id {
 		int pid;
 	} u;
 	unsigned long stack;
-	int kill;
 	int syscall_data_len;
 };
 
diff --git a/arch/um/include/shared/skas/skas.h b/arch/um/include/shared/skas/skas.h
index c93d2cbc8f32..5c78b0cc3dd4 100644
--- a/arch/um/include/shared/skas/skas.h
+++ b/arch/um/include/shared/skas/skas.h
@@ -15,5 +15,6 @@ extern void new_thread_handler(void);
 extern void handle_syscall(struct uml_pt_regs *regs);
 extern long execute_syscall_skas(void *r);
 extern unsigned long current_stub_stack(void);
+extern struct mm_id *current_mm_id(void);
 
 #endif
diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c
index f2ac134c9752..c7345c83e07b 100644
--- a/arch/um/kernel/skas/process.c
+++ b/arch/um/kernel/skas/process.c
@@ -53,3 +53,11 @@ unsigned long current_stub_stack(void)
 
 	return current->mm->context.id.stack;
 }
+
+struct mm_id *current_mm_id(void)
+{
+	if (current->mm == NULL)
+		return NULL;
+
+	return &current->mm->context.id;
+}
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 2a7b66b6e367..391c98137890 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -52,7 +52,7 @@ struct host_vm_change {
 	   .index	= 0, \
 	   .force	= force })
 
-static void report_enomem(void)
+void report_enomem(void)
 {
 	printk(KERN_ERR "UML ran out of memory on the host side! "
 			"This can happen due to a memory limitation or "
@@ -337,15 +337,6 @@ static void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 
 	if (!ret)
 		ret = do_ops(&hvc, hvc.index, 1);
-
-	/* This is not an else because ret is modified above */
-	if (ret) {
-		struct mm_id *mm_idp = &current->mm->context.id;
-
-		printk(KERN_ERR "fix_range_common: failed, killing current "
-		       "process: %d\n", task_tgid_vnr(current));
-		mm_idp->kill = 1;
-	}
 }
 
 static int flush_tlb_kernel_range_common(unsigned long start, unsigned long end)
@@ -460,7 +451,7 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
 	pmd_t *pmd;
 	pte_t *pte;
 	struct mm_struct *mm = vma->vm_mm;
-	int r, w, x, prot, err = 0;
+	int r, w, x, prot;
 	struct mm_id *mm_id;
 
 	address &= PAGE_MASK;
@@ -508,14 +499,6 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
 	} else if (pte_newprot(*pte))
 		protect(mm_id, address, PAGE_SIZE, prot);
 
-	err = syscall_stub_flush(mm_id);
-	if (err) {
-		if (err == -ENOMEM)
-			report_enomem();
-
-		goto kill;
-	}
-
 	*pte = pte_mkuptodate(*pte);
 
 	return;
diff --git a/arch/um/os-Linux/skas/mem.c b/arch/um/os-Linux/skas/mem.c
index e3128e080365..bb7eace4feac 100644
--- a/arch/um/os-Linux/skas/mem.c
+++ b/arch/um/os-Linux/skas/mem.c
@@ -23,6 +23,30 @@ extern char __syscall_stub_start[];
 
 extern void wait_stub_done(int pid);
 
+void syscall_stub_dump_error(struct mm_id *mm_idp)
+{
+	struct stub_data *proc_data = (void *)mm_idp->stack;
+	struct stub_syscall *sc;
+
+	if (proc_data->syscall_data_len < 0 ||
+	    proc_data->syscall_data_len >= ARRAY_SIZE(proc_data->syscall_data))
+		panic("Syscall data was corrupted by stub (len is: %d, expected maximum: %d)!",
+			proc_data->syscall_data_len,
+			mm_idp->syscall_data_len);
+
+	sc = &proc_data->syscall_data[proc_data->syscall_data_len];
+
+	printk(UM_KERN_ERR "%s : length = %d, last offset = %d",
+		__func__, mm_idp->syscall_data_len,
+		proc_data->syscall_data_len);
+	printk(UM_KERN_ERR "%s : stub syscall type %d failed, return value = 0x%lx\n",
+		__func__, sc->syscall, proc_data->err);
+
+	print_hex_dump(UM_KERN_ERR,
+			"    syscall data: ", 0,
+			16, 4, sc, sizeof(*sc), 0);
+}
+
 static inline unsigned long *check_init_stack(struct mm_id *mm_idp,
 					      unsigned long *stack)
 {
@@ -83,26 +107,9 @@ static inline long do_syscall_stub(struct mm_id *mm_idp)
 	 * otherwise it will be zero (but we do not need to rely on that).
 	 */
 	if (proc_data->err < 0) {
-		struct stub_syscall *sc;
-
-		if (proc_data->syscall_data_len < 0 ||
-		    proc_data->syscall_data_len >= ARRAY_SIZE(proc_data->syscall_data))
-			panic("Syscall data was corrupted by stub (len is: %d, expected maximum: %d)!",
-			      proc_data->syscall_data_len,
-			      mm_idp->syscall_data_len);
-
-		sc = &proc_data->syscall_data[proc_data->syscall_data_len];
-
-		printk(UM_KERN_ERR "%s : length = %d, last offset = %d",
-		       __func__, mm_idp->syscall_data_len,
-		       proc_data->syscall_data_len);
-		printk(UM_KERN_ERR "%s : stub syscall type %d failed, return value = 0x%lx\n",
-		       __func__, sc->syscall, proc_data->err);
-
-		print_hex_dump(UM_KERN_ERR,
-			       "    syscall data: ", 0,
-			       16, 4, sc, sizeof(*sc), 0);
+		syscall_stub_dump_error(mm_idp);
 
+		/* Store error code in case someone tries to add more syscalls */
 		mm_idp->syscall_data_len = proc_data->err;
 	} else {
 		mm_idp->syscall_data_len = 0;
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 70c43e4a9cf0..8dca42627f39 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -252,7 +252,6 @@ static int userspace_tramp(void *stack)
 }
 
 int userspace_pid[NR_CPUS];
-int kill_userspace_mm[NR_CPUS];
 
 /**
  * start_userspace() - prepare a new userspace process
@@ -344,8 +343,16 @@ void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 	interrupt_end();
 
 	while (1) {
-		if (kill_userspace_mm[0])
+		/* Flush out any pending syscalls */
+		err = syscall_stub_flush(current_mm_id());
+		if (err) {
+			if (err == -ENOMEM)
+				report_enomem();
+
+			printk(UM_KERN_ERR "%s - Error flushing stub syscalls: %d",
+				__func__, -err);
 			fatal_sigsegv();
+		}
 
 		/*
 		 * This can legitimately fail if the process loads a
@@ -576,5 +583,4 @@ void reboot_skas(void)
 void __switch_mm(struct mm_id *mm_idp)
 {
 	userspace_pid[0] = mm_idp->u.pid;
-	kill_userspace_mm[0] = mm_idp->kill;
 }
-- 
2.44.0

* [PATCH 09/12] um: Do not flush MM in flush_thread
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (7 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 08/12] um: Delay flushing syscalls until the thread is restarted benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 10/12] um: remove force_flush_all from fork_handler benjamin
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

There should be no need to flush the memory in flush_thread. Doing this
likely worked around some issue where memory was still incorrectly
mapped when creating or cloning an MM.

With the removal of the special clone path, that isn't relevant anymore.
However, add the flush into MM initialization so that any new userspace
MM is guaranteed to be clean.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/kernel/exec.c     | 5 -----
 arch/um/kernel/skas/mmu.c | 3 +++
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/um/kernel/exec.c b/arch/um/kernel/exec.c
index 5c8836b012e9..2c15bb2c104c 100644
--- a/arch/um/kernel/exec.c
+++ b/arch/um/kernel/exec.c
@@ -24,11 +24,6 @@ void flush_thread(void)
 {
 	arch_flush_thread(&current->thread.arch);
 
-	unmap(&current->mm->context.id, 0, TASK_SIZE);
-	if (syscall_stub_flush(&current->mm->context.id) < 0) {
-		printk(KERN_ERR "%s - clearing address space failed", __func__);
-		force_sig(SIGKILL);
-	}
 	get_safe_registers(current_pt_regs()->regs.gp,
 			   current_pt_regs()->regs.fp);
 
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 76c0c7d600a8..9bfefcd33f36 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -39,6 +39,9 @@ int init_new_context(struct task_struct *task, struct mm_struct *mm)
 		goto out_free;
 	}
 
+	/* Ensure the new MM is clean and nothing unwanted is mapped */
+	unmap(new_id, 0, TASK_SIZE);
+
 	return 0;
 
  out_free:
-- 
2.44.0

* [PATCH 10/12] um: remove force_flush_all from fork_handler
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (8 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 09/12] um: Do not flush MM in flush_thread benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 11/12] um: simplify and consolidate TLB updates benjamin
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

There should be no need for this. It may be that this used to work
around another issue where the MM was in a bad state after a clone.

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/include/asm/mmu_context.h |  2 --
 arch/um/kernel/process.c          |  2 --
 arch/um/kernel/tlb.c              | 46 +++++++++++--------------------
 3 files changed, 16 insertions(+), 34 deletions(-)

diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 68e2eb9cfb47..23dcc914d44e 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -13,8 +13,6 @@
 #include <asm/mm_hooks.h>
 #include <asm/mmu.h>
 
-extern void force_flush_all(void);
-
 #define activate_mm activate_mm
 static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
 {
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index ab95648e93e1..390bf711fbd1 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -139,8 +139,6 @@ void new_thread_handler(void)
 /* Called magically, see new_thread_handler above */
 void fork_handler(void)
 {
-	force_flush_all();
-
 	schedule_tail(current->thread.prev_sched);
 
 	/*
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index 391c98137890..f183a9b9ff7b 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -40,17 +40,15 @@ struct host_vm_change {
 	int index;
 	struct mm_struct *mm;
 	void *data;
-	int force;
 };
 
-#define INIT_HVC(mm, force, userspace) \
+#define INIT_HVC(mm, userspace) \
 	((struct host_vm_change) \
 	 { .ops		= { { .type = NONE } },	\
 	   .mm		= mm, \
        	   .data	= NULL, \
 	   .userspace	= userspace, \
-	   .index	= 0, \
-	   .force	= force })
+	   .index	= 0 })
 
 void report_enomem(void)
 {
@@ -234,7 +232,7 @@ static inline int update_pte_range(pmd_t *pmd, unsigned long addr,
 
 		prot = ((r ? UM_PROT_READ : 0) | (w ? UM_PROT_WRITE : 0) |
 			(x ? UM_PROT_EXEC : 0));
-		if (hvc->force || pte_newpage(*pte)) {
+		if (pte_newpage(*pte)) {
 			if (pte_present(*pte)) {
 				if (pte_newpage(*pte))
 					ret = add_mmap(addr, pte_val(*pte) & PAGE_MASK,
@@ -260,7 +258,7 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 	do {
 		next = pmd_addr_end(addr, end);
 		if (!pmd_present(*pmd)) {
-			if (hvc->force || pmd_newpage(*pmd)) {
+			if (pmd_newpage(*pmd)) {
 				ret = add_munmap(addr, next - addr, hvc);
 				pmd_mkuptodate(*pmd);
 			}
@@ -282,7 +280,7 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 	do {
 		next = pud_addr_end(addr, end);
 		if (!pud_present(*pud)) {
-			if (hvc->force || pud_newpage(*pud)) {
+			if (pud_newpage(*pud)) {
 				ret = add_munmap(addr, next - addr, hvc);
 				pud_mkuptodate(*pud);
 			}
@@ -304,7 +302,7 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 	do {
 		next = p4d_addr_end(addr, end);
 		if (!p4d_present(*p4d)) {
-			if (hvc->force || p4d_newpage(*p4d)) {
+			if (p4d_newpage(*p4d)) {
 				ret = add_munmap(addr, next - addr, hvc);
 				p4d_mkuptodate(*p4d);
 			}
@@ -315,19 +313,19 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 }
 
 static void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
-			     unsigned long end_addr, int force)
+			     unsigned long end_addr)
 {
 	pgd_t *pgd;
 	struct host_vm_change hvc;
 	unsigned long addr = start_addr, next;
 	int ret = 0, userspace = 1;
 
-	hvc = INIT_HVC(mm, force, userspace);
+	hvc = INIT_HVC(mm, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end_addr);
 		if (!pgd_present(*pgd)) {
-			if (force || pgd_newpage(*pgd)) {
+			if (pgd_newpage(*pgd)) {
 				ret = add_munmap(addr, next - addr, &hvc);
 				pgd_mkuptodate(*pgd);
 			}
@@ -348,11 +346,11 @@ static int flush_tlb_kernel_range_common(unsigned long start, unsigned long end)
 	pmd_t *pmd;
 	pte_t *pte;
 	unsigned long addr, last;
-	int updated = 0, err = 0, force = 0, userspace = 0;
+	int updated = 0, err = 0,  userspace = 0;
 	struct host_vm_change hvc;
 
 	mm = &init_mm;
-	hvc = INIT_HVC(mm, force, userspace);
+	hvc = INIT_HVC(mm, userspace);
 	for (addr = start; addr < end;) {
 		pgd = pgd_offset(mm, addr);
 		if (!pgd_present(*pgd)) {
@@ -536,7 +534,7 @@ void __flush_tlb_one(unsigned long addr)
 }
 
 static void fix_range(struct mm_struct *mm, unsigned long start_addr,
-		      unsigned long end_addr, int force)
+		      unsigned long end_addr)
 {
 	/*
 	 * Don't bother flushing if this address space is about to be
@@ -545,7 +543,7 @@ static void fix_range(struct mm_struct *mm, unsigned long start_addr,
 	if (atomic_read(&mm->mm_users) == 0)
 		return;
 
-	fix_range_common(mm, start_addr, end_addr, force);
+	fix_range_common(mm, start_addr, end_addr);
 }
 
 void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
@@ -553,14 +551,14 @@ void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 {
 	if (vma->vm_mm == NULL)
 		flush_tlb_kernel_range_common(start, end);
-	else fix_range(vma->vm_mm, start, end, 0);
+	else fix_range(vma->vm_mm, start, end);
 }
 EXPORT_SYMBOL(flush_tlb_range);
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 			unsigned long end)
 {
-	fix_range(mm, start, end, 0);
+	fix_range(mm, start, end);
 }
 
 void flush_tlb_mm(struct mm_struct *mm)
@@ -569,17 +567,5 @@ void flush_tlb_mm(struct mm_struct *mm)
 	VMA_ITERATOR(vmi, mm, 0);
 
 	for_each_vma(vmi, vma)
-		fix_range(mm, vma->vm_start, vma->vm_end, 0);
-}
-
-void force_flush_all(void)
-{
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma;
-	VMA_ITERATOR(vmi, mm, 0);
-
-	mmap_read_lock(mm);
-	for_each_vma(vmi, vma)
-		fix_range(mm, vma->vm_start, vma->vm_end, 1);
-	mmap_read_unlock(mm);
+		fix_range(mm, vma->vm_start, vma->vm_end);
 }
-- 
2.44.0

* [PATCH 11/12] um: simplify and consolidate TLB updates
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (9 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 10/12] um: remove force_flush_all from fork_handler benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-18  9:23 ` [PATCH 12/12] um: refactor TLB update handling benjamin
  2024-04-22  2:35 ` [PATCH 00/12] Rework stub syscall and page table handling Tiwei Bie
  12 siblings, 0 replies; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

The HVC update was mostly used to compress consecutive calls into one.
This is primarily relevant for userspace, where it is already handled by
the syscall stub code.

Simplify the whole logic and consolidate it for both kernel and
userspace. This does remove the sequential syscall compression for the
kernel; however, that should not be a significant factor in most runs.
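
For reference, the merging the stub code already does for userspace looks
roughly like this (lifted in simplified form from unmap() and
syscall_stub_get_previous() in the diff below):

	/* If the previous queued munmap ends exactly at addr, grow it: */
	sc = syscall_stub_get_previous(mm_idp, STUB_SYSCALL_MUNMAP, addr);
	if (sc) {
		sc->mem.length += len;
		return 0;
	}

	/* Otherwise queue a new stub syscall: */
	sc = syscall_stub_alloc(mm_idp);
	sc->syscall = STUB_SYSCALL_MUNMAP;
	sc->mem.addr = addr;
	sc->mem.length = len;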

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/include/shared/os.h |  12 +-
 arch/um/kernel/tlb.c        | 386 ++++++++----------------------------
 arch/um/os-Linux/skas/mem.c |  18 +-
 3 files changed, 99 insertions(+), 317 deletions(-)

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index ecc1273fd230..01af239dcc01 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -275,12 +275,12 @@ int syscall_stub_flush(struct mm_id *mm_idp);
 struct stub_syscall *syscall_stub_alloc(struct mm_id *mm_idp);
 void syscall_stub_dump_error(struct mm_id *mm_idp);
 
-void map(struct mm_id *mm_idp, unsigned long virt,
-	 unsigned long len, int prot, int phys_fd,
-	 unsigned long long offset);
-void unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len);
-void protect(struct mm_id *mm_idp, unsigned long addr,
-	     unsigned long len, unsigned int prot);
+int map(struct mm_id *mm_idp, unsigned long virt,
+	unsigned long len, int prot, int phys_fd,
+	unsigned long long offset);
+int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len);
+int protect(struct mm_id *mm_idp, unsigned long addr,
+	    unsigned long len, unsigned int prot);
 
 /* skas/process.c */
 extern int is_skas_winch(int pid, int fd, void *data);
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index f183a9b9ff7b..c137ff6f84dd 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -14,207 +14,54 @@
 #include <skas.h>
 #include <kern_util.h>
 
-struct host_vm_change {
-	struct host_vm_op {
-		enum { NONE, MMAP, MUNMAP, MPROTECT } type;
-		union {
-			struct {
-				unsigned long addr;
-				unsigned long len;
-				unsigned int prot;
-				int fd;
-				__u64 offset;
-			} mmap;
-			struct {
-				unsigned long addr;
-				unsigned long len;
-			} munmap;
-			struct {
-				unsigned long addr;
-				unsigned long len;
-				unsigned int prot;
-			} mprotect;
-		} u;
-	} ops[1];
-	int userspace;
-	int index;
-	struct mm_struct *mm;
-	void *data;
+struct vm_ops {
+	struct mm_id *mm_idp;
+
+	int (*mmap)(struct mm_id *mm_idp,
+		    unsigned long virt, unsigned long len, int prot,
+		    int phys_fd, unsigned long long offset);
+	int (*unmap)(struct mm_id *mm_idp,
+		     unsigned long virt, unsigned long len);
+	int (*mprotect)(struct mm_id *mm_idp,
+			unsigned long virt, unsigned long len,
+			unsigned int prot);
 };
 
-#define INIT_HVC(mm, userspace) \
-	((struct host_vm_change) \
-	 { .ops		= { { .type = NONE } },	\
-	   .mm		= mm, \
-       	   .data	= NULL, \
-	   .userspace	= userspace, \
-	   .index	= 0 })
-
-void report_enomem(void)
+static int kern_map(struct mm_id *mm_idp,
+		    unsigned long virt, unsigned long len, int prot,
+		    int phys_fd, unsigned long long offset)
 {
-	printk(KERN_ERR "UML ran out of memory on the host side! "
-			"This can happen due to a memory limitation or "
-			"vm.max_map_count has been reached.\n");
+	/* TODO: Why does executable always need to be set in the kernel? */
+	return os_map_memory((void *)virt, phys_fd, offset, len,
+			     prot & UM_PROT_READ, prot & UM_PROT_WRITE,
+			     1);
 }
 
-static int do_ops(struct host_vm_change *hvc, int end,
-		  int finished)
+static int kern_unmap(struct mm_id *mm_idp,
+		      unsigned long virt, unsigned long len)
 {
-	struct host_vm_op *op;
-	int i, ret = 0;
-
-	for (i = 0; i < end && !ret; i++) {
-		op = &hvc->ops[i];
-		switch (op->type) {
-		case MMAP:
-			if (hvc->userspace)
-				map(&hvc->mm->context.id, op->u.mmap.addr,
-				    op->u.mmap.len, op->u.mmap.prot,
-				    op->u.mmap.fd,
-				    op->u.mmap.offset);
-			else
-				map_memory(op->u.mmap.addr, op->u.mmap.offset,
-					   op->u.mmap.len, 1, 1, 1);
-			break;
-		case MUNMAP:
-			if (hvc->userspace)
-				unmap(&hvc->mm->context.id,
-				      op->u.munmap.addr,
-				      op->u.munmap.len);
-			else
-				ret = os_unmap_memory(
-					(void *) op->u.munmap.addr,
-						      op->u.munmap.len);
-
-			break;
-		case MPROTECT:
-			if (hvc->userspace)
-				protect(&hvc->mm->context.id,
-					op->u.mprotect.addr,
-					op->u.mprotect.len,
-					op->u.mprotect.prot);
-			else
-				ret = os_protect_memory(
-					(void *) op->u.mprotect.addr,
-							op->u.mprotect.len,
-							1, 1, 1);
-			break;
-		default:
-			printk(KERN_ERR "Unknown op type %d in do_ops\n",
-			       op->type);
-			BUG();
-			break;
-		}
-	}
-
-	if (hvc->userspace && finished)
-		ret = syscall_stub_flush(&hvc->mm->context.id);
-
-	if (ret == -ENOMEM)
-		report_enomem();
-
-	return ret;
+	return os_unmap_memory((void *)virt, len);
 }
 
-static int add_mmap(unsigned long virt, unsigned long phys, unsigned long len,
-		    unsigned int prot, struct host_vm_change *hvc)
+static int kern_mprotect(struct mm_id *mm_idp,
+			 unsigned long virt, unsigned long len,
+			 unsigned int prot)
 {
-	__u64 offset;
-	struct host_vm_op *last;
-	int fd = -1, ret = 0;
-
-	if (hvc->userspace)
-		fd = phys_mapping(phys, &offset);
-	else
-		offset = phys;
-	if (hvc->index != 0) {
-		last = &hvc->ops[hvc->index - 1];
-		if ((last->type == MMAP) &&
-		   (last->u.mmap.addr + last->u.mmap.len == virt) &&
-		   (last->u.mmap.prot == prot) && (last->u.mmap.fd == fd) &&
-		   (last->u.mmap.offset + last->u.mmap.len == offset)) {
-			last->u.mmap.len += len;
-			return 0;
-		}
-	}
-
-	if (hvc->index == ARRAY_SIZE(hvc->ops)) {
-		ret = do_ops(hvc, ARRAY_SIZE(hvc->ops), 0);
-		hvc->index = 0;
-	}
-
-	hvc->ops[hvc->index++] = ((struct host_vm_op)
-				  { .type	= MMAP,
-				    .u = { .mmap = { .addr	= virt,
-						     .len	= len,
-						     .prot	= prot,
-						     .fd	= fd,
-						     .offset	= offset }
-			   } });
-	return ret;
+	return os_protect_memory((void *)virt, len,
+				 prot & UM_PROT_READ, prot & UM_PROT_WRITE,
+				 1);
 }
 
-static int add_munmap(unsigned long addr, unsigned long len,
-		      struct host_vm_change *hvc)
-{
-	struct host_vm_op *last;
-	int ret = 0;
-
-	if (hvc->index != 0) {
-		last = &hvc->ops[hvc->index - 1];
-		if ((last->type == MUNMAP) &&
-		   (last->u.munmap.addr + last->u.mmap.len == addr)) {
-			last->u.munmap.len += len;
-			return 0;
-		}
-	}
-
-	if (hvc->index == ARRAY_SIZE(hvc->ops)) {
-		ret = do_ops(hvc, ARRAY_SIZE(hvc->ops), 0);
-		hvc->index = 0;
-	}
-
-	hvc->ops[hvc->index++] = ((struct host_vm_op)
-				  { .type	= MUNMAP,
-			     	    .u = { .munmap = { .addr	= addr,
-						       .len	= len } } });
-	return ret;
-}
-
-static int add_mprotect(unsigned long addr, unsigned long len,
-			unsigned int prot, struct host_vm_change *hvc)
+void report_enomem(void)
 {
-	struct host_vm_op *last;
-	int ret = 0;
-
-	if (hvc->index != 0) {
-		last = &hvc->ops[hvc->index - 1];
-		if ((last->type == MPROTECT) &&
-		   (last->u.mprotect.addr + last->u.mprotect.len == addr) &&
-		   (last->u.mprotect.prot == prot)) {
-			last->u.mprotect.len += len;
-			return 0;
-		}
-	}
-
-	if (hvc->index == ARRAY_SIZE(hvc->ops)) {
-		ret = do_ops(hvc, ARRAY_SIZE(hvc->ops), 0);
-		hvc->index = 0;
-	}
-
-	hvc->ops[hvc->index++] = ((struct host_vm_op)
-				  { .type	= MPROTECT,
-			     	    .u = { .mprotect = { .addr	= addr,
-							 .len	= len,
-							 .prot	= prot } } });
-	return ret;
+	printk(KERN_ERR "UML ran out of memory on the host side! "
+			"This can happen due to a memory limitation or "
+			"vm.max_map_count has been reached.\n");
 }
 
-#define ADD_ROUND(n, inc) (((n) + (inc)) & ~((inc) - 1))
-
 static inline int update_pte_range(pmd_t *pmd, unsigned long addr,
 				   unsigned long end,
-				   struct host_vm_change *hvc)
+				   struct vm_ops *ops)
 {
 	pte_t *pte;
 	int r, w, x, prot, ret = 0;
@@ -234,13 +81,20 @@ static inline int update_pte_range(pmd_t *pmd, unsigned long addr,
 			(x ? UM_PROT_EXEC : 0));
 		if (pte_newpage(*pte)) {
 			if (pte_present(*pte)) {
-				if (pte_newpage(*pte))
-					ret = add_mmap(addr, pte_val(*pte) & PAGE_MASK,
-						       PAGE_SIZE, prot, hvc);
+				if (pte_newpage(*pte)) {
+					__u64 offset;
+					unsigned long phys =
+						pte_val(*pte) & PAGE_MASK;
+					int fd =  phys_mapping(phys, &offset);
+
+					ret = ops->mmap(ops->mm_idp, addr,
+							PAGE_SIZE, prot, fd,
+							offset);
+				}
 			} else
-				ret = add_munmap(addr, PAGE_SIZE, hvc);
+				ret = ops->unmap(ops->mm_idp, addr, PAGE_SIZE);
 		} else if (pte_newprot(*pte))
-			ret = add_mprotect(addr, PAGE_SIZE, prot, hvc);
+			ret = ops->mprotect(ops->mm_idp, addr, PAGE_SIZE, prot);
 		*pte = pte_mkuptodate(*pte);
 	} while (pte++, addr += PAGE_SIZE, ((addr < end) && !ret));
 	return ret;
@@ -248,7 +102,7 @@ static inline int update_pte_range(pmd_t *pmd, unsigned long addr,
 
 static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 				   unsigned long end,
-				   struct host_vm_change *hvc)
+				   struct vm_ops *ops)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -259,18 +113,19 @@ static inline int update_pmd_range(pud_t *pud, unsigned long addr,
 		next = pmd_addr_end(addr, end);
 		if (!pmd_present(*pmd)) {
 			if (pmd_newpage(*pmd)) {
-				ret = add_munmap(addr, next - addr, hvc);
+				ret = ops->unmap(ops->mm_idp, addr,
+						 next - addr);
 				pmd_mkuptodate(*pmd);
 			}
 		}
-		else ret = update_pte_range(pmd, addr, next, hvc);
+		else ret = update_pte_range(pmd, addr, next, ops);
 	} while (pmd++, addr = next, ((addr < end) && !ret));
 	return ret;
 }
 
 static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 				   unsigned long end,
-				   struct host_vm_change *hvc)
+				   struct vm_ops *ops)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -281,18 +136,19 @@ static inline int update_pud_range(p4d_t *p4d, unsigned long addr,
 		next = pud_addr_end(addr, end);
 		if (!pud_present(*pud)) {
 			if (pud_newpage(*pud)) {
-				ret = add_munmap(addr, next - addr, hvc);
+				ret = ops->unmap(ops->mm_idp, addr,
+						 next - addr);
 				pud_mkuptodate(*pud);
 			}
 		}
-		else ret = update_pmd_range(pud, addr, next, hvc);
+		else ret = update_pmd_range(pud, addr, next, ops);
 	} while (pud++, addr = next, ((addr < end) && !ret));
 	return ret;
 }
 
 static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 				   unsigned long end,
-				   struct host_vm_change *hvc)
+				   struct vm_ops *ops)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -303,142 +159,62 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 		next = p4d_addr_end(addr, end);
 		if (!p4d_present(*p4d)) {
 			if (p4d_newpage(*p4d)) {
-				ret = add_munmap(addr, next - addr, hvc);
+				ret = ops->unmap(ops->mm_idp, addr,
+						 next - addr);
 				p4d_mkuptodate(*p4d);
 			}
 		} else
-			ret = update_pud_range(p4d, addr, next, hvc);
+			ret = update_pud_range(p4d, addr, next, ops);
 	} while (p4d++, addr = next, ((addr < end) && !ret));
 	return ret;
 }
 
-static void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
+static int fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 			     unsigned long end_addr)
 {
 	pgd_t *pgd;
-	struct host_vm_change hvc;
+	struct vm_ops ops;
 	unsigned long addr = start_addr, next;
-	int ret = 0, userspace = 1;
+	int ret = 0;
+
+	ops.mm_idp = &mm->context.id;
+	if (mm == &init_mm) {
+		ops.mmap = kern_map;
+		ops.unmap = kern_unmap;
+		ops.mprotect = kern_mprotect;
+	} else {
+		ops.mmap = map;
+		ops.unmap = unmap;
+		ops.mprotect = protect;
+	}
 
-	hvc = INIT_HVC(mm, userspace);
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end_addr);
 		if (!pgd_present(*pgd)) {
 			if (pgd_newpage(*pgd)) {
-				ret = add_munmap(addr, next - addr, &hvc);
+				ret = ops.unmap(ops.mm_idp, addr,
+						next - addr);
 				pgd_mkuptodate(*pgd);
 			}
 		} else
-			ret = update_p4d_range(pgd, addr, next, &hvc);
+			ret = update_p4d_range(pgd, addr, next, &ops);
 	} while (pgd++, addr = next, ((addr < end_addr) && !ret));
 
-	if (!ret)
-		ret = do_ops(&hvc, hvc.index, 1);
+	if (ret == -ENOMEM)
+		report_enomem();
+
+	return ret;
 }
 
-static int flush_tlb_kernel_range_common(unsigned long start, unsigned long end)
+static void flush_tlb_kernel_range_common(unsigned long start, unsigned long end)
 {
-	struct mm_struct *mm;
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	unsigned long addr, last;
-	int updated = 0, err = 0,  userspace = 0;
-	struct host_vm_change hvc;
-
-	mm = &init_mm;
-	hvc = INIT_HVC(mm, userspace);
-	for (addr = start; addr < end;) {
-		pgd = pgd_offset(mm, addr);
-		if (!pgd_present(*pgd)) {
-			last = ADD_ROUND(addr, PGDIR_SIZE);
-			if (last > end)
-				last = end;
-			if (pgd_newpage(*pgd)) {
-				updated = 1;
-				err = add_munmap(addr, last - addr, &hvc);
-				if (err < 0)
-					panic("munmap failed, errno = %d\n",
-					      -err);
-			}
-			addr = last;
-			continue;
-		}
-
-		p4d = p4d_offset(pgd, addr);
-		if (!p4d_present(*p4d)) {
-			last = ADD_ROUND(addr, P4D_SIZE);
-			if (last > end)
-				last = end;
-			if (p4d_newpage(*p4d)) {
-				updated = 1;
-				err = add_munmap(addr, last - addr, &hvc);
-				if (err < 0)
-					panic("munmap failed, errno = %d\n",
-					      -err);
-			}
-			addr = last;
-			continue;
-		}
+	int err;
 
-		pud = pud_offset(p4d, addr);
-		if (!pud_present(*pud)) {
-			last = ADD_ROUND(addr, PUD_SIZE);
-			if (last > end)
-				last = end;
-			if (pud_newpage(*pud)) {
-				updated = 1;
-				err = add_munmap(addr, last - addr, &hvc);
-				if (err < 0)
-					panic("munmap failed, errno = %d\n",
-					      -err);
-			}
-			addr = last;
-			continue;
-		}
-
-		pmd = pmd_offset(pud, addr);
-		if (!pmd_present(*pmd)) {
-			last = ADD_ROUND(addr, PMD_SIZE);
-			if (last > end)
-				last = end;
-			if (pmd_newpage(*pmd)) {
-				updated = 1;
-				err = add_munmap(addr, last - addr, &hvc);
-				if (err < 0)
-					panic("munmap failed, errno = %d\n",
-					      -err);
-			}
-			addr = last;
-			continue;
-		}
-
-		pte = pte_offset_kernel(pmd, addr);
-		if (!pte_present(*pte) || pte_newpage(*pte)) {
-			updated = 1;
-			err = add_munmap(addr, PAGE_SIZE, &hvc);
-			if (err < 0)
-				panic("munmap failed, errno = %d\n",
-				      -err);
-			if (pte_present(*pte))
-				err = add_mmap(addr, pte_val(*pte) & PAGE_MASK,
-					       PAGE_SIZE, 0, &hvc);
-		}
-		else if (pte_newprot(*pte)) {
-			updated = 1;
-			err = add_mprotect(addr, PAGE_SIZE, 0, &hvc);
-		}
-		addr += PAGE_SIZE;
-	}
-	if (!err)
-		err = do_ops(&hvc, hvc.index, 1);
+	err = fix_range_common(&init_mm, start, end);
 
-	if (err < 0)
+	if (err)
 		panic("flush_tlb_kernel failed, errno = %d\n", err);
-	return updated;
 }
 
 void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
diff --git a/arch/um/os-Linux/skas/mem.c b/arch/um/os-Linux/skas/mem.c
index bb7eace4feac..5e7e218073d8 100644
--- a/arch/um/os-Linux/skas/mem.c
+++ b/arch/um/os-Linux/skas/mem.c
@@ -177,7 +177,7 @@ static struct stub_syscall *syscall_stub_get_previous(struct mm_id *mm_idp,
 	return NULL;
 }
 
-void map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
+int map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
 	int phys_fd, unsigned long long offset)
 {
 	struct stub_syscall *sc;
@@ -187,7 +187,7 @@ void map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
 	if (sc && sc->mem.prot == prot && sc->mem.fd == phys_fd &&
 	    sc->mem.offset == MMAP_OFFSET(offset - sc->mem.length)) {
 		sc->mem.length += len;
-		return;
+		return 0;
 	}
 
 	sc = syscall_stub_alloc(mm_idp);
@@ -197,9 +197,11 @@ void map(struct mm_id *mm_idp, unsigned long virt, unsigned long len, int prot,
 	sc->mem.prot = prot;
 	sc->mem.fd = phys_fd;
 	sc->mem.offset = MMAP_OFFSET(offset);
+
+	return 0;
 }
 
-void unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
+int unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
 {
 	struct stub_syscall *sc;
 
@@ -207,16 +209,18 @@ void unmap(struct mm_id *mm_idp, unsigned long addr, unsigned long len)
 	sc = syscall_stub_get_previous(mm_idp, STUB_SYSCALL_MUNMAP, addr);
 	if (sc) {
 		sc->mem.length += len;
-		return;
+		return 0;
 	}
 
 	sc = syscall_stub_alloc(mm_idp);
 	sc->syscall = STUB_SYSCALL_MUNMAP;
 	sc->mem.addr = addr;
 	sc->mem.length = len;
+
+	return 0;
 }
 
-void protect(struct mm_id *mm_idp, unsigned long addr, unsigned long len,
+int protect(struct mm_id *mm_idp, unsigned long addr, unsigned long len,
 	    unsigned int prot)
 {
 	struct stub_syscall *sc;
@@ -225,7 +229,7 @@ void protect(struct mm_id *mm_idp, unsigned long addr, unsigned long len,
 	sc = syscall_stub_get_previous(mm_idp, STUB_SYSCALL_MPROTECT, addr);
 	if (sc && sc->mem.prot == prot) {
 		sc->mem.length += len;
-		return;
+		return 0;
 	}
 
 	sc = syscall_stub_alloc(mm_idp);
@@ -233,4 +237,6 @@ void protect(struct mm_id *mm_idp, unsigned long addr, unsigned long len,
 	sc->mem.addr = addr;
 	sc->mem.length = len;
 	sc->mem.prot = prot;
+
+	return 0;
 }
-- 
2.44.0

* [PATCH 12/12] um: refactor TLB update handling
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (10 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 11/12] um: simplify and consolidate TLB updates benjamin
@ 2024-04-18  9:23 ` benjamin
  2024-04-22  2:51   ` Tiwei Bie
  2024-04-22  2:35 ` [PATCH 00/12] Rework stub syscall and page table handling Tiwei Bie
  12 siblings, 1 reply; 19+ messages in thread
From: benjamin @ 2024-04-18  9:23 UTC (permalink / raw)
  To: linux-um; +Cc: Benjamin Berg

From: Benjamin Berg <benjamin.berg@intel.com>

Conceptually, we want the memory mappings to always be up to date and
to represent whatever is in the TLB. To ensure that, we need to sync
them over in the userspace case, and for the kernel we need to process
the mappings.

The kernel will call flush_tlb_* if page table entries that were valid
before become invalid. Unfortunately, this is not the case if entries
are added.

As such, change both flush_tlb_* and set_ptes to track the memory range
that has to be synchronized. For the kernel, we need to execute a
flush_tlb_kern_* immediately, but in the set_ptes case we can wait for
the first page fault. For userspace, in contrast, we only record that a
range of memory needs to be synced and perform the sync whenever we
switch to that process.
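
A rough sketch of the resulting lifecycle (abridged; the real entry
points are um_tlb_mark_sync(), flush_tlb_range() and um_tlb_sync() in
the diff below):

	/* Producers only widen a single pending [from, to) window: */
	set_ptes(mm, addr, ptep, pte, nr);  /* marks [addr, addr + nr * PAGE_SIZE) */
	flush_tlb_range(vma, start, end);   /* marks [start, end), issues nothing yet */

	/* The consumer walks the window once and resets it, when switching
	 * to the process (or immediately for kernel ranges): */
	um_tlb_sync(mm);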

Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
---
 arch/um/include/asm/mmu.h          |   4 +
 arch/um/include/asm/pgtable.h      |  32 +++++++
 arch/um/include/asm/tlbflush.h     |  44 ++++++++--
 arch/um/include/shared/skas/skas.h |   1 +
 arch/um/kernel/skas/process.c      |  10 +++
 arch/um/kernel/tlb.c               | 136 +++--------------------------
 arch/um/kernel/trap.c              |  15 +++-
 arch/um/os-Linux/skas/process.c    |   2 +
 8 files changed, 108 insertions(+), 136 deletions(-)

diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 37eb6e89e79a..bf8da736609c 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -10,6 +10,10 @@
 
 typedef struct mm_context {
 	struct mm_id id;
+
+	/* Address range in need of a TLB sync */
+	long int sync_tlb_range_from;
+	long int sync_tlb_range_to;
 } mm_context_t;
 
 extern void __switch_mm(struct mm_id * mm_idp);
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index e1ece21dbe3f..5bb397b65efb 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -244,6 +244,38 @@ static inline void set_pte(pte_t *pteptr, pte_t pteval)
 
 #define PFN_PTE_SHIFT		PAGE_SHIFT
 
+static inline void um_tlb_mark_sync(struct mm_struct *mm, unsigned long start,
+				    unsigned long end)
+{
+	if (!mm->context.sync_tlb_range_to) {
+		mm->context.sync_tlb_range_from = start;
+		mm->context.sync_tlb_range_to = end;
+	} else {
+		if (start < mm->context.sync_tlb_range_from)
+			mm->context.sync_tlb_range_from = start;
+		if (end > mm->context.sync_tlb_range_to)
+			mm->context.sync_tlb_range_to = end;
+	}
+}
+
+#define set_ptes set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+			    pte_t *ptep, pte_t pte, int nr)
+{
+	/* Basically the default implementation */
+	size_t length = nr * PAGE_SIZE;
+
+	for (;;) {
+		set_pte(ptep, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		pte = __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
+	}
+
+	um_tlb_mark_sync(mm, addr, addr + length);
+}
+
 #define __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
index d7cf82023b74..62816f6f1c91 100644
--- a/arch/um/include/asm/tlbflush.h
+++ b/arch/um/include/asm/tlbflush.h
@@ -9,24 +9,50 @@
 #include <linux/mm.h>
 
 /*
- * TLB flushing:
+ * In UML, we need to sync the TLB over by using mmap/munmap/mprotect syscalls
+ * from the process handling the MM (which can be the kernel itself).
+ *
+ * To track updates, we can hook into set_ptes and flush_tlb_*. With set_ptes
+ * we catch all PTE transitions where memory that was unusable becomes usable.
+ * While with flush_tlb_* we can track any memory that becomes unusable and
+ * even if a higher layer of the page table was modified.
+ *
+ * So, we simply track updates using both methods and mark the memory area to
+ * be synced later on. The only special case is that flush_tlb_kern_* needs to
+ * be executed immediately as there is no good synchronization point in that
+ * case. In contrast, in the set_ptes case we can wait for the next kernel
+ * segfault before we do the synchronization.
  *
- *  - flush_tlb() flushes the current mm struct TLBs
  *  - flush_tlb_all() flushes all processes TLBs
  *  - flush_tlb_mm(mm) flushes the specified mm context TLB's
  *  - flush_tlb_page(vma, vmaddr) flushes one page
- *  - flush_tlb_kernel_vm() flushes the kernel vm area
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
+ *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  */
 
+extern int um_tlb_sync(struct mm_struct *mm);
+
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm(struct mm_struct *mm);
-extern void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, 
-			    unsigned long end);
-extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);
-extern void flush_tlb_kernel_vm(void);
-extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
-extern void __flush_tlb_one(unsigned long addr);
+
+static void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
+{
+	um_tlb_mark_sync(vma->vm_mm, address, address + PAGE_SIZE);
+}
+
+static void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
+			    unsigned long end)
+{
+	um_tlb_mark_sync(vma->vm_mm, start, end);
+}
+
+static void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	um_tlb_mark_sync(&init_mm, start, end);
+
+	/* Kernel needs to be synced immediately */
+	um_tlb_sync(&init_mm);
+}
 
 void report_enomem(void);
 
diff --git a/arch/um/include/shared/skas/skas.h b/arch/um/include/shared/skas/skas.h
index 5c78b0cc3dd4..ebaa116de30b 100644
--- a/arch/um/include/shared/skas/skas.h
+++ b/arch/um/include/shared/skas/skas.h
@@ -16,5 +16,6 @@ extern void handle_syscall(struct uml_pt_regs *regs);
 extern long execute_syscall_skas(void *r);
 extern unsigned long current_stub_stack(void);
 extern struct mm_id *current_mm_id(void);
+extern void current_mm_sync(void);
 
 #endif
diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c
index c7345c83e07b..26c12db3eca9 100644
--- a/arch/um/kernel/skas/process.c
+++ b/arch/um/kernel/skas/process.c
@@ -8,6 +8,8 @@
 #include <linux/sched/task_stack.h>
 #include <linux/sched/task.h>
 
+#include <asm/tlbflush.h>
+
 #include <as-layout.h>
 #include <kern.h>
 #include <os.h>
@@ -61,3 +63,11 @@ struct mm_id *current_mm_id(void)
 
 	return &current->mm->context.id;
 }
+
+void current_mm_sync(void)
+{
+	if (current->mm == NULL)
+		return;
+
+	um_tlb_sync(current->mm);
+}
diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index c137ff6f84dd..232aa7601d5d 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -169,14 +169,16 @@ static inline int update_p4d_range(pgd_t *pgd, unsigned long addr,
 	return ret;
 }
 
-static int fix_range_common(struct mm_struct *mm, unsigned long start_addr,
-			     unsigned long end_addr)
+int um_tlb_sync(struct mm_struct *mm)
 {
 	pgd_t *pgd;
 	struct vm_ops ops;
-	unsigned long addr = start_addr, next;
+	unsigned long addr = mm->context.sync_tlb_range_from, next;
 	int ret = 0;
 
+	if (mm->context.sync_tlb_range_to == 0)
+		return 0;
+
 	ops.mm_idp = &mm->context.id;
 	if (mm == &init_mm) {
 		ops.mmap = kern_map;
@@ -190,7 +192,7 @@ static int fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 
 	pgd = pgd_offset(mm, addr);
 	do {
-		next = pgd_addr_end(addr, end_addr);
+		next = pgd_addr_end(addr, mm->context.sync_tlb_range_to);
 		if (!pgd_present(*pgd)) {
 			if (pgd_newpage(*pgd)) {
 				ret = ops.unmap(ops.mm_idp, addr,
@@ -199,87 +201,16 @@ static int fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 			}
 		} else
 			ret = update_p4d_range(pgd, addr, next, &ops);
-	} while (pgd++, addr = next, ((addr < end_addr) && !ret));
+	} while (pgd++, addr = next,
+		 ((addr < mm->context.sync_tlb_range_to) && !ret));
 
 	if (ret == -ENOMEM)
 		report_enomem();
 
-	return ret;
-}
-
-static void flush_tlb_kernel_range_common(unsigned long start, unsigned long end)
-{
-	int err;
-
-	err = fix_range_common(&init_mm, start, end);
-
-	if (err)
-		panic("flush_tlb_kernel failed, errno = %d\n", err);
-}
-
-void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
-{
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	struct mm_struct *mm = vma->vm_mm;
-	int r, w, x, prot;
-	struct mm_id *mm_id;
-
-	address &= PAGE_MASK;
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		goto kill;
-
-	p4d = p4d_offset(pgd, address);
-	if (!p4d_present(*p4d))
-		goto kill;
-
-	pud = pud_offset(p4d, address);
-	if (!pud_present(*pud))
-		goto kill;
-
-	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
-		goto kill;
-
-	pte = pte_offset_kernel(pmd, address);
-
-	r = pte_read(*pte);
-	w = pte_write(*pte);
-	x = pte_exec(*pte);
-	if (!pte_young(*pte)) {
-		r = 0;
-		w = 0;
-	} else if (!pte_dirty(*pte)) {
-		w = 0;
-	}
-
-	mm_id = &mm->context.id;
-	prot = ((r ? UM_PROT_READ : 0) | (w ? UM_PROT_WRITE : 0) |
-		(x ? UM_PROT_EXEC : 0));
-	if (pte_newpage(*pte)) {
-		if (pte_present(*pte)) {
-			unsigned long long offset;
-			int fd;
-
-			fd = phys_mapping(pte_val(*pte) & PAGE_MASK, &offset);
-			map(mm_id, address, PAGE_SIZE, prot, fd, offset);
-		} else
-			unmap(mm_id, address, PAGE_SIZE);
-	} else if (pte_newprot(*pte))
-		protect(mm_id, address, PAGE_SIZE, prot);
-
-	*pte = pte_mkuptodate(*pte);
+	mm->context.sync_tlb_range_from = 0;
+	mm->context.sync_tlb_range_to = 0;
 
-	return;
-
-kill:
-	printk(KERN_ERR "Failed to flush page for address 0x%lx\n", address);
-	force_sig(SIGKILL);
+	return ret;
 }
 
 void flush_tlb_all(void)
@@ -294,54 +225,11 @@ void flush_tlb_all(void)
 	flush_tlb_mm(current->mm);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
-{
-	flush_tlb_kernel_range_common(start, end);
-}
-
-void flush_tlb_kernel_vm(void)
-{
-	flush_tlb_kernel_range_common(start_vm, end_vm);
-}
-
-void __flush_tlb_one(unsigned long addr)
-{
-	flush_tlb_kernel_range_common(addr, addr + PAGE_SIZE);
-}
-
-static void fix_range(struct mm_struct *mm, unsigned long start_addr,
-		      unsigned long end_addr)
-{
-	/*
-	 * Don't bother flushing if this address space is about to be
-	 * destroyed.
-	 */
-	if (atomic_read(&mm->mm_users) == 0)
-		return;
-
-	fix_range_common(mm, start_addr, end_addr);
-}
-
-void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
-		     unsigned long end)
-{
-	if (vma->vm_mm == NULL)
-		flush_tlb_kernel_range_common(start, end);
-	else fix_range(vma->vm_mm, start, end);
-}
-EXPORT_SYMBOL(flush_tlb_range);
-
-void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-			unsigned long end)
-{
-	fix_range(mm, start, end);
-}
-
 void flush_tlb_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	VMA_ITERATOR(vmi, mm, 0);
 
 	for_each_vma(vmi, vma)
-		fix_range(mm, vma->vm_start, vma->vm_end);
+		um_tlb_mark_sync(mm, vma->vm_start, vma->vm_end);
 }
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 6d8ae86ae978..97c8df9c4401 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -113,7 +113,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 #if 0
 	WARN_ON(!pte_young(*pte) || (is_write && !pte_dirty(*pte)));
 #endif
-	flush_tlb_page(vma, address);
+
 out:
 	mmap_read_unlock(mm);
 out_nosemaphore:
@@ -210,8 +210,17 @@ unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
 	if (!is_user && regs)
 		current->thread.segv_regs = container_of(regs, struct pt_regs, regs);
 
-	if (!is_user && (address >= start_vm) && (address < end_vm)) {
-		flush_tlb_kernel_vm();
+	if (!is_user && init_mm.context.sync_tlb_range_to) {
+		/*
+		 * Kernel has pending updates from set_ptes that were not
+		 * flushed yet. Syncing them should fix the pagefault (if not
+		 * we'll get here again and panic).
+		 */
+		err = um_tlb_sync(&init_mm);
+		if (err == -ENOMEM)
+			report_enomem();
+		if (err)
+			panic("Failed to sync kernel TLBs: %d", err);
 		goto out;
 	}
 	else if (current->mm == NULL) {
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index 8dca42627f39..06e23cf870e2 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -343,6 +343,8 @@ void userspace(struct uml_pt_regs *regs, unsigned long *aux_fp_regs)
 	interrupt_end();
 
 	while (1) {
+		current_mm_sync();
+
 		/* Flush out any pending syscalls */
 		err = syscall_stub_flush(current_mm_id());
 		if (err) {
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/12] Rework stub syscall and page table handling
  2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
                   ` (11 preceding siblings ...)
  2024-04-18  9:23 ` [PATCH 12/12] um: refactor TLB update handling benjamin
@ 2024-04-22  2:35 ` Tiwei Bie
  2024-04-22  7:41   ` Benjamin Berg
  12 siblings, 1 reply; 19+ messages in thread
From: Tiwei Bie @ 2024-04-22  2:35 UTC (permalink / raw)
  To: benjamin, linux-um; +Cc: Benjamin Berg

Hello Benjamin,

On 4/18/24 5:23 PM, benjamin@sipsolutions.net wrote:
> From: Benjamin Berg <benjamin.berg@intel.com>
> 
> This patchset reworks the stub syscall handling and also redoes how page
> table updates are tracked and synchronized. Some of this originated in
> the SECCOMP patchset, but it became clear that these refactorings make
> sense independently as they result in considerably fewer page faults.

I saw your SECCOMP patchset. It's pretty cool! Just wondering if you're about
to post a new version soon. :)

PS. Just FYI, gVisor also implemented a SECCOMP based platform which is
also very interesting:

https://gvisor.dev/blog/2023/04/28/systrap-release/
https://github.com/google/gvisor/tree/master/pkg/sentry/platform/systrap

Regards,
Tiwei


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/12] um: refactor TLB update handling
  2024-04-18  9:23 ` [PATCH 12/12] um: refactor TLB update handling benjamin
@ 2024-04-22  2:51   ` Tiwei Bie
  2024-04-22  7:22     ` Benjamin Berg
  0 siblings, 1 reply; 19+ messages in thread
From: Tiwei Bie @ 2024-04-22  2:51 UTC (permalink / raw)
  To: benjamin, linux-um; +Cc: Benjamin Berg

On 4/18/24 5:23 PM, benjamin@sipsolutions.net wrote:
> diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
> index 37eb6e89e79a..bf8da736609c 100644
> --- a/arch/um/include/asm/mmu.h
> +++ b/arch/um/include/asm/mmu.h
> @@ -10,6 +10,10 @@
>  
>  typedef struct mm_context {
>  	struct mm_id id;
> +
> +	/* Address range in need of a TLB sync */
> +	long int sync_tlb_range_from;
> +	long int sync_tlb_range_to;

Why not "unsigned long"?

>  } mm_context_t;
>  
>  extern void __switch_mm(struct mm_id * mm_idp);
> diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
> index e1ece21dbe3f..5bb397b65efb 100644
> --- a/arch/um/include/asm/pgtable.h
> +++ b/arch/um/include/asm/pgtable.h
> @@ -244,6 +244,38 @@ static inline void set_pte(pte_t *pteptr, pte_t pteval)
>  
>  #define PFN_PTE_SHIFT		PAGE_SHIFT
>  
> +static inline void um_tlb_mark_sync(struct mm_struct *mm, unsigned long start,
> +				    unsigned long end)
> +{
> +	if (!mm->context.sync_tlb_range_to) {
> +		mm->context.sync_tlb_range_from = start;
> +		mm->context.sync_tlb_range_to = end;
> +	} else {
> +		if (start < mm->context.sync_tlb_range_from)
> +			mm->context.sync_tlb_range_from = start;
> +		if (end > mm->context.sync_tlb_range_to)
> +			mm->context.sync_tlb_range_to = end;
> +	}
> +}

IIUC, in some cases, the range [sync_tlb_range_from, sync_tlb_range_to)
might become very large when merging non-adjacent ranges? Could that
be an issue?

> diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
> index d7cf82023b74..62816f6f1c91 100644
> --- a/arch/um/include/asm/tlbflush.h
> +++ b/arch/um/include/asm/tlbflush.h
> @@ -9,24 +9,50 @@
>  #include <linux/mm.h>
>  
>  /*
> - * TLB flushing:
> + * In UML, we need to sync the TLB by using mmap/munmap/mprotect syscalls
> + * from the process handling the MM (which can be the kernel itself).
> + *
> + * To track updates, we can hook into set_ptes and flush_tlb_*. With set_ptes
> + * we catch all PTE transitions where memory that was unusable becomes usable.
> + * With flush_tlb_* we can track any memory that becomes unusable, even if
> + * a higher layer of the page table was modified.
> + *
> + * So, we simply track updates using both methods and mark the memory area to
> + * be synced later on. The only special case is that flush_tlb_kernel_* needs
> + * to be executed immediately as there is no good synchronization point in
> + * that case. In contrast, in the set_ptes case we can wait for the next
> + * kernel segfault before we do the synchronization.
>   *
> - *  - flush_tlb() flushes the current mm struct TLBs
>   *  - flush_tlb_all() flushes all processes TLBs
>   *  - flush_tlb_mm(mm) flushes the specified mm context TLB's
>   *  - flush_tlb_page(vma, vmaddr) flushes one page
> - *  - flush_tlb_kernel_vm() flushes the kernel vm area
>   *  - flush_tlb_range(vma, start, end) flushes a range of pages
> + *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
>   */
>  
> +extern int um_tlb_sync(struct mm_struct *mm);
> +
>  extern void flush_tlb_all(void);
>  extern void flush_tlb_mm(struct mm_struct *mm);
> -extern void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, 
> -			    unsigned long end);
> -extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);
> -extern void flush_tlb_kernel_vm(void);
> -extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
> -extern void __flush_tlb_one(unsigned long addr);
> +
> +static void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
> +{
> +	um_tlb_mark_sync(vma->vm_mm, address, address + PAGE_SIZE);
> +}
> +
> +static void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> +			    unsigned long end)
> +{
> +	um_tlb_mark_sync(vma->vm_mm, start, end);
> +}
> +
> +static void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> +	um_tlb_mark_sync(&init_mm, start, end);
> +
> +	/* Kernel needs to be synced immediately */
> +	um_tlb_sync(&init_mm);
> +}

Nit: this is a header file, these functions should be defined as inline functions.

> diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
> index c137ff6f84dd..232aa7601d5d 100644
> --- a/arch/um/kernel/tlb.c
> +++ b/arch/um/kernel/tlb.c
[...]
>  
> -void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> -{
> -	flush_tlb_kernel_range_common(start, end);
> -}
> -
> -void flush_tlb_kernel_vm(void)
> -{
> -	flush_tlb_kernel_range_common(start_vm, end_vm);
> -}

The build breaks with this change, as there is still a call to
flush_tlb_kernel_vm() in ubd.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/um/drivers/ubd_kern.c?id=fb5d1d389c9e78d68f1f71f926d6251017579f5b#n774

Regards,
Tiwei



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/12] um: refactor TLB update handling
  2024-04-22  2:51   ` Tiwei Bie
@ 2024-04-22  7:22     ` Benjamin Berg
  2024-04-22  7:51       ` Anton Ivanov
  0 siblings, 1 reply; 19+ messages in thread
From: Benjamin Berg @ 2024-04-22  7:22 UTC (permalink / raw)
  To: Tiwei Bie, linux-um

On Mon, 2024-04-22 at 10:51 +0800, Tiwei Bie wrote:
> On 4/18/24 5:23 PM, benjamin@sipsolutions.net wrote:
> > diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
> > index 37eb6e89e79a..bf8da736609c 100644
> > --- a/arch/um/include/asm/mmu.h
> > +++ b/arch/um/include/asm/mmu.h
> > @@ -10,6 +10,10 @@
> >  
> >  typedef struct mm_context {
> >  	struct mm_id id;
> > +
> > +	/* Address range in need of a TLB sync */
> > +	long int sync_tlb_range_from;
> > +	long int sync_tlb_range_to;
> 
> Why not "unsigned long"?

Oops, yes, it should be "unsigned long".

> 
> >  } mm_context_t;
> >  
> >  extern void __switch_mm(struct mm_id * mm_idp);
> > diff --git a/arch/um/include/asm/pgtable.h
> > b/arch/um/include/asm/pgtable.h
> > index e1ece21dbe3f..5bb397b65efb 100644
> > --- a/arch/um/include/asm/pgtable.h
> > +++ b/arch/um/include/asm/pgtable.h
> > @@ -244,6 +244,38 @@ static inline void set_pte(pte_t *pteptr,
> > pte_t pteval)
> >  
> >  #define PFN_PTE_SHIFT		PAGE_SHIFT
> >  
> > +static inline void um_tlb_mark_sync(struct mm_struct *mm, unsigned
> > long start,
> > +				    unsigned long end)
> > +{
> > +	if (!mm->context.sync_tlb_range_to) {
> > +		mm->context.sync_tlb_range_from = start;
> > +		mm->context.sync_tlb_range_to = end;
> > +	} else {
> > +		if (start < mm->context.sync_tlb_range_from)
> > +			mm->context.sync_tlb_range_from = start;
> > +		if (end > mm->context.sync_tlb_range_to)
> > +			mm->context.sync_tlb_range_to = end;
> > +	}
> > +}
> 
> IIUC, in some cases, the range [sync_tlb_range_from, sync_tlb_range_to)
> might become very large when merging non-adjacent ranges? Could that
> be an issue?

I figured it is not a big problem. It will result in scanning the
entire page table once to check whether the NEW_PAGE bit is set on any
PTE. I am assuming that this will almost never happen, and scanning the
page table (without doing any syscalls) is reasonably cheap in the end.
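
Purely for illustration, here is a standalone userspace toy of the
merge logic (hypothetical names mirroring um_tlb_mark_sync(), not the
kernel code itself) showing how two distant updates collapse into one
conservative range that a single later walk then covers:

#include <stdio.h>

/* sync_from/sync_to model the mm_context fields; sync_to == 0 means
 * "nothing pending", matching the check in the patch. */
static unsigned long sync_from, sync_to;

static void mark_sync(unsigned long start, unsigned long end)
{
	if (!sync_to) {
		sync_from = start;
		sync_to = end;
	} else {
		if (start < sync_from)
			sync_from = start;
		if (end > sync_to)
			sync_to = end;
	}
}

int main(void)
{
	mark_sync(0x1000, 0x2000);                 /* low update */
	mark_sync(0x7f0000000000, 0x7f0000001000); /* far-away update */

	/* One conservative range remains; walking it only reads the page
	 * table and issues no syscalls for untouched entries. */
	printf("pending sync: 0x%lx-0x%lx\n", sync_from, sync_to);
	return 0;
}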

> > diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
> > index d7cf82023b74..62816f6f1c91 100644
> > --- a/arch/um/include/asm/tlbflush.h
> > +++ b/arch/um/include/asm/tlbflush.h
> > @@ -9,24 +9,50 @@
> >  #include <linux/mm.h>
> >  
> >  /*
> > - * TLB flushing:
> > + * In UML, we need to sync the TLB by using mmap/munmap/mprotect syscalls
> > + * from the process handling the MM (which can be the kernel itself).
> > + *
> > + * To track updates, we can hook into set_ptes and flush_tlb_*. With set_ptes
> > + * we catch all PTE transitions where memory that was unusable becomes usable.
> > + * With flush_tlb_* we can track any memory that becomes unusable, even if
> > + * a higher layer of the page table was modified.
> > + *
> > + * So, we simply track updates using both methods and mark the memory area to
> > + * be synced later on. The only special case is that flush_tlb_kernel_* needs
> > + * to be executed immediately as there is no good synchronization point in
> > + * that case. In contrast, in the set_ptes case we can wait for the next
> > + * kernel segfault before we do the synchronization.
> >   *
> > - *  - flush_tlb() flushes the current mm struct TLBs
> >   *  - flush_tlb_all() flushes all processes TLBs
> >   *  - flush_tlb_mm(mm) flushes the specified mm context TLB's
> >   *  - flush_tlb_page(vma, vmaddr) flushes one page
> > - *  - flush_tlb_kernel_vm() flushes the kernel vm area
> >   *  - flush_tlb_range(vma, start, end) flushes a range of pages
> > + *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
> >   */
> >  
> > +extern int um_tlb_sync(struct mm_struct *mm);
> > +
> >  extern void flush_tlb_all(void);
> >  extern void flush_tlb_mm(struct mm_struct *mm);
> > -extern void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, 
> > -			    unsigned long end);
> > -extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);
> > -extern void flush_tlb_kernel_vm(void);
> > -extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
> > -extern void __flush_tlb_one(unsigned long addr);
> > +
> > +static void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
> > +{
> > +	um_tlb_mark_sync(vma->vm_mm, address, address + PAGE_SIZE);
> > +}
> > +
> > +static void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > +			    unsigned long end)
> > +{
> > +	um_tlb_mark_sync(vma->vm_mm, start, end);
> > +}
> > +
> > +static void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> > +{
> > +	um_tlb_mark_sync(&init_mm, start, end);
> > +
> > +	/* Kernel needs to be synced immediately */
> > +	um_tlb_sync(&init_mm);
> > +}
> 
> Nit: this is a header file, these functions should be defined as inline functions.

Yup, thanks!

> > diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
> > index c137ff6f84dd..232aa7601d5d 100644
> > --- a/arch/um/kernel/tlb.c
> > +++ b/arch/um/kernel/tlb.c
> [...]
> >  
> > -void flush_tlb_kernel_range(unsigned long start, unsigned long
> > end)
> > -{
> > -	flush_tlb_kernel_range_common(start, end);
> > -}
> > -
> > -void flush_tlb_kernel_vm(void)
> > -{
> > -	flush_tlb_kernel_range_common(start_vm, end_vm);
> > -}
> 
> The build breaks with this change, as there is still a call to
> flush_tlb_kernel_vm() in ubd.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/um/drivers/ubd_kern.c?id=fb5d1d389c9e78d68f1f71f926d6251017579f5b#n774

Oh, thanks for the pointer!

I do not see a good reason for that call to even exist. My best theory
right now is that it existed to avoid later pagefaults for new memory
regions (the vmalloc area?), i.e. a workaround that is no longer needed
with this patch.

Benjamin




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/12] Rework stub syscall and page table handling
  2024-04-22  2:35 ` [PATCH 00/12] Rework stub syscall and page table handling Tiwei Bie
@ 2024-04-22  7:41   ` Benjamin Berg
  2024-04-22 12:08     ` Tiwei Bie
  0 siblings, 1 reply; 19+ messages in thread
From: Benjamin Berg @ 2024-04-22  7:41 UTC (permalink / raw)
  To: Tiwei Bie, linux-um

Hi Tiwei,

On Mon, 2024-04-22 at 10:35 +0800, Tiwei Bie wrote:
> On 4/18/24 5:23 PM, benjamin@sipsolutions.net wrote:
> > From: Benjamin Berg <benjamin.berg@intel.com>
> > 
> > This patchset reworks the stub syscall handling and also redoes how page
> > table updates are tracked and synchronized. Some of this originated in
> > the SECCOMP patchset, but it became clear that these refactorings make
> > sense independently as they result in considerably fewer page faults.
> 
> I saw your SECCOMP patchset. It's pretty cool! Just wondering if you're about
> to post a new version soon. :)

I am planning to work on it again, but it is not very high on my
priority list. So it could be quite soon, or it could take some months :-)

In the ARM support thread ("UML for arm64"), there were some ideas to
use FD passing in order to protect memory mappings better. Doing that
should allow the SECCOMP approach to scale to SMP and will also
simplify the security model.

Making those changes will take a bit of thought and experimentation.
Nothing really big though, it pretty much boils down to using sockets
for (some of) the synchronization and replacing mprotect with mmap so
the FD can authorize the operation.
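
As a rough sketch of the FD passing part (send_fd is a hypothetical
helper, nothing that exists in the tree yet), the usual SCM_RIGHTS
dance over an AF_UNIX socket would look like this:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand an open descriptor to the peer of an AF_UNIX socket; the peer
 * receives its own fd for the same open file via SCM_RIGHTS. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

The stub would then only be able to mmap memory that the kernel side
explicitly handed it a descriptor for.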

> PS. Just FYI, gVisor also implemented a SECCOMP based platform which is
> also very interesting:
> 
> https://gvisor.dev/blog/2023/04/28/systrap-release/
> https://github.com/google/gvisor/tree/master/pkg/sentry/platform/systrap

It is a good choice if you want to catch syscalls and do some custom
handling in userspace. Just annoying that it is still stuck on classic
BPF :-)
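
Just to show the scale of it, a minimal classic-BPF seccomp filter that
traps a single syscall for userspace handling fits in a few statements
(an illustrative sketch, not what either project actually installs):

#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

static int install_trap_filter(unsigned int nr)
{
	struct sock_filter filter[] = {
		/* Load the syscall number from seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* Trap the chosen syscall, allow everything else. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Required to install a filter without CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

SECCOMP_RET_TRAP delivers SIGSYS, so a signal handler in the process
can then emulate the syscall entirely in userspace.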

Benjamin


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 12/12] um: refactor TLB update handling
  2024-04-22  7:22     ` Benjamin Berg
@ 2024-04-22  7:51       ` Anton Ivanov
  0 siblings, 0 replies; 19+ messages in thread
From: Anton Ivanov @ 2024-04-22  7:51 UTC (permalink / raw)
  To: Benjamin Berg, Tiwei Bie, linux-um



On 22/04/2024 08:22, Benjamin Berg wrote:
> On Mon, 2024-04-22 at 10:51 +0800, Tiwei Bie wrote:
>> On 4/18/24 5:23 PM, benjamin@sipsolutions.net wrote:
>>> diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
>>> index 37eb6e89e79a..bf8da736609c 100644
>>> --- a/arch/um/include/asm/mmu.h
>>> +++ b/arch/um/include/asm/mmu.h
>>> @@ -10,6 +10,10 @@
>>>   
>>>   typedef struct mm_context {
>>>   	struct mm_id id;
>>> +
>>> +	/* Address range in need of a TLB sync */
>>> +	long int sync_tlb_range_from;
>>> +	long int sync_tlb_range_to;
>>
>> Why not "unsigned long"?
> 
> Oops, yes, it should be "unsigned long".
> 
>>
>>>   } mm_context_t;
>>>   
>>>   extern void __switch_mm(struct mm_id * mm_idp);
>>> diff --git a/arch/um/include/asm/pgtable.h
>>> b/arch/um/include/asm/pgtable.h
>>> index e1ece21dbe3f..5bb397b65efb 100644
>>> --- a/arch/um/include/asm/pgtable.h
>>> +++ b/arch/um/include/asm/pgtable.h
>>> @@ -244,6 +244,38 @@ static inline void set_pte(pte_t *pteptr,
>>> pte_t pteval)
>>>   
>>>   #define PFN_PTE_SHIFT		PAGE_SHIFT
>>>   
>>> +static inline void um_tlb_mark_sync(struct mm_struct *mm, unsigned
>>> long start,
>>> +				    unsigned long end)
>>> +{
>>> +	if (!mm->context.sync_tlb_range_to) {
>>> +		mm->context.sync_tlb_range_from = start;
>>> +		mm->context.sync_tlb_range_to = end;
>>> +	} else {
>>> +		if (start < mm->context.sync_tlb_range_from)
>>> +			mm->context.sync_tlb_range_from = start;
>>> +		if (end > mm->context.sync_tlb_range_to)
>>> +			mm->context.sync_tlb_range_to = end;
>>> +	}
>>> +}
>>
>> IIUC, in some cases, the range [sync_tlb_range_from, sync_tlb_range_to)
>> might become very large when merging non-adjacent ranges? Could that
>> be an issue?
> 
> I figured it is not a big problem. It will result in scanning the
> entire page table once to check whether the NEW_PAGE bit is set on any
> PTE. I am assuming that this will happen almost never and scanning the
> page table (but not doing syscalls) is reasonably cheap at the end.
> 
>>> diff --git a/arch/um/include/asm/tlbflush.h b/arch/um/include/asm/tlbflush.h
>>> index d7cf82023b74..62816f6f1c91 100644
>>> --- a/arch/um/include/asm/tlbflush.h
>>> +++ b/arch/um/include/asm/tlbflush.h
>>> @@ -9,24 +9,50 @@
>>>   #include <linux/mm.h>
>>>   
>>>   /*
>>> - * TLB flushing:
>>> + * In UML, we need to sync the TLB by using mmap/munmap/mprotect syscalls
>>> + * from the process handling the MM (which can be the kernel itself).
>>> + *
>>> + * To track updates, we can hook into set_ptes and flush_tlb_*. With set_ptes
>>> + * we catch all PTE transitions where memory that was unusable becomes usable.
>>> + * With flush_tlb_* we can track any memory that becomes unusable, even if
>>> + * a higher layer of the page table was modified.
>>> + *
>>> + * So, we simply track updates using both methods and mark the memory area to
>>> + * be synced later on. The only special case is that flush_tlb_kernel_* needs
>>> + * to be executed immediately as there is no good synchronization point in
>>> + * that case. In contrast, in the set_ptes case we can wait for the next
>>> + * kernel segfault before we do the synchronization.
>>>    *
>>> - *  - flush_tlb() flushes the current mm struct TLBs
>>>    *  - flush_tlb_all() flushes all processes TLBs
>>>    *  - flush_tlb_mm(mm) flushes the specified mm context TLB's
>>>    *  - flush_tlb_page(vma, vmaddr) flushes one page
>>> - *  - flush_tlb_kernel_vm() flushes the kernel vm area
>>>    *  - flush_tlb_range(vma, start, end) flushes a range of pages
>>> + *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
>>>    */
>>>   
>>> +extern int um_tlb_sync(struct mm_struct *mm);
>>> +
>>>   extern void flush_tlb_all(void);
>>>   extern void flush_tlb_mm(struct mm_struct *mm);
>>> -extern void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>> -			    unsigned long end);
>>> -extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);
>>> -extern void flush_tlb_kernel_vm(void);
>>> -extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>>> -extern void __flush_tlb_one(unsigned long addr);
>>> +
>>> +static void flush_tlb_page(struct vm_area_struct *vma, unsigned long address)
>>> +{
>>> +	um_tlb_mark_sync(vma->vm_mm, address, address + PAGE_SIZE);
>>> +}
>>> +
>>> +static void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>> +			    unsigned long end)
>>> +{
>>> +	um_tlb_mark_sync(vma->vm_mm, start, end);
>>> +}
>>> +
>>> +static void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>>> +{
>>> +	um_tlb_mark_sync(&init_mm, start, end);
>>> +
>>> +	/* Kernel needs to be synced immediately */
>>> +	um_tlb_sync(&init_mm);
>>> +}
>>
>> Nit: this is a header file, these functions should be defined as inline functions.
> 
> Yup, thanks!
> 
>>> diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
>>> index c137ff6f84dd..232aa7601d5d 100644
>>> --- a/arch/um/kernel/tlb.c
>>> +++ b/arch/um/kernel/tlb.c
>> [...]
>>>   
>>> -void flush_tlb_kernel_range(unsigned long start, unsigned long
>>> end)
>>> -{
>>> -	flush_tlb_kernel_range_common(start, end);
>>> -}
>>> -
>>> -void flush_tlb_kernel_vm(void)
>>> -{
>>> -	flush_tlb_kernel_range_common(start_vm, end_vm);
>>> -}
>>
>> The build breaks with this change, as there is still a call to
>> flush_tlb_kernel_vm() in ubd.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/um/drivers/ubd_kern.c?id=fb5d1d389c9e78d68f1f71f926d6251017579f5b#n774
> 
> Oh, thanks for the pointer!
> 
> I do not see a good reason for that call to even exist. My best theory
> right now is that it existed to avoid later pagefaults for new memory
> regions (the vmalloc area?), i.e. a workaround that is no longer needed
> with this patch.

It has been there since prehistoric times.

No idea why or what it's doing. IMHO it is not needed.

> 
> Benjamin
> 
> 
> 
> 

-- 
Anton R. Ivanov
https://www.kot-begemot.co.uk/


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/12] Rework stub syscall and page table handling
  2024-04-22  7:41   ` Benjamin Berg
@ 2024-04-22 12:08     ` Tiwei Bie
  0 siblings, 0 replies; 19+ messages in thread
From: Tiwei Bie @ 2024-04-22 12:08 UTC (permalink / raw)
  To: Benjamin Berg, linux-um

On 4/22/24 3:41 PM, Benjamin Berg wrote:
> On Mon, 2024-04-22 at 10:35 +0800, Tiwei Bie wrote:
>> On 4/18/24 5:23 PM, benjamin@sipsolutions.net wrote:
>>> From: Benjamin Berg <benjamin.berg@intel.com>
>>>
>>> This patchset reworks the stub syscall handling and also redoes how page
>>> table updates are tracked and synchronized. Some of this originated in
>>> the SECCOMP patchset, but it became clear that these refactorings make
>>> sense independently as they result in considerably fewer page faults.
>>
>> I saw your SECCOMP patchset. It's pretty cool! Just wondering if you're about
>> to post a new version soon. :)
> 
> I am planning to work on it again, but it is not very high on my
> priority list. So, could be quite soon or some months :-)
> 
> In the ARM support thread ("UML for arm64"), there were some ideas to
> use FD passing in order to protect memory mappings better. Doing that
> should allow the SECCOMP approach to scale to SMP and will also
> simplify the security model.
> 
> Making those changes will take a bit of thought and experimentation.
> Nothing really big though, it pretty much boils down to using sockets
> for (some of) the synchronization and replacing mprotect with mmap so
> the FD can authorize the operation.

Cool. Thanks for the details! Looking forward to your new version. :)

Regards,
Tiwei


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-04-22 12:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-18  9:23 [PATCH 00/12] Rework stub syscall and page table handling benjamin
2024-04-18  9:23 ` [PATCH 01/12] um: Remove stub-data.h include from common-offsets.h benjamin
2024-04-18  9:23 ` [PATCH 02/12] um: Create signal stack memory assignment in stub_data benjamin
2024-04-18  9:23 ` [PATCH 03/12] um: Add generic stub_syscall6 function benjamin
2024-04-18  9:23 ` [PATCH 04/12] um: Rework syscall handling benjamin
2024-04-18  9:23 ` [PATCH 05/12] um: compress memory related stub syscalls while adding them benjamin
2024-04-18  9:23 ` [PATCH 06/12] um: remove LDT support benjamin
2024-04-18  9:23 ` [PATCH 07/12] um: remove copy_context_skas0 benjamin
2024-04-18  9:23 ` [PATCH 08/12] um: Delay flushing syscalls until the thread is restarted benjamin
2024-04-18  9:23 ` [PATCH 09/12] um: Do not flush MM in flush_thread benjamin
2024-04-18  9:23 ` [PATCH 10/12] um: remove force_flush_all from fork_handler benjamin
2024-04-18  9:23 ` [PATCH 11/12] um: simplify and consolidate TLB updates benjamin
2024-04-18  9:23 ` [PATCH 12/12] um: refactor TLB update handling benjamin
2024-04-22  2:51   ` Tiwei Bie
2024-04-22  7:22     ` Benjamin Berg
2024-04-22  7:51       ` Anton Ivanov
2024-04-22  2:35 ` [PATCH 00/12] Rework stub syscall and page table handling Tiwei Bie
2024-04-22  7:41   ` Benjamin Berg
2024-04-22 12:08     ` Tiwei Bie
