Kernel-hardening archive on lore.kernel.org
* [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
       [not found] <aefc85852ea518982e74b233e11e16d2e707bc32>
@ 2020-07-28 13:10 ` madvenka
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
                     ` (7 more replies)
  0 siblings, 8 replies; 49+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Introduction
------------

Trampolines are used in many different user applications. Trampoline
code is often generated at runtime. Trampoline code can also just be a
pre-defined sequence of machine instructions in a data buffer.

Trampoline code is placed either in a data page or in a stack page. In
order to execute a trampoline, the page it resides in needs to be mapped
with execute permissions. Writable pages with execute permissions provide
an attack surface for hackers. Attackers can use this to inject malicious
code, modify existing code or do other harm.

To mitigate this, LSMs such as SELinux may not allow pages to have both
write and execute permissions. This prevents trampolines from executing
and blocks applications that use trampolines. To allow genuine applications
to run, exceptions have to be made for them (by setting execmem, etc).
In this case, the attack surface is just the pages of such applications.

An application that is not allowed to have writable executable pages
may try to load trampoline code into a file and map the file with execute
permissions. In this case, the attack surface is just the buffer that
contains trampoline code. However, a successful exploit may give the
attacker a means to load their own code into a file, map it and execute it.

LSMs (such as the IPE proposal [1]) may allow only properly signed object
files to be mapped with execute permissions. This will prevent trampoline
files from being mapped. Again, exceptions have to be made for genuine
applications.

We need a way to execute trampolines without making security exceptions
where possible and to reduce the attack surface even further.

Examples of trampolines
-----------------------

libffi (A Portable Foreign Function Interface Library):

libffi allows a user to define functions with an arbitrary list of
arguments and return value through a feature called "Closures".
Closures use trampolines to jump to ABI handlers that handle calling
conventions and call a target function. libffi is used by a lot
of different applications. To name a few:

	- Python
	- Java
	- JavaScript
	- Ruby FFI
	- Lisp
	- Objective-C

GCC nested functions:

GCC has traditionally used trampolines for implementing nested
functions. The trampoline is placed on the user stack. So, the stack
needs to be executable.

Currently available solution
----------------------------

One solution that has been proposed to allow trampolines to be executed
without making security exceptions is Trampoline Emulation. See:

https://pax.grsecurity.net/docs/emutramp.txt

In this solution, the kernel recognizes certain sequences of instructions
as "well-known" trampolines. When such a trampoline is executed, a page
fault happens because the trampoline page does not have execute permission.
The kernel recognizes the trampoline and emulates it. Basically, the
kernel does the work of the trampoline on behalf of the application.

Here, the attack surface is the buffer that contains the trampoline.
The attack surface is narrower than before. A hacker may still be able to
modify what gets loaded in the registers or modify the target PC to point
to arbitrary locations.

Currently, the emulated trampolines are the ones used in libffi and GCC
nested functions. To my knowledge, only X86 is supported at this time.

As noted in emutramp.txt, this is not a generic solution. For every new
trampoline that needs to be supported, new instruction sequences need to
be recognized by the kernel and emulated. And this has to be done for
every architecture that needs to be supported.

emutramp.txt notes the following:

"... the real solution is not in emulation but by designing a kernel API
for runtime code generation and modifying userland to make use of it."

Trampoline File Descriptor (trampfd)
------------------------------------

I am proposing a kernel API using anonymous file descriptors that
can be used to create and execute trampolines with the help of the
kernel. As in Trampoline Emulation, the kernel does the work of the trampoline.
The API is described in patch 1/4 of this patchset. I provide a
summary here:

Trampolines commonly execute the following sequence:

	- Load some values in some registers and/or
	- Push some values on the stack
	- Jump to a target PC

libffi and GCC nested function trampolines fit into this model.

Using the kernel API, applications and libraries can:

	- Create a trampoline object
	- Associate a register context with the trampoline (including
	  a target PC)
	- Associate a stack context with the trampoline
	- Map the trampoline into a process address space
	- Execute the trampoline by executing at the trampoline address

The kernel creates the trampoline mapping without any permissions. When
the trampoline is executed by user code, a page fault happens and the
kernel gets control. The kernel recognizes that this is a trampoline
invocation. It sets up the user registers based on the specified
register context, and/or pushes values on the user stack based on the
specified stack context, and sets the user PC to the requested target
PC. When the kernel returns, execution continues at the target PC.
So, the kernel does the work of the trampoline on behalf of the
application.
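
As a rough illustration of the flow described above (not code from the
patchset), the following userland model applies a stack context and a
register context to a simulated CPU state. Every name in it (model_cpu,
model_tramp, model_invoke, the register indices) is hypothetical, and the
order in which the stack and registers are updated is an assumption of the
sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model; none of these names come from the patchset. */
enum { MODEL_R0, MODEL_R1, MODEL_SP, MODEL_PC, MODEL_NREGS };

struct model_cpu {
	uint64_t regs[MODEL_NREGS];
	uint8_t  stack[256];	/* simulated user stack, grows downward */
};

struct model_reg_value {
	unsigned int reg;	/* index into model_cpu.regs */
	uint64_t     value;
};

struct model_tramp {
	const struct model_reg_value *regs;	/* register context */
	size_t nregs;
	const void *stack_data;			/* stack context */
	size_t stack_size;
};

/*
 * Model of a trampoline invocation: push the stack context, then load the
 * register context (which includes the target PC). Returns 0 on success
 * and -1 if a context is invalid for this CPU state.
 */
int model_invoke(struct model_cpu *cpu, const struct model_tramp *t)
{
	uint64_t sp = cpu->regs[MODEL_SP];
	size_t i;

	if (sp > sizeof(cpu->stack) || t->stack_size > sp)
		return -1;
	for (i = 0; i < t->nregs; i++)
		if (t->regs[i].reg >= MODEL_NREGS)
			return -1;

	if (t->stack_size) {
		sp -= t->stack_size;
		memcpy(&cpu->stack[sp], t->stack_data, t->stack_size);
		cpu->regs[MODEL_SP] = sp;
	}
	for (i = 0; i < t->nregs; i++)
		cpu->regs[t->regs[i].reg] = t->regs[i].value;
	return 0;
}

/* Self-check: returns 0 if the model behaves as described. */
int model_demo(void)
{
	const struct model_reg_value rv[] = {
		{ MODEL_R0, 0x1234 }, { MODEL_PC, 0x4000 },
	};
	const uint64_t pushed = 0xdeadbeef;
	struct model_tramp t = { rv, 2, &pushed, sizeof(pushed) };
	struct model_cpu cpu;
	uint64_t got;

	memset(&cpu, 0, sizeof(cpu));
	cpu.regs[MODEL_SP] = sizeof(cpu.stack);
	if (model_invoke(&cpu, &t))
		return -1;
	assert(cpu.regs[MODEL_SP] <= sizeof(cpu.stack) - sizeof(got));
	memcpy(&got, &cpu.stack[cpu.regs[MODEL_SP]], sizeof(got));
	return (cpu.regs[MODEL_R0] == 0x1234 && cpu.regs[MODEL_PC] == 0x4000 &&
		cpu.regs[MODEL_SP] == sizeof(cpu.stack) - sizeof(pushed) &&
		got == pushed) ? 0 : -1;
}
```

After model_invoke() returns, the simulated PC holds the target address, which
corresponds to execution continuing at the target PC once the kernel returns.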

In this case, the attack surface is the context buffer. A hacker may
attack an application with a vulnerability and may be able to modify the
context buffer. So, when the register or stack context is set for
a trampoline, the values may have been tampered with. From an attack
surface perspective, this is similar to Trampoline Emulation. But
with trampfd, user code can retrieve a trampoline's context from the
kernel and add defensive checks to see if the context has been
tampered with.

As for the target PC, trampfd implements a measure called the
"Allowed PCs" context (see Advantages) to prevent a hacker from making
the target PC point to arbitrary locations. So, the attack surface is
narrower than Trampoline Emulation.

Advantages of the Trampoline File Descriptor approach
-----------------------------------------------------

- trampfd is customizable. The user can specify any combination of
  allowed register name-value pairs in the register context and the kernel
  will set it up accordingly. This allows different user trampolines to be
  converted to use trampfd.

- trampfd allows a stack context to be set up so that trampolines that
  need to push values on the user stack can do that.

- The initial work is targeted for X86 and ARM. But the implementation
  leverages small portions of existing signal delivery code. Specifically,
  it uses pt_regs for setting up user registers and copy_to_user()
  to push values on the stack. So, this can be very easily ported to other
  architectures.

- trampfd provides a basic framework. In the future, new trampoline types
  can be implemented, new contexts can be defined, and additional rules
  can be implemented for security purposes.

- For instance, trampfd defines an "Allowed PCs" context in this initial
  work. As an example, libffi can create a read-only array of all ABI
  handlers for an architecture at build time. This array can be used to
  set the list of allowed PCs for a trampoline. A hacker then cannot modify
  the PC in the register context to make it point to arbitrary locations.

- An SELinux setting called "exectramp" can be implemented along the
  lines of "execmem", "execstack" and "execheap" to selectively allow the
  use of trampolines on a per application basis.

- User code can add defensive checks in the code before invoking a
  trampoline to make sure that a hacker has not modified the context data.
  It can do this by getting the trampoline context from the kernel and
  double checking it.

- In the future, if the kernel can be enhanced to use a safe code
  generation component, that code can be placed in the trampoline mapping
  pages. Then, the trampoline invocation does not have to incur a trip
  into the kernel.

- Also, if the kernel can be enhanced to use a safe code generation
  component, other forms of dynamic code such as JIT code can be
  addressed by the trampfd framework.

- Trampolines can be shared across processes which can give rise to
  interesting uses in the future.

- Trampfd can be used for other purposes to extend the kernel's
  functionality.

libffi
------

I have implemented my solution for libffi and provided the changes for
X86 and ARM, 32-bit and 64-bit. Here is the reference patch:

http://linux.microsoft.com/~madvenka/libffi/libffi.txt

If the trampfd patchset gets accepted, I will send the libffi changes
to the maintainers for a review. BTW, I have also successfully executed
the libffi self tests.

Work that is pending
--------------------

- I am working on implementing an SELinux setting called "exectramp"
  similar to "execmem" to allow the use of trampfd on a per application
  basis.

- I have a comprehensive test program to test the kernel API. I am
  working on adding it to selftests.

References
----------

[1] https://microsoft.github.io/ipe/
---
Madhavan T. Venkataraman (4):
  fs/trampfd: Implement the trampoline file descriptor API
  x86/trampfd: Support for the trampoline file descriptor
  arm64/trampfd: Support for the trampoline file descriptor
  arm/trampfd: Support for the trampoline file descriptor

 arch/arm/include/uapi/asm/ptrace.h     |  20 ++
 arch/arm/kernel/Makefile               |   1 +
 arch/arm/kernel/trampfd.c              | 214 +++++++++++++++++
 arch/arm/mm/fault.c                    |  12 +-
 arch/arm/tools/syscall.tbl             |   1 +
 arch/arm64/include/asm/ptrace.h        |   9 +
 arch/arm64/include/asm/unistd.h        |   2 +-
 arch/arm64/include/asm/unistd32.h      |   2 +
 arch/arm64/include/uapi/asm/ptrace.h   |  57 +++++
 arch/arm64/kernel/Makefile             |   2 +
 arch/arm64/kernel/trampfd.c            | 278 ++++++++++++++++++++++
 arch/arm64/mm/fault.c                  |  15 +-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/uapi/asm/ptrace.h     |  38 +++
 arch/x86/kernel/Makefile               |   2 +
 arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
 arch/x86/mm/fault.c                    |  11 +
 fs/Makefile                            |   1 +
 fs/trampfd/Makefile                    |   6 +
 fs/trampfd/trampfd_data.c              |  43 ++++
 fs/trampfd/trampfd_fops.c              | 131 +++++++++++
 fs/trampfd/trampfd_map.c               |  78 ++++++
 fs/trampfd/trampfd_pcs.c               |  95 ++++++++
 fs/trampfd/trampfd_regs.c              | 137 +++++++++++
 fs/trampfd/trampfd_stack.c             | 131 +++++++++++
 fs/trampfd/trampfd_stubs.c             |  41 ++++
 fs/trampfd/trampfd_syscall.c           |  92 ++++++++
 include/linux/syscalls.h               |   3 +
 include/linux/trampfd.h                |  82 +++++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/trampfd.h           | 171 ++++++++++++++
 init/Kconfig                           |   8 +
 kernel/sys_ni.c                        |   3 +
 34 files changed, 1998 insertions(+), 7 deletions(-)
 create mode 100644 arch/arm/kernel/trampfd.c
 create mode 100644 arch/arm64/kernel/trampfd.c
 create mode 100644 arch/x86/kernel/trampfd.c
 create mode 100644 fs/trampfd/Makefile
 create mode 100644 fs/trampfd/trampfd_data.c
 create mode 100644 fs/trampfd/trampfd_fops.c
 create mode 100644 fs/trampfd/trampfd_map.c
 create mode 100644 fs/trampfd/trampfd_pcs.c
 create mode 100644 fs/trampfd/trampfd_regs.c
 create mode 100644 fs/trampfd/trampfd_stack.c
 create mode 100644 fs/trampfd/trampfd_stubs.c
 create mode 100644 fs/trampfd/trampfd_syscall.c
 create mode 100644 include/linux/trampfd.h
 create mode 100644 include/uapi/linux/trampfd.h

-- 
2.17.1



* [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-28 14:50     ` Oleg Nesterov
  2020-07-28 13:10   ` [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor madvenka
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

There are many applications that use trampoline code. Trampoline code is
usually placed in a data page or a stack page. In order to execute a
trampoline, the page that contains the trampoline needs to have execute
permissions.

Writable pages with execute permissions provide an attack surface for
hackers. To mitigate this, LSMs such as SELinux may prevent a page from
having both write and execute permissions.

An application may attempt to circumvent this by writing the trampoline
code into a temporary file and mapping the file into its process
address space with just execute permissions. This presents the same
opportunity to hackers as before. LSMs that implement cryptographic
verification of files can prevent such temporary files from being mapped.

Such security mitigations prevent genuine trampoline code from running
as well.

Typically, trampolines simply load some values in some registers and/or
push some values on the stack and jump to a target PC. For such simple
trampolines, an application could request the kernel to do that work
instead of executing trampoline code to do that work. trampfd allows
applications to do exactly this.

Such applications can then run without having to relax security
settings for them. For instance, libffi trampolines can easily be
replaced by trampfd. libffi is used by a variety of applications.

trampfd_create() system call
----------------------------

A new system call is introduced to create a trampoline. The system call
number for this is 440. The system call is invoked like this:

	int	trampfd;

	trampfd = syscall(440, type, data);

	type	Trampoline type.
	data	Trampoline type-specific data.
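
A thin wrapper around the call might look like the sketch below. The
structure layout is an assumption: the patch only shows that struct
trampfd_user has "flags" and "reserved" members that must both be zero, so
trampfd_user_sketch and its field widths are hypothetical, as are the
function names. The number 440 is the one quoted in this RFC; on a kernel
without this patchset that number is either unassigned or belongs to a
different system call, so the wrapper is only meaningful on a patched kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Syscall number quoted in this RFC; valid only with the patchset applied. */
#define TRAMPFD_CREATE_NR	440

/* Hypothetical stand-in for struct trampfd_user; field widths are assumed. */
struct trampfd_user_sketch {
	uint32_t flags;
	uint32_t reserved;
};

/* The kernel side rejects non-zero flags/reserved, so start from zero. */
void trampfd_user_sketch_init(struct trampfd_user_sketch *u)
{
	assert(u != NULL);
	memset(u, 0, sizeof(*u));
}

/* Returns the new file descriptor, or -1 with errno set. */
int trampfd_create_sketch(int type, const void *data)
{
	return (int)syscall(TRAMPFD_CREATE_NR, type, data);
}

/* Self-check for the initialisation helper only; makes no system call. */
int trampfd_user_sketch_demo(void)
{
	struct trampfd_user_sketch u = { 7, 9 };

	trampfd_user_sketch_init(&u);
	return (u.flags == 0 && u.reserved == 0) ? 0 : -1;
}
```

A caller would initialise the structure, then pass it as "data" together
with the trampoline type and check the returned descriptor for -1.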

Types of trampolines
--------------------

Different types of trampolines can be defined based on the desired
functionality. In this initial work, the following type is defined:

	TRAMPFD_USER

This implements the simple trampoline type I referred to earlier.
The type-specific structure for TRAMPFD_USER is struct trampfd_user.

Trampoline contexts
-------------------

A trampoline can have one or more contexts associated with it. Contexts
are of two kinds:

	- Contexts that can be specified by the user. These can be added,
	  retrieved and removed by user code.

	- Contexts that are specified by the kernel. These can only be
	  added by the kernel, but they can be read by the user.

In this initial work, I define the following contexts:

User specifiable:

	Register Context
	----------------

	Contains register name-value pairs. When a trampoline is invoked,
	the specified values are loaded in the specified registers. This
	includes the value of the PC register. The kernel specifies the
	subset of registers that can be specified.

	Stack Context
	-------------

	Contains data to push on the user stack when a trampoline is
	invoked.

	Allowed PCs
	-----------

	This specifies a list of PCs that the trampoline is allowed to
	jump to. This prevents a hacker from modifying the trampoline's
	target PC.

Kernel specified:

	Mapping parameters
	------------------

	Used to map a trampoline into an address space. Mapping parameters
	are determined by the kernel based on the trampoline type and
	type-specific information.

Other contexts can be defined in the future.
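
To make the "Allowed PCs" idea concrete, here is a minimal userland model of
the two checks involved: validating a list (the patch rejects zero entries)
and testing whether a target PC is on it. The function names are hypothetical,
and the behaviour for an empty list (reject everything) is an assumption of
this sketch, not something this description states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A list is acceptable only if it contains no zero PC values. */
bool pcs_list_valid(const uint64_t *pcs, size_t npcs)
{
	size_t i;

	for (i = 0; i < npcs; i++)
		if (pcs[i] == 0)
			return false;
	return true;
}

/*
 * True if pc appears in the allowed list. An empty list allows nothing
 * here; that choice is an assumption of the sketch.
 */
bool pc_is_allowed(const uint64_t *pcs, size_t npcs, uint64_t pc)
{
	size_t i;

	for (i = 0; i < npcs; i++)
		if (pcs[i] == pc)
			return true;
	return false;
}

/* Self-check: returns 0 if both helpers behave as described. */
int pcs_demo(void)
{
	/* Made-up handler addresses standing in for ABI handlers. */
	static const uint64_t handlers[] = { 0x1000, 0x2000 };
	static const uint64_t bad[] = { 0x1000, 0 };

	assert(sizeof(handlers) / sizeof(handlers[0]) == 2);
	return (pcs_list_valid(handlers, 2) && !pcs_list_valid(bad, 2) &&
		pc_is_allowed(handlers, 2, 0x2000) &&
		!pc_is_allowed(handlers, 2, 0x3000) &&
		!pc_is_allowed(handlers, 0, 0x1000)) ? 0 : -1;
}
```

With such a list in place, a register context whose PC is 0x3000 would be
refused, while one pointing at 0x1000 or 0x2000 would be accepted.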

How to set and read contexts
----------------------------

A symbolic file offset is associated with each context type.

	TRAMPFD_MAP_OFFSET
	TRAMPFD_REGS_OFFSET
	TRAMPFD_STACK_OFFSET
	TRAMPFD_ALLOWED_PCS_OFFSET

A structure is defined for each context type as well:

	struct trampfd_map
	struct trampfd_regs
	struct trampfd_stack
	struct trampfd_values

To set/retrieve a context, seek to the corresponding offset and
write()/read() the corresponding structure. As a convenience, pwrite()
and pread() take the offset as an argument, so each operation needs one
call instead of two.
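
The difference between the two styles can be sketched with ordinary POSIX
calls. The helper names below are hypothetical, and the self-check uses a
temporary regular file as a stand-in for a trampfd descriptor, so its offsets
are plain byte offsets rather than the symbolic context offsets listed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two calls: position the descriptor, then write the context. */
ssize_t ctx_write_seek(int fd, off_t ctx_offset, const void *ctx, size_t len)
{
	if (lseek(fd, ctx_offset, SEEK_SET) == (off_t)-1)
		return -1;
	return write(fd, ctx, len);
}

/* One call: the offset travels with the write. */
ssize_t ctx_write_at(int fd, off_t ctx_offset, const void *ctx, size_t len)
{
	return pwrite(fd, ctx, len, ctx_offset);
}

/* Self-check on a temporary file; returns 0 on success. */
int ctx_demo(void)
{
	const char a[] = "regs", b[] = "stack";
	char buf[sizeof(b)] = { 0 };
	FILE *f = tmpfile();
	int fd, ok;

	if (!f)
		return -1;
	fd = fileno(f);
	assert(fd >= 0);

	ok = ctx_write_seek(fd, 0, a, sizeof(a)) == (ssize_t)sizeof(a) &&
	     ctx_write_at(fd, 16, b, sizeof(b)) == (ssize_t)sizeof(b) &&
	     pread(fd, buf, sizeof(b), 16) == (ssize_t)sizeof(b) &&
	     memcmp(buf, b, sizeof(b)) == 0 &&
	     /* pwrite()/pread() leave the file position untouched. */
	     lseek(fd, 0, SEEK_CUR) == (off_t)sizeof(a);

	fclose(f);
	return ok ? 0 : -1;
}
```

Besides saving a call, the positional variants do not depend on the
descriptor's current file position, which matters when several threads use
the same descriptor.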

Invoking a trampoline
---------------------

Map the file descriptor into process address space using mmap(). The
kernel returns an address to invoke the trampoline with. The protection
for the mapping is set to PROT_NONE.

Execute the trampoline in one of two ways depending upon what the target
PC points to:

   - Branch to the trampoline address.

   - Use the trampoline address as a function pointer and call it.

Because the user process does not have execute permissions on the
trampoline address, it traps into the kernel. The kernel recognizes
it as a trampoline invocation and performs the action indicated by the
trampoline's type and context. In the case of TRAMPFD_USER, the
kernel loads the user registers with the values specified in the
register context, pushes the values specified in the stack context on
the user stack and sets the user PC to point to the PC register value
in the register context. Then, the process returns to user land and
continues execution at the target PC.

Removing a context
------------------

To remove a context, write the context structure into trampfd but
specify a zero context. For example, for the register context, specify
the number of registers as 0. For the stack context, specify the size of
the stack data as 0.
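
For the register context, a removal request is therefore just a header with
no entries. The sketch below models how such a request could be sized and
built. The layouts are assumptions: the patch shows that struct trampfd_regs
has "nregs" and "reserved" members followed by an array of struct trampfd_reg,
and that the written size must equal the header plus nregs entries, but the
field widths and the contents of each entry here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry layout; the real struct trampfd_reg may differ. */
struct reg_sketch {
	uint32_t reg;
	uint32_t reserved;
	uint64_t value;
};

/* Hypothetical header layout, followed by nregs entries. */
struct regs_sketch {
	uint32_t nregs;
	uint32_t reserved;
	struct reg_sketch regs[];
};

/* Size of a request carrying nregs entries: header plus the entries. */
size_t regs_request_size(uint32_t nregs)
{
	return sizeof(struct regs_sketch) +
	       (size_t)nregs * sizeof(struct reg_sketch);
}

/* Allocate a zeroed request; nregs == 0 yields a removal request. */
struct regs_sketch *regs_request_new(uint32_t nregs)
{
	struct regs_sketch *r = calloc(1, regs_request_size(nregs));

	if (r)
		r->nregs = nregs;
	return r;
}

/* Self-check: returns 0 if sizes and fields are as expected. */
int regs_demo(void)
{
	struct regs_sketch *rm = regs_request_new(0);
	int ok;

	assert(rm != NULL);
	ok = rm->nregs == 0 && rm->reserved == 0 &&
	     regs_request_size(0) == sizeof(*rm) &&
	     regs_request_size(2) ==
		sizeof(*rm) + 2 * sizeof(struct reg_sketch);
	free(rm);
	return ok ? 0 : -1;
}
```

Writing regs_request_size(0) bytes of such a zeroed header at the register
context offset would then correspond to the removal described above.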

Removing a trampoline
---------------------

To remove a trampoline, unmap it and close the file descriptor. When
the last reference on the trampoline goes away, the trampoline is freed.

Sharing trampolines
-------------------

A trampoline created by one thread can be used by other threads sharing
the same address space.

Trampolines, in general, may be shared across processes by the usual
mechanism of sending the file descriptor to another process over a Unix
domain socket.
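
The descriptor-passing mechanism referred to here is the standard SCM_RIGHTS
control message. A minimal sketch follows; send_fd(), recv_fd() and the
self-check are illustrative names, and the self-check passes a pipe
descriptor over a socketpair because a trampfd descriptor needs a patched
kernel.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for one descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over a Unix domain socket. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(). Returns the fd, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Self-check: pass a pipe's write end and use the received copy. */
int fd_passing_demo(void)
{
	int sv[2], p[2], got, ok;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || pipe(p))
		return -1;
	got = send_fd(sv[0], p[1]) ? -1 : recv_fd(sv[1]);
	ok = got >= 0 && write(got, "x", 1) == 1 &&
	     read(p[0], &c, 1) == 1 && c == 'x';
	assert(!ok || got != p[1]);	/* the receiver gets a new descriptor */
	if (got >= 0)
		close(got);
	close(sv[0]); close(sv[1]); close(p[0]); close(p[1]);
	return ok ? 0 : -1;
}
```

The receiving process gets its own descriptor referring to the same object;
with trampfd it would presumably still need to mmap() that descriptor itself
before it could invoke the trampoline.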

Architecture support
--------------------

The handling of the trampoline page fault and the setting up of the
register and stack contexts are architecture specific. Architecture
specific patches will implement support for the API.

The signal delivery code in the kernel already implements the elements
needed for this work. That will be leveraged.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 fs/Makefile                       |   1 +
 fs/trampfd/Makefile               |   6 ++
 fs/trampfd/trampfd_data.c         |  43 ++++++++
 fs/trampfd/trampfd_fops.c         | 131 +++++++++++++++++++++++
 fs/trampfd/trampfd_map.c          |  78 ++++++++++++++
 fs/trampfd/trampfd_pcs.c          |  95 +++++++++++++++++
 fs/trampfd/trampfd_regs.c         | 137 ++++++++++++++++++++++++
 fs/trampfd/trampfd_stack.c        | 131 +++++++++++++++++++++++
 fs/trampfd/trampfd_stubs.c        |  41 +++++++
 fs/trampfd/trampfd_syscall.c      |  92 ++++++++++++++++
 include/linux/syscalls.h          |   3 +
 include/linux/trampfd.h           |  82 ++++++++++++++
 include/uapi/asm-generic/unistd.h |   4 +-
 include/uapi/linux/trampfd.h      | 171 ++++++++++++++++++++++++++++++
 init/Kconfig                      |   8 ++
 kernel/sys_ni.c                   |   3 +
 16 files changed, 1025 insertions(+), 1 deletion(-)
 create mode 100644 fs/trampfd/Makefile
 create mode 100644 fs/trampfd/trampfd_data.c
 create mode 100644 fs/trampfd/trampfd_fops.c
 create mode 100644 fs/trampfd/trampfd_map.c
 create mode 100644 fs/trampfd/trampfd_pcs.c
 create mode 100644 fs/trampfd/trampfd_regs.c
 create mode 100644 fs/trampfd/trampfd_stack.c
 create mode 100644 fs/trampfd/trampfd_stubs.c
 create mode 100644 fs/trampfd/trampfd_syscall.c
 create mode 100644 include/linux/trampfd.h
 create mode 100644 include/uapi/linux/trampfd.h

diff --git a/fs/Makefile b/fs/Makefile
index 2ce5112b02c8..227761302000 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-$(CONFIG_TRAMPFD)		+= trampfd/
diff --git a/fs/trampfd/Makefile b/fs/trampfd/Makefile
new file mode 100644
index 000000000000..bdf5e487facc
--- /dev/null
+++ b/fs/trampfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_TRAMPFD) += trampfd.o
+
+trampfd-y += trampfd_data.o trampfd_fops.o trampfd_map.o trampfd_pcs.o
+trampfd-y += trampfd_regs.o trampfd_stack.o trampfd_stubs.o trampfd_syscall.o
diff --git a/fs/trampfd/trampfd_data.c b/fs/trampfd/trampfd_data.c
new file mode 100644
index 000000000000..0a316754cbe4
--- /dev/null
+++ b/fs/trampfd/trampfd_data.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Trampoline type-specific code.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/trampfd.h>
+
+int trampfd_create_data(struct trampfd *trampfd, const void __user *tramp_data)
+{
+	struct trampfd_map	*map = &trampfd->map;
+	struct trampfd_user	*user;
+
+	if (trampfd->type == TRAMPFD_USER) {
+		user = kmalloc(sizeof(*user), GFP_KERNEL);
+		if (!user)
+			return -ENOMEM;
+
+		if (copy_from_user(user, tramp_data, sizeof(*user))) {
+			kfree(user);
+			return -EFAULT;
+		}
+		if (user->flags || user->reserved) {
+			kfree(user);
+			return -EINVAL;
+		}
+		trampfd->data = user;
+
+		map->size = PAGE_SIZE;
+		map->prot = PROT_NONE;
+		map->flags = MAP_PRIVATE;
+		map->offset = 0;
+		map->ioffset = 0;
+	}
+	return 0;
+}
diff --git a/fs/trampfd/trampfd_fops.c b/fs/trampfd/trampfd_fops.c
new file mode 100644
index 000000000000..94b82e0da75b
--- /dev/null
+++ b/fs/trampfd/trampfd_fops.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - File operations.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/seq_file.h>
+#include <linux/trampfd.h>
+
+#ifdef CONFIG_PROC_FS
+static const char * const trampfd_type_names[TRAMPFD_NUM_TYPES] = {
+	"TRAMPFD_USER",
+};
+
+static void trampfd_show_fdinfo(struct seq_file *sfile, struct file *file)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	seq_printf(sfile, "type: %s\n", trampfd_type_names[trampfd->type]);
+}
+#endif
+
+static loff_t trampfd_llseek(struct file *file, loff_t offset, int whence)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (whence != SEEK_SET)
+		return -EINVAL;
+
+	if ((offset < 0) || (offset >= TRAMPFD_NUM_OFFSETS))
+		return -EINVAL;
+
+	mutex_lock(&trampfd->lock);
+	if (offset != file->f_pos) {
+		file->f_pos = offset;
+		file->f_version = 0;
+	}
+	mutex_unlock(&trampfd->lock);
+	return offset;
+}
+
+static ssize_t trampfd_read(struct file *file, char __user *arg,
+			    size_t count, loff_t *ppos)
+{
+	int		rc;
+
+	if (!arg || !count)
+		return -EINVAL;
+
+	switch (*ppos) {
+	case TRAMPFD_MAP_OFFSET:
+		rc = trampfd_get_map(file, arg, count);
+		break;
+
+	case TRAMPFD_REGS_OFFSET:
+		rc = trampfd_get_regs(file, arg, count);
+		break;
+
+	case TRAMPFD_STACK_OFFSET:
+		rc = trampfd_get_stack(file, arg, count);
+		break;
+
+	default:
+		rc = -EINVAL;
+		goto out;
+	}
+out:
+	return rc ? rc : (ssize_t) count;
+}
+
+static ssize_t trampfd_write(struct file *file, const char __user *arg,
+			     size_t count, loff_t *ppos)
+{
+	int		rc;
+
+	if (!arg || !count)
+		return -EINVAL;
+
+	switch (*ppos) {
+	case TRAMPFD_REGS_OFFSET:
+		rc = trampfd_set_regs(file, arg, count);
+		break;
+
+	case TRAMPFD_STACK_OFFSET:
+		rc = trampfd_set_stack(file, arg, count);
+		break;
+
+	case TRAMPFD_ALLOWED_PCS_OFFSET:
+		rc = trampfd_set_allowed_pcs(file, arg, count);
+		break;
+
+	default:
+		rc = -EINVAL;
+		goto out;
+	}
+out:
+	return rc ? rc : (ssize_t) count;
+}
+
+static int trampfd_release(struct inode *inode, struct file *file)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (trampfd->type == TRAMPFD_USER) {
+		kfree(trampfd->regs);
+		kfree(trampfd->stack);
+		kfree(trampfd->allowed_pcs);
+	}
+	kfree(trampfd->data);
+	mutex_destroy(&trampfd->lock);
+	kmem_cache_free(trampfd_cache, trampfd);
+	return 0;
+}
+
+const struct file_operations trampfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo		= trampfd_show_fdinfo,
+#endif
+	.llseek			= trampfd_llseek,
+	.read			= trampfd_read,
+	.write			= trampfd_write,
+	.release		= trampfd_release,
+	.mmap			= trampfd_mmap,
+	.get_unmapped_area	= trampfd_get_unmapped_area,
+};
diff --git a/fs/trampfd/trampfd_map.c b/fs/trampfd/trampfd_map.c
new file mode 100644
index 000000000000..1a156c850ca8
--- /dev/null
+++ b/fs/trampfd/trampfd_map.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Memory mapping.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/trampfd.h>
+
+int trampfd_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (trampfd->type == TRAMPFD_USER) {
+		/*
+		 * These mappings are special mappings that should not be
+		 * merged or inherited. No physical page is currently allocated
+		 * to these mappings. So, there is nothing to read/write.
+		 * When the trampoline is invoked, an execute fault must be
+		 * encountered so the kernel can intercept the invocation and
+		 * set up user context.
+		 */
+		if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
+			return -EINVAL;
+		vma->vm_flags = VM_SPECIAL | VM_DONTCOPY | VM_DONTDUMP;
+	}
+	vma->vm_private_data = trampfd;
+	return 0;
+}
+
+unsigned long
+trampfd_get_unmapped_area(struct file *file, unsigned long orig_addr,
+			  unsigned long len, unsigned long pgoff,
+			  unsigned long flags)
+{
+	struct trampfd		*trampfd = file->private_data;
+	struct trampfd_map	*map = &trampfd->map;
+	unsigned long		map_pgoff = map->offset >> PAGE_SHIFT;
+
+	const typeof_member(struct file_operations, get_unmapped_area)
+	get_area = current->mm->get_unmapped_area;
+
+	if (len != map->size || pgoff != map_pgoff || (flags != map->flags))
+		return -EINVAL;
+
+	return get_area(file, orig_addr, len, pgoff, flags);
+}
+
+/*
+ * Retrieve the mapping parameters of a trampoline.
+ */
+int trampfd_get_map(struct file *file, char __user *arg, size_t count)
+{
+	struct trampfd		*trampfd = file->private_data;
+
+	if (count != sizeof(trampfd->map))
+		return -EINVAL;
+	if (copy_to_user(arg, &trampfd->map, count))
+		return -EFAULT;
+	return 0;
+}
+
+bool is_trampfd_vma(struct vm_area_struct *vma)
+{
+	struct file	*file = vma->vm_file;
+
+	if (!file)
+		return false;
+	return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
+}
+EXPORT_SYMBOL_GPL(is_trampfd_vma);
diff --git a/fs/trampfd/trampfd_pcs.c b/fs/trampfd/trampfd_pcs.c
new file mode 100644
index 000000000000..0ed36fd2169f
--- /dev/null
+++ b/fs/trampfd/trampfd_pcs.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Allowed PCs context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy list of allowed PCs from the user and validate it.
+ */
+static int trampfd_copy_allowed_pcs(struct trampfd_values *allowed_pcs,
+			    const void __user *arg, size_t count)
+{
+	u32			npcs;
+	size_t			size;
+	u64			*values;
+	int			i;
+
+	if (copy_from_user(allowed_pcs, arg, count))
+		return -EFAULT;
+
+	if (allowed_pcs->reserved)
+		return -EINVAL;
+
+	npcs = allowed_pcs->nvalues;
+	if (npcs > TRAMPFD_MAX_PCS)
+		return -EINVAL;
+
+	size = sizeof(*allowed_pcs);
+	size += npcs * sizeof(u64);
+	if (size != count)
+		return -EINVAL;
+
+	values = allowed_pcs->values;
+	for (i = 0; i < npcs; i++) {
+		if (!values[i])
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Set the allowed PCs for a trampoline. If the trampoline has a register
+ * context at this point, the PC register value in that register context is
+ * not checked against this list of allowed PCs.
+ */
+int trampfd_set_allowed_pcs(struct file *file, const char __user *arg,
+			    size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_values		*allowed_pcs, *cur_allowed_pcs;
+	int				rc;
+
+	if (count < sizeof(*allowed_pcs) || count > TRAMPFD_MAX_PCS_SIZE)
+		return -EINVAL;
+
+	allowed_pcs = kmalloc(count, GFP_KERNEL);
+	if (!allowed_pcs)
+		return -ENOMEM;
+
+	rc = trampfd_copy_allowed_pcs(allowed_pcs, arg, count);
+	if (rc)
+		goto out;
+
+	/*
+	 * If number of PCs is 0, there is no new PCS to set.
+	 */
+	if (!allowed_pcs->nvalues) {
+		kfree(allowed_pcs);
+		allowed_pcs = NULL;
+	}
+
+	/*
+	 * Swap the new PCs with the current one and free the current one,
+	 * if any.
+	 */
+	mutex_lock(&trampfd->lock);
+
+	cur_allowed_pcs = trampfd->allowed_pcs;
+	trampfd->allowed_pcs = allowed_pcs;
+	allowed_pcs = cur_allowed_pcs;
+
+	mutex_unlock(&trampfd->lock);
+out:
+	kfree(allowed_pcs);
+	return rc;
+}
diff --git a/fs/trampfd/trampfd_regs.c b/fs/trampfd/trampfd_regs.c
new file mode 100644
index 000000000000..35114d647385
--- /dev/null
+++ b/fs/trampfd/trampfd_regs.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Register context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy context from the user and validate it.
+ */
+static int trampfd_copy_regs(struct trampfd_regs *regs, const void __user *arg,
+			     size_t count)
+{
+	u32			nregs;
+	size_t			size;
+
+	if (copy_from_user(regs, arg, count))
+		return -EFAULT;
+
+	if (regs->reserved)
+		return -EINVAL;
+
+	nregs = regs->nregs;
+	if (nregs > TRAMPFD_MAX_REGS)
+		return -EINVAL;
+
+	size = sizeof(*regs);
+	size += nregs * sizeof(struct trampfd_reg);
+	if (size != count)
+		return -EINVAL;
+
+	if (nregs && !trampfd_valid_regs(regs))
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Set the register context for a trampoline.
+ */
+int trampfd_set_regs(struct file *file, const char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_regs		*regs, *cur_regs;
+	int				rc;
+
+	if (count < sizeof(*regs) || count > TRAMPFD_MAX_REGS_SIZE)
+		return -EINVAL;
+
+	regs = kmalloc(count, GFP_KERNEL);
+	if (!regs)
+		return -ENOMEM;
+
+	rc = trampfd_copy_regs(regs, arg, count);
+	if (rc)
+		goto out;
+
+	/*
+	 * If nregs is 0, there is no new register context to set.
+	 */
+	if (!regs->nregs) {
+		kfree(regs);
+		regs = NULL;
+	}
+
+	/*
+	 * Swap the new register context with the current one and free the
+	 * current one, if any.
+	 */
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Check if the specified PC is allowed.
+	 */
+	if (!regs || trampfd_allowed_pc(trampfd, regs)) {
+		cur_regs = trampfd->regs;
+		trampfd->regs = regs;
+		regs = cur_regs;
+	} else {
+		rc = -EINVAL;
+	}
+
+	mutex_unlock(&trampfd->lock);
+out:
+	kfree(regs);
+	return rc;
+}
+
+/*
+ * Retrieve the register context of a trampoline.
+ */
+int trampfd_get_regs(struct file *file, char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_regs		*regs, *cur_regs;
+	size_t				size;
+	int				rc = 0;
+
+	if (count < sizeof(*regs) || count > TRAMPFD_MAX_REGS_SIZE)
+		return -EINVAL;
+
+	regs = kmalloc(count, GFP_KERNEL);
+	if (!regs)
+		return -ENOMEM;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Copy the current register context into a local buffer so we can
+	 * copy it to the user outside the lock.
+	 */
+	cur_regs = trampfd->regs;
+	if (cur_regs) {
+		size = sizeof(*cur_regs);
+		size += sizeof(struct trampfd_reg) * cur_regs->nregs;
+		if (size > count)
+			size = count;
+		memcpy(regs, cur_regs, size);
+	} else {
+		size = sizeof(*regs);
+		memset(regs, 0, size);
+	}
+
+	mutex_unlock(&trampfd->lock);
+
+	if (copy_to_user(arg, regs, size))
+		rc = -EFAULT;
+
+	kfree(regs);
+	return rc;
+}
diff --git a/fs/trampfd/trampfd_stack.c b/fs/trampfd/trampfd_stack.c
new file mode 100644
index 000000000000..032c5ed70d57
--- /dev/null
+++ b/fs/trampfd/trampfd_stack.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Stack context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy context from the user and validate it.
+ */
+static int trampfd_copy_stack(struct trampfd_stack *stack,
+			      const void __user *arg, size_t count)
+{
+	size_t			size;
+
+	if (copy_from_user(stack, arg, count))
+		return -EFAULT;
+
+	if (stack->reserved)
+		return -EINVAL;
+
+	size = stack->size;
+	if (size > TRAMPFD_MAX_DATA_SIZE)
+		return -EINVAL;
+
+	size += sizeof(*stack);
+	if (size != count)
+		return -EINVAL;
+
+	if (!stack->size)
+		return 0;
+
+	if ((stack->flags & ~TRAMPFD_SFLAGS) ||
+	    stack->offset > TRAMPFD_MAX_STACK_OFFSET)
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Set the stack context for a trampoline.
+ */
+int trampfd_set_stack(struct file *file, const char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_stack		*stack, *cur_stack;
+	int				rc;
+
+	if (count < sizeof(*stack) || count > TRAMPFD_MAX_STACK_SIZE)
+		return -EINVAL;
+
+	stack = kmalloc(count, GFP_KERNEL);
+	if (!stack)
+		return -ENOMEM;
+
+	rc = trampfd_copy_stack(stack, arg, count);
+	if (rc)
+		goto out;
+
+	/*
+	 * If size is 0, there is no new stack context to set.
+	 */
+	if (!stack->size) {
+		kfree(stack);
+		stack = NULL;
+	}
+
+	/*
+	 * Swap the new stack context with the current one and free the
+	 * current one, if any.
+	 */
+	mutex_lock(&trampfd->lock);
+
+	cur_stack = trampfd->stack;
+	trampfd->stack = stack;
+	stack = cur_stack;
+
+	mutex_unlock(&trampfd->lock);
+out:
+	kfree(stack);
+	return rc;
+}
+
+/*
+ * Retrieve the stack context of a trampoline.
+ */
+int trampfd_get_stack(struct file *file, char __user *arg, size_t count)
+{
+	struct trampfd			*trampfd = file->private_data;
+	struct trampfd_stack		*stack, *cur_stack;
+	size_t				size;
+	int				rc = 0;
+
+	if (count < sizeof(*stack) || count > TRAMPFD_MAX_STACK_SIZE)
+		return -EINVAL;
+
+	stack = kmalloc(count, GFP_KERNEL);
+	if (!stack)
+		return -ENOMEM;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Copy the current stack context into a local buffer so we can
+	 * copy it to the user outside the lock.
+	 */
+	cur_stack = trampfd->stack;
+	if (cur_stack) {
+		size = sizeof(*cur_stack) + cur_stack->size;
+		if (size > count)
+			size = count;
+		memcpy(stack, cur_stack, size);
+	} else {
+		size = sizeof(*stack);
+		memset(stack, 0, size);
+	}
+
+	mutex_unlock(&trampfd->lock);
+
+	if (copy_to_user(arg, stack, size))
+		rc = -EFAULT;
+
+	kfree(stack);
+	return rc;
+}
diff --git a/fs/trampfd/trampfd_stubs.c b/fs/trampfd/trampfd_stubs.c
new file mode 100644
index 000000000000..8ca29dccbbf7
--- /dev/null
+++ b/fs/trampfd/trampfd_stubs.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Stub functions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+
+/*
+ * Stub for the arch function that checks if a trampoline type is supported
+ * by the architecture. Return an error for all types that require architecture
+ * support. Return success for the rest as they are generic.
+ */
+int __attribute__((weak)) trampfd_check_arch(struct trampfd *trampfd)
+{
+	if (trampfd->type == TRAMPFD_USER)
+		return -EINVAL;
+	return 0;
+}
+
+/*
+ * Stub for the arch function that checks if a specified register context
+ * is valid.
+ */
+bool __attribute__((weak)) trampfd_valid_regs(struct trampfd_regs *regs)
+{
+	return false;
+}
+
+/*
+ * Stub for the arch function that checks if the PC register in a specified
+ * register context is allowed.
+ */
+bool __attribute__((weak)) trampfd_allowed_pc(struct trampfd *trampfd,
+					      struct trampfd_regs *regs)
+{
+	return false;
+}
diff --git a/fs/trampfd/trampfd_syscall.c b/fs/trampfd/trampfd_syscall.c
new file mode 100644
index 000000000000..675460afc521
--- /dev/null
+++ b/fs/trampfd/trampfd_syscall.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - System call.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/trampfd.h>
+
+char	*trampfd_name = "[trampfd]";
+
+struct kmem_cache	*trampfd_cache;
+
+SYSCALL_DEFINE3(trampfd_create,
+		int, tramp_type,
+		const void __user *, tramp_data,
+		unsigned int, flags)
+{
+	struct trampfd		*trampfd;
+	struct file		*file;
+	int			fd, rc = 0;
+
+	if (!trampfd_cache)
+		return -ENOMEM;
+
+	/*
+	 * Flags are for future use.
+	 */
+	if (flags || !tramp_data)
+		return -EINVAL;
+
+	if (tramp_type < 0 || tramp_type >= TRAMPFD_NUM_TYPES)
+		return -EINVAL;
+
+	trampfd = kmem_cache_zalloc(trampfd_cache, GFP_KERNEL);
+	if (!trampfd)
+		return -ENOMEM;
+
+	mutex_init(&trampfd->lock);
+	trampfd->type = tramp_type;
+
+	rc = trampfd_create_data(trampfd, tramp_data);
+	if (rc)
+		goto freetramp;
+
+	rc = trampfd_check_arch(trampfd);
+	if (rc)
+		goto freedata;
+
+	rc = get_unused_fd_flags(O_CLOEXEC);
+	if (rc < 0)
+		goto freedata;
+	fd = rc;
+
+	file = anon_inode_getfile(trampfd_name, &trampfd_fops, trampfd, O_RDWR);
+	if (IS_ERR(file)) {
+		rc = PTR_ERR(file);
+		goto freefd;
+	}
+	file->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+	fd_install(fd, file);
+	return fd;
+freefd:
+	put_unused_fd(fd);
+freedata:
+	kfree(trampfd->data);
+freetramp:
+	kmem_cache_free(trampfd_cache, trampfd);
+	return rc;
+}
+
+int __init trampfd_init(void)
+{
+	trampfd_cache = kmem_cache_create("trampfd_cache",
+		sizeof(struct trampfd), 0, SLAB_HWCACHE_ALIGN, NULL);
+
+	if (trampfd_cache == NULL) {
+		pr_warn("%s: kmem_cache_create failed\n", __func__);
+		return -ENOMEM;
+	}
+	return 0;
+}
+core_initcall(trampfd_init);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b951a87da987..25ddf29477bc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,9 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_trampfd_create(int tramp_type,
+				   const void __user *tramp_data,
+				   unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/linux/trampfd.h b/include/linux/trampfd.h
new file mode 100644
index 000000000000..383d7eeda2d1
--- /dev/null
+++ b/include/linux/trampfd.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Trampoline File Descriptor - Internal structures and definitions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+#ifndef _LINUX_TRAMPFD_H
+#define _LINUX_TRAMPFD_H
+
+#include <uapi/linux/trampfd.h>
+
+#define TRAMPFD_MAX_REGS_SIZE						\
+	(sizeof(struct trampfd_regs) +					\
+	(sizeof(struct trampfd_reg) * TRAMPFD_MAX_REGS))
+
+#define TRAMPFD_MAX_STACK_SIZE						\
+	(sizeof(struct trampfd_stack) + TRAMPFD_MAX_DATA_SIZE)
+
+#define TRAMPFD_MAX_PCS_SIZE						\
+	(sizeof(struct trampfd_values) + sizeof(u64) * TRAMPFD_MAX_PCS)
+
+/*
+ * Trampoline structure.
+ */
+struct trampfd {
+	struct mutex		lock;		/* to serialize access */
+	enum trampfd_type	type;		/* type of trampoline */
+	void			*data;		/* type specific data */
+	struct trampfd_map	map;		/* mmap() parameters */
+	struct trampfd_regs	*regs;		/* register context */
+	struct trampfd_stack	*stack;		/* stack context */
+	struct trampfd_values	*allowed_pcs;	/* allowed PCs */
+};
+
+#ifdef CONFIG_TRAMPFD
+
+/* Trampoline mapping */
+int trampfd_mmap(struct file *file, struct vm_area_struct *vma);
+unsigned long trampfd_get_unmapped_area(struct file *file,
+					unsigned long orig_addr,
+					unsigned long len,
+					unsigned long pgoff,
+					unsigned long flags);
+bool is_trampfd_vma(struct vm_area_struct *vma);
+
+/* Trampoline context */
+int trampfd_get_map(struct file *file, char __user *arg, size_t count);
+int trampfd_set_regs(struct file *file, const char __user *arg, size_t count);
+int trampfd_get_regs(struct file *file, char __user *arg, size_t count);
+int trampfd_set_stack(struct file *file, const char __user *arg, size_t count);
+int trampfd_get_stack(struct file *file, char __user *arg, size_t count);
+int trampfd_set_allowed_pcs(struct file *file, const char __user *arg,
+			    size_t count);
+
+/* Arch functions */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs);
+bool trampfd_valid_regs(struct trampfd_regs *regs);
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *regs);
+int trampfd_check_arch(struct trampfd *trampfd);
+
+/* Trampoline type-specific */
+int trampfd_create_data(struct trampfd *trampfd, const void __user *tramp_data);
+
+extern char				*trampfd_name;
+extern struct kmem_cache		*trampfd_cache;
+extern const struct file_operations	trampfd_fops;
+
+#define USERPTR(ptr)	((void __user *)(uintptr_t)(ptr))
+
+#else
+
+static inline bool trampfd_fault(struct vm_area_struct *vma,
+				 struct pt_regs *pt_regs)
+{
+	return false;
+}
+
+#endif /* CONFIG_TRAMPFD */
+
+#endif /* _LINUX_TRAMPFD_H */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f4a01305d9a6..14e526a45624 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_trampfd_create 440
+__SYSCALL(__NR_trampfd_create, sys_trampfd_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/trampfd.h b/include/uapi/linux/trampfd.h
new file mode 100644
index 000000000000..bf9a6ef3683b
--- /dev/null
+++ b/include/uapi/linux/trampfd.h
@@ -0,0 +1,171 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Trampoline File Descriptor - API structures and definitions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+#ifndef _UAPI_LINUX_TRAMPFD_H
+#define _UAPI_LINUX_TRAMPFD_H
+
+#include <linux/types.h>
+#include <linux/ptrace.h>
+
+/*
+ * All structure fields are defined so that they are the same width and at the
+ * same structure offset on 32-bit and 64-bit to avoid compat code.
+ *
+ * All fields named "reserved" must be set to 0. They are there primarily for
+ * alignment. But they may be used in the future.
+ */
+
+/* ------------------------- Types of Trampolines ------------------------- */
+
+/*
+ * TRAMPFD_USER
+ *	User programs use the kernel as a trampoline to set up a user context
+ *	and jump to a user function. This trampoline type can be used to
+ *	replace user trampoline code.
+ */
+enum trampfd_type {
+	TRAMPFD_USER,
+	TRAMPFD_NUM_TYPES,
+};
+
+/* ---------------------------- Context offsets ---------------------------- */
+
+/*
+ * A trampoline has different types of context associated with it. Each context
+ * type has a symbolic offset into trampfd. The context can be read from or
+ * written to at its symbolic offset in trampfd.
+ *
+ * TRAMPFD_MAP_OFFSET
+ *	To read trampoline mapping parameters - struct trampfd_map.
+ *
+ * TRAMPFD_REGS_OFFSET
+ *	To read/write trampoline register context - struct trampfd_regs.
+ *
+ * TRAMPFD_STACK_OFFSET
+ *	To read/write trampoline stack context - struct trampfd_stack.
+ *
+ * TRAMPFD_ALLOWED_PCS_OFFSET
+ *	To write a list of allowed PCs - struct trampfd_values.
+ */
+enum trampfd_offsets {
+	TRAMPFD_MAP_OFFSET,
+	TRAMPFD_REGS_OFFSET,
+	TRAMPFD_STACK_OFFSET,
+	TRAMPFD_ALLOWED_PCS_OFFSET,
+	TRAMPFD_NUM_OFFSETS,
+};
+
+/* ------------------- Trampoline type specific data -------------------- */
+
+/*
+ * For TRAMPFD_USER.
+ */
+struct trampfd_user {
+	__u32		flags;		/* for future enhancements */
+	__u32		reserved;
+};
+
+/* ------------------- Trampoline mapping parameters ---------------------- */
+
+/*
+ * Since the kernel implements the trampoline object, the kernel specifies
+ * how a trampoline should be mapped. User code must obtain these parameters
+ * and do an mmap() to map the trampoline. The first four parameters are used
+ * in the mmap() call. User code must add ioffset to the address returned by
+ * mmap() to get the actual invocation address for the trampoline.
+ */
+struct trampfd_map {
+	__u32			size;		/* Size of the mapping */
+	__u32			prot;		/* memory protection */
+	__u32			flags;		/* map flags */
+	__u32			offset;		/* file offset */
+	__u32			ioffset;	/* invocation offset */
+	__u32			reserved;
+};
+
+/* -------------------------- Register context -------------------------- */
+
+/*
+ * A register context may be specified for a trampoline, if applicable
+ * to the trampoline type. E.g., TRAMPFD_USER. The register context is
+ * an array of name-value pairs. When a trampoline is invoked, its user
+ * registers are loaded with the specified values. Register names are
+ * architecture specific and can be found in <linux/ptrace.h> for architectures
+ * that support trampolines. Enumerations reg_32_name and reg_64_name in
+ * <linux/ptrace.h> refer to 32-bit and 64-bit respectively.
+ */
+struct trampfd_reg {
+	__u32		name;		/* Register name */
+	__u32		reserved;
+	__u64		value;		/* Register value */
+};
+
+/*
+ * Register context. It is a variable sized structure sized by the number
+ * of registers.
+ */
+struct trampfd_regs {
+	__u32			nregs;		/* Number of registers */
+	__u32			reserved;
+	struct trampfd_reg	regs[0];	/* Array of registers */
+};
+
+#define TRAMPFD_MAX_REGS		40
+
+/* ---------------------------- Stack context ---------------------------- */
+
+/*
+ * A stack context may be specified for a trampoline, if applicable
+ * to the trampoline type. E.g., TRAMPFD_USER. The stack context contains
+ * a data buffer. When a trampoline is invoked, the specified data is pushed
+ * on the stack at a specified offset from the current stack pointer.
+ * Optionally, the stack pointer can be moved to the top of the data.
+ *
+ * This is a variable sized structure sized by the amount of data that is
+ * to be pushed on the user stack.
+ */
+struct trampfd_stack {
+	__u32		flags;			/* TRAMPFD_SFLAGS */
+	__u32		offset;			/* Offset from top of stack */
+	__u32		size;			/* Size of data to push */
+	__u32		reserved;
+	__u8		data[0];		/* Data to push on the stack */
+};
+
+#define TRAMPFD_MAX_DATA_SIZE		64
+#define TRAMPFD_MAX_STACK_OFFSET	256
+
+/*
+ * Stack context flags:
+ *
+ * TRAMPFD_SET_SP
+ *	After pushing the data to user stack, move the stack pointer to the
+ *	base of the data pushed. Note that the kernel will align the stack
+ *	pointer based on the alignment requirements of the architecture.
+ */
+#define TRAMPFD_SET_SP		0x1
+#define TRAMPFD_SFLAGS		(TRAMPFD_SET_SP)
+
+/* ---------------------------- Values context ---------------------------- */
+
+/*
+ * Some contexts may be just a list of values. For instance, the user can
+ * specify a list of allowed PCs for a trampoline. The following structure
+ * is used for those contexts.
+ */
+struct trampfd_values {
+	__u32		nvalues;		/* number of values */
+	__u32		reserved;
+	__u64		values[0];		/* Array of values */
+};
+
+#define TRAMPFD_MAX_PCS		16
+
+/* -------------------------------------------------------------------------- */
+
+#endif /* _UAPI_LINUX_TRAMPFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..783a0b98fce1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2313,3 +2313,11 @@ config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 # <asm/syscall_wrapper.h>.
 config ARCH_HAS_SYSCALL_WRAPPER
 	def_bool n
+
+config TRAMPFD
+	bool "Enable trampfd_create() system call"
+	depends on MMU
+	help
+	  Enable the trampfd_create() system call that allows a process to
+	  map trampolines within its address space that can be invoked
+	  with the help of the kernel.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..136acf9234a3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -349,6 +349,9 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* Trampoline fd */
+COND_SYSCALL(trampfd_create);
+
 
 /*
  * Architecture specific weak syscall entries.
-- 
2.17.1
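
The size rule that trampfd_copy_regs() enforces in the patch above (the byte
count must equal the fixed header plus exactly nregs entries) can be sketched
in user space. This is only a minimal sketch and not part of the patch: the
two structures are local mirrors of the uapi definitions, and regs_ctx_size()
and regs_ctx_count_ok() are hypothetical helper names introduced for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the uapi structures in include/uapi/linux/trampfd.h. */
struct trampfd_reg {
	uint32_t	name;		/* Register name */
	uint32_t	reserved;	/* Must be 0 */
	uint64_t	value;		/* Register value */
};

struct trampfd_regs {
	uint32_t		nregs;		/* Number of registers */
	uint32_t		reserved;	/* Must be 0 */
	struct trampfd_reg	regs[];		/* Array of registers */
};

#define TRAMPFD_MAX_REGS	40

/* Hypothetical helper: bytes needed for a context holding nregs entries. */
static size_t regs_ctx_size(uint32_t nregs)
{
	assert(nregs <= TRAMPFD_MAX_REGS);
	return sizeof(struct trampfd_regs) +
	       nregs * sizeof(struct trampfd_reg);
}

/*
 * Hypothetical helper: returns 1 if a buffer of count bytes that claims
 * nregs entries would pass the size checks, 0 otherwise.
 */
static int regs_ctx_count_ok(uint32_t nregs, size_t count)
{
	if (nregs > TRAMPFD_MAX_REGS)
		return 0;
	return count == regs_ctx_size(nregs);
}
```

A caller would presumably allocate regs_ctx_size(n) bytes, fill in nregs and
the entries, and leave every reserved field at 0 before handing the buffer to
the kernel. A buffer that is even one byte longer than the computed size would
be rejected with -EINVAL by the check shown in the patch.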



* [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-30  9:06     ` Greg KH
  2020-07-28 13:10   ` [PATCH v1 3/4] [RFC] arm64/trampfd: " madvenka
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Implement 32-bit and 64-bit X86 support for the trampoline file descriptor.

	- Define architecture specific register names
	- Handle the trampoline invocation page fault
	- Set up the user register context on trampoline invocation
	- Set up the user stack context on trampoline invocation
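
The stack setup in the last item comes down to a small piece of arithmetic.
The user-space sketch below mirrors the stack pointer computation done by
push_data() in this patch. It is an illustration only: tramp_sp() is a
hypothetical name, and the reading that the final adjustment mimics the stack
state just after a call instruction (return address already pushed) is an
interpretation, not something the patch states.

```c
#include <assert.h>
#include <stdint.h>

#define TRAMPFD_MAX_DATA_SIZE		64
#define TRAMPFD_MAX_STACK_OFFSET	256

/*
 * Hypothetical helper: compute where the pushed data would start.
 *   sp      current user stack pointer
 *   size    number of data bytes to push
 *   offset  gap to leave below the current stack pointer
 *   set_sp  non-zero if TRAMPFD_SET_SP is requested
 *   compat  non-zero for a 32-bit task
 */
static uint64_t tramp_sp(uint64_t sp, uint32_t size, uint32_t offset,
			 int set_sp, int compat)
{
	assert(size <= TRAMPFD_MAX_DATA_SIZE);
	assert(offset <= TRAMPFD_MAX_STACK_OFFSET);

	sp = sp - size - offset;
	if (set_sp) {
		if (compat)	/* keep (sp + 4) 16-byte aligned */
			sp = ((sp + 4) & ~(uint64_t)15) - 4;
		else		/* keep (sp + 8) 16-byte aligned */
			sp = (sp & ~(uint64_t)15) - 8;
	}
	return sp;
}
```

For example, with sp = 0x7ffd1000, 16 bytes of data and no offset, the 64-bit
case yields 0x7ffd0fe8 and the 32-bit case yields 0x7ffd0fec. Without
TRAMPFD_SET_SP no alignment is applied and the data simply starts at
sp - size - offset.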

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/uapi/asm/ptrace.h     |  38 +++
 arch/x86/kernel/Makefile               |   2 +
 arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
 arch/x86/mm/fault.c                    |  11 +
 6 files changed, 366 insertions(+)
 create mode 100644 arch/x86/kernel/trampfd.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d8f8a1a69ed1..77eb50414591 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
 437	i386	openat2			sys_openat2
 438	i386	pidfd_getfd		sys_pidfd_getfd
 439	i386	faccessat2		sys_faccessat2
+440	i386	trampfd_create		sys_trampfd_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 78847b32e137..9d962de1d21f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
 437	common	openat2			sys_openat2
 438	common	pidfd_getfd		sys_pidfd_getfd
 439	common	faccessat2		sys_faccessat2
+440	common	trampfd_create		sys_trampfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/x86/include/uapi/asm/ptrace.h b/arch/x86/include/uapi/asm/ptrace.h
index 85165c0edafc..b031598f857e 100644
--- a/arch/x86/include/uapi/asm/ptrace.h
+++ b/arch/x86/include/uapi/asm/ptrace.h
@@ -9,6 +9,44 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+	x32_eax,
+	x32_ebx,
+	x32_ecx,
+	x32_edx,
+	x32_esi,
+	x32_edi,
+	x32_ebp,
+	x32_eip,
+	x32_max,
+};
+
+/*
+ * These register names are to be used by 64-bit applications.
+ */
+enum reg_64_name {
+	x64_rax = x32_max,
+	x64_rbx,
+	x64_rcx,
+	x64_rdx,
+	x64_rsi,
+	x64_rdi,
+	x64_rbp,
+	x64_r8,
+	x64_r9,
+	x64_r10,
+	x64_r11,
+	x64_r12,
+	x64_r13,
+	x64_r14,
+	x64_r15,
+	x64_rip,
+	x64_max,
+};
+
 #ifdef __i386__
 /* this struct defines the way the registers are stored on the
    stack during a system call. */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e77261db2391..5d968ac4c7d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -157,3 +157,5 @@ ifeq ($(CONFIG_X86_64),y)
 endif
 
 obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)	+= ima_arch.o
+
+obj-$(CONFIG_TRAMPFD)			+= trampfd.o
diff --git a/arch/x86/kernel/trampfd.c b/arch/x86/kernel/trampfd.c
new file mode 100644
index 000000000000..f6b5507134d2
--- /dev/null
+++ b/arch/x86/kernel/trampfd.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - X86 support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/thread_info.h>
+#include <linux/mm_types.h>
+#include <linux/trampfd.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static inline bool is_compat(void)
+{
+	return (IS_ENABLED(CONFIG_X86_32) ||
+		(IS_ENABLED(CONFIG_COMPAT) && test_thread_flag(TIF_ADDR32)));
+}
+
+static void set_reg_32(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case x32_eax:
+		pt_regs->ax = (unsigned long)value;
+		break;
+	case x32_ebx:
+		pt_regs->bx = (unsigned long)value;
+		break;
+	case x32_ecx:
+		pt_regs->cx = (unsigned long)value;
+		break;
+	case x32_edx:
+		pt_regs->dx = (unsigned long)value;
+		break;
+	case x32_esi:
+		pt_regs->si = (unsigned long)value;
+		break;
+	case x32_edi:
+		pt_regs->di = (unsigned long)value;
+		break;
+	case x32_ebp:
+		pt_regs->bp = (unsigned long)value;
+		break;
+	case x32_eip:
+		pt_regs->ip = (unsigned long)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+#ifdef __i386__
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+}
+
+#else
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case x64_rax:
+		pt_regs->ax = (unsigned long)value;
+		break;
+	case x64_rbx:
+		pt_regs->bx = (unsigned long)value;
+		break;
+	case x64_rcx:
+		pt_regs->cx = (unsigned long)value;
+		break;
+	case x64_rdx:
+		pt_regs->dx = (unsigned long)value;
+		break;
+	case x64_rsi:
+		pt_regs->si = (unsigned long)value;
+		break;
+	case x64_rdi:
+		pt_regs->di = (unsigned long)value;
+		break;
+	case x64_rbp:
+		pt_regs->bp = (unsigned long)value;
+		break;
+	case x64_r8:
+		pt_regs->r8 = (unsigned long)value;
+		break;
+	case x64_r9:
+		pt_regs->r9 = (unsigned long)value;
+		break;
+	case x64_r10:
+		pt_regs->r10 = (unsigned long)value;
+		break;
+	case x64_r11:
+		pt_regs->r11 = (unsigned long)value;
+		break;
+	case x64_r12:
+		pt_regs->r12 = (unsigned long)value;
+		break;
+	case x64_r13:
+		pt_regs->r13 = (unsigned long)value;
+		break;
+	case x64_r14:
+		pt_regs->r14 = (unsigned long)value;
+		break;
+	case x64_r15:
+		pt_regs->r15 = (unsigned long)value;
+		break;
+	case x64_rip:
+		pt_regs->ip = (unsigned long)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+#endif /* __i386__ */
+
+static void set_regs(struct pt_regs *pt_regs, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	bool			compat = is_compat();
+
+	for (; reg < reg_end; reg++) {
+		if (compat)
+			set_reg_32(pt_regs, reg->name, reg->value);
+		else
+			set_reg_64(pt_regs, reg->name, reg->value);
+	}
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	int			min, max, pc_name;
+	bool			pc_set = false;
+
+	if (is_compat()) {
+		min = 0;
+		pc_name = x32_eip;
+		max = x32_max;
+	} else {
+		min = x32_max;
+		pc_name = x64_rip;
+		max = x64_max;
+	}
+
+	for (; reg < reg_end; reg++) {
+		if (reg->name < min || reg->name >= max || reg->reserved)
+			return false;
+		if (reg->name == pc_name && reg->value)
+			pc_set = true;
+	}
+	return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	struct trampfd_values	*allowed_pcs = trampfd->allowed_pcs;
+	u64			*allowed_values, pc_value = 0;
+	u32			nvalues, pc_name;
+	int			i;
+
+	if (!allowed_pcs)
+		return true;
+
+	pc_name = is_compat() ? x32_eip : x64_rip;
+
+	/*
+	 * Find the PC register and its value. If the PC register has been
+	 * specified multiple times, only the last one counts.
+	 */
+	for (; reg < reg_end; reg++) {
+		if (reg->name == pc_name)
+			pc_value = reg->value;
+	}
+
+	allowed_values = allowed_pcs->values;
+	nvalues = allowed_pcs->nvalues;
+
+	for (i = 0; i < nvalues; i++) {
+		if (pc_value == allowed_values[i])
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(struct pt_regs *pt_regs, struct trampfd_stack *tstack)
+{
+	unsigned long	sp;
+
+	sp = user_stack_pointer(pt_regs) - tstack->size - tstack->offset;
+	if (tstack->flags & TRAMPFD_SET_SP) {
+		if (is_compat())
+			sp = ((sp + 4) & -16ul) - 4;
+		else
+			sp = round_down(sp, 16) - 8;
+	}
+
+	if (!access_ok(sp, user_stack_pointer(pt_regs) - sp))
+		return -EFAULT;
+
+	if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+		return -EFAULT;
+
+	if (tstack->flags & TRAMPFD_SET_SP)
+		user_stack_pointer_set(pt_regs, sp);
+
+	return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+			      struct vm_area_struct *vma,
+			      struct pt_regs *pt_regs)
+{
+	char			buf[TRAMPFD_MAX_STACK_SIZE];
+	struct trampfd_regs	*tregs;
+	struct trampfd_stack	*tstack = NULL;
+	unsigned long		addr;
+	size_t			size;
+	int			rc = 0;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Execution of the trampoline must start at the offset specified by
+	 * the kernel.
+	 */
+	addr = vma->vm_start + trampfd->map.ioffset;
+	if (addr != pt_regs->ip) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * At a minimum, the user PC register must be specified for a
+	 * user trampoline.
+	 */
+	tregs = trampfd->regs;
+	if (!tregs) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * Set the register context for the trampoline.
+	 */
+	set_regs(pt_regs, tregs);
+
+	if (trampfd->stack) {
+		/*
+		 * Copy the stack context into a local buffer and push stack
+		 * data after dropping the lock.
+		 */
+		size = sizeof(*trampfd->stack) + trampfd->stack->size;
+		tstack = (struct trampfd_stack *) buf;
+		memcpy(tstack, trampfd->stack, size);
+	}
+unlock:
+	mutex_unlock(&trampfd->lock);
+
+	if (!rc && tstack) {
+		mmap_read_unlock(vma->vm_mm);
+		rc = push_data(pt_regs, tstack);
+		mmap_read_lock(vma->vm_mm);
+	}
+	return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+	struct trampfd		*trampfd;
+
+	if (!is_trampfd_vma(vma))
+		return false;
+	trampfd = vma->vm_private_data;
+
+	if (trampfd->type == TRAMPFD_USER)
+		return !trampfd_user_fault(trampfd, vma, pt_regs);
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ------------------------- Arch Initialization ------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1ead568c0101..a1432ee2a1a2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>		/* faulthandler_disabled()	*/
 #include <linux/efi.h>			/* efi_recover_from_page_fault()*/
 #include <linux/mm_types.h>
+#include <linux/trampfd.h>		/* trampoline invocation */
 
 #include <asm/cpufeature.h>		/* boot_cpu_has, ...		*/
 #include <asm/traps.h>			/* dotraplinkage, ...		*/
@@ -1142,6 +1143,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct mm_struct *mm;
 	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
+	unsigned long tflags = X86_PF_INSTR | X86_PF_USER;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1275,6 +1277,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 good_area:
 	if (unlikely(access_error(hw_error_code, vma))) {
+		/*
+		 * If it is a user execute fault, it could be a trampoline
+		 * invocation.
+		 */
+		if ((hw_error_code & tflags) == tflags &&
+		    trampfd_fault(vma, regs)) {
+			mmap_read_unlock(mm);
+			return;
+		}
 		bad_area_access_error(regs, hw_error_code, address, vma);
 		return;
 	}
-- 
2.17.1
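
The allowed-PC check in trampfd_allowed_pc() above has two details that are
easy to miss: a missing allow-list accepts every PC, and if the PC register
appears more than once in the register context, only the last value is
checked. A minimal user-space sketch of that logic follows. The struct reg
type and the pc_allowed() name are hypothetical simplifications, not the
patch's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct trampfd_reg (no reserved field). */
struct reg {
	uint32_t	name;
	uint64_t	value;
};

/*
 * Hypothetical helper: return true if the PC named pc_name in regs[] is
 * on the allow-list, or if no allow-list exists at all.
 */
static bool pc_allowed(const struct reg *regs, size_t nregs, uint32_t pc_name,
		       const uint64_t *allowed, size_t nallowed)
{
	uint64_t	pc = 0;
	size_t		i;

	assert(regs || nregs == 0);

	/* No allow-list configured: every PC is accepted. */
	if (!allowed)
		return true;

	/* If the PC register is given several times, the last one counts. */
	for (i = 0; i < nregs; i++) {
		if (regs[i].name == pc_name)
			pc = regs[i].value;
	}

	for (i = 0; i < nallowed; i++) {
		if (pc == allowed[i])
			return true;
	}
	return false;
}
```

Note that under this logic an allow-list that exists but holds no values
rejects every register context, since no PC can match.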



* [PATCH v1 3/4] [RFC] arm64/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
  2020-07-28 13:10   ` [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-28 13:10   ` [PATCH v1 4/4] [RFC] arm/trampfd: " madvenka
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Implement 64-bit ARM support for the trampoline file descriptor.

	- Define architecture specific register names
	- Handle the trampoline invocation page fault
	- Set up the user register context on trampoline invocation
	- Set up the user stack context on trampoline invocation

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm64/include/asm/ptrace.h      |   9 +
 arch/arm64/include/asm/unistd.h      |   2 +-
 arch/arm64/include/asm/unistd32.h    |   2 +
 arch/arm64/include/uapi/asm/ptrace.h |  57 ++++++
 arch/arm64/kernel/Makefile           |   2 +
 arch/arm64/kernel/trampfd.c          | 278 +++++++++++++++++++++++++++
 arch/arm64/mm/fault.c                |  15 +-
 7 files changed, 361 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/kernel/trampfd.c

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index 953b6a1ce549..dad6cdbd59c6 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -232,6 +232,15 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs)
 	return regs->sp;
 }
 
+static inline void user_stack_pointer_set(struct pt_regs *regs,
+					  unsigned long val)
+{
+	if (compat_user_mode(regs))
+		regs->compat_sp = val;
+	else
+		regs->sp = val;
+}
+
 extern int regs_query_register_offset(const char *name);
 extern unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs,
 					       unsigned int n);
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		440
+#define __NR_compat_syscalls		441
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f..821ddcaf9683 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_trampfd_create 440
+__SYSCALL(__NR_trampfd_create, sys_trampfd_create)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h
index 42cbe34d95ce..f4d1974dd795 100644
--- a/arch/arm64/include/uapi/asm/ptrace.h
+++ b/arch/arm64/include/uapi/asm/ptrace.h
@@ -88,6 +88,63 @@ struct user_pt_regs {
 	__u64		pstate;
 };
 
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+	arm_r0,
+	arm_r1,
+	arm_r2,
+	arm_r3,
+	arm_r4,
+	arm_r5,
+	arm_r6,
+	arm_r7,
+	arm_r8,
+	arm_r9,
+	arm_r10,
+	arm_ip,
+	arm_pc,
+	arm_max,
+};
+
+/*
+ * These register names are to be used by 64-bit applications.
+ */
+enum reg_64_name {
+	arm64_r0 = arm_max,
+	arm64_r1,
+	arm64_r2,
+	arm64_r3,
+	arm64_r4,
+	arm64_r5,
+	arm64_r6,
+	arm64_r7,
+	arm64_r8,
+	arm64_r9,
+	arm64_r10,
+	arm64_r11,
+	arm64_r12,
+	arm64_r13,
+	arm64_r14,
+	arm64_r15,
+	arm64_r16,
+	arm64_r17,
+	arm64_r18,
+	arm64_r19,
+	arm64_r20,
+	arm64_r21,
+	arm64_r22,
+	arm64_r23,
+	arm64_r24,
+	arm64_r25,
+	arm64_r26,
+	arm64_r27,
+	arm64_r28,
+	arm64_pc,
+	arm64_max,
+};
+
 struct user_fpsimd_state {
 	__uint128_t	vregs[32];
 	__u32		fpsr;
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index a561cbb91d4d..18d373fb1208 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -71,3 +71,5 @@ extra-y					+= $(head-y) vmlinux.lds
 ifeq ($(CONFIG_DEBUG_EFI),y)
 AFLAGS_head.o += -DVMLINUX_PATH="\"$(realpath $(objtree)/vmlinux)\""
 endif
+
+obj-$(CONFIG_TRAMPFD)			+= trampfd.o
diff --git a/arch/arm64/kernel/trampfd.c b/arch/arm64/kernel/trampfd.c
new file mode 100644
index 000000000000..d79e749e0c30
--- /dev/null
+++ b/arch/arm64/kernel/trampfd.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - ARM64 support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static inline bool is_compat(void)
+{
+	return is_compat_thread(task_thread_info(current));
+}
+
+static void set_reg_32(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case arm_r0:
+	case arm_r1:
+	case arm_r2:
+	case arm_r3:
+	case arm_r4:
+	case arm_r5:
+	case arm_r6:
+	case arm_r7:
+	case arm_r8:
+	case arm_r9:
+	case arm_r10:
+		pt_regs->regs[name] = (__u64)value;
+		break;
+	case arm_ip:
+		pt_regs->regs[arm64_r16 - arm_max] = (__u64)value;
+		break;
+	case arm_pc:
+		pt_regs->pc = (__u64)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+	switch (name) {
+	case arm64_r0:
+	case arm64_r1:
+	case arm64_r2:
+	case arm64_r3:
+	case arm64_r4:
+	case arm64_r5:
+	case arm64_r6:
+	case arm64_r7:
+	case arm64_r8:
+	case arm64_r9:
+	case arm64_r10:
+	case arm64_r11:
+	case arm64_r12:
+	case arm64_r13:
+	case arm64_r14:
+	case arm64_r15:
+	case arm64_r16:
+	case arm64_r17:
+	case arm64_r18:
+	case arm64_r19:
+	case arm64_r20:
+	case arm64_r21:
+	case arm64_r22:
+	case arm64_r23:
+	case arm64_r24:
+	case arm64_r25:
+	case arm64_r26:
+	case arm64_r27:
+	case arm64_r28:
+		pt_regs->regs[name - arm_max] = (__u64)value;
+		break;
+	case arm64_pc:
+		pt_regs->pc = (__u64)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+static void set_regs(struct pt_regs *pt_regs, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	bool			compat = is_compat();
+
+	for (; reg < reg_end; reg++) {
+		if (compat)
+			set_reg_32(pt_regs, reg->name, reg->value);
+		else
+			set_reg_64(pt_regs, reg->name, reg->value);
+	}
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	int			min, max, pc_name;
+	bool			pc_set = false;
+
+	if (is_compat()) {
+		min = 0;
+		pc_name = arm_pc;
+		max = arm_max;
+	} else {
+		min = arm_max;
+		pc_name = arm64_pc;
+		max = arm64_max;
+	}
+
+	for (; reg < reg_end; reg++) {
+		if (reg->name < min || reg->name >= max || reg->reserved)
+			return false;
+		if (reg->name == pc_name && reg->value)
+			pc_set = true;
+	}
+	return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	struct trampfd_values	*allowed_pcs = trampfd->allowed_pcs;
+	u64			*allowed_values, pc_value = 0;
+	u32			nvalues, pc_name;
+	int			i;
+
+	if (!allowed_pcs)
+		return true;
+
+	pc_name = is_compat() ? arm_pc : arm64_pc;
+
+	/*
+	 * Find the PC register and its value. If the PC register has been
+	 * specified multiple times, only the last one counts.
+	 */
+	for (; reg < reg_end; reg++) {
+		if (reg->name == pc_name)
+			pc_value = reg->value;
+	}
+
+	allowed_values = allowed_pcs->values;
+	nvalues = allowed_pcs->nvalues;
+
+	for (i = 0; i < nvalues; i++) {
+		if (pc_value == allowed_values[i])
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(struct pt_regs *pt_regs, struct trampfd_stack *tstack)
+{
+	unsigned long	sp;
+
+	sp = user_stack_pointer(pt_regs) - tstack->size - tstack->offset;
+	if (tstack->flags & TRAMPFD_SET_SP)
+		sp = round_down(sp, 16);
+
+	if (!access_ok((void *)sp, user_stack_pointer(pt_regs) - sp))
+		return -EFAULT;
+
+	if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+		return -EFAULT;
+
+	if (tstack->flags & TRAMPFD_SET_SP)
+		user_stack_pointer_set(pt_regs, sp);
+
+	return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+			      struct vm_area_struct *vma,
+			      struct pt_regs *pt_regs)
+{
+	char			buf[TRAMPFD_MAX_STACK_SIZE];
+	struct trampfd_regs	*tregs;
+	struct trampfd_stack	*tstack = NULL;
+	unsigned long		addr;
+	size_t			size;
+	int			rc = 0;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Execution of the trampoline must start at the offset specified by
+	 * the kernel.
+	 */
+	addr = vma->vm_start + trampfd->map.ioffset;
+	if (addr != pt_regs->pc) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * At a minimum, the user PC register must be specified for a
+	 * user trampoline.
+	 */
+	tregs = trampfd->regs;
+	if (!tregs) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * Set the register context for the trampoline.
+	 */
+	set_regs(pt_regs, tregs);
+
+	if (trampfd->stack) {
+		/*
+		 * Copy the stack context into a local buffer and push stack
+		 * data after dropping the lock.
+		 */
+		size = sizeof(*trampfd->stack) + trampfd->stack->size;
+		tstack = (struct trampfd_stack *) buf;
+		memcpy(tstack, trampfd->stack, size);
+	}
+unlock:
+	mutex_unlock(&trampfd->lock);
+
+	if (!rc && tstack) {
+		mmap_read_unlock(vma->vm_mm);
+		rc = push_data(pt_regs, tstack);
+		mmap_read_lock(vma->vm_mm);
+	}
+	return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+	struct trampfd		*trampfd;
+
+	if (!is_trampfd_vma(vma))
+		return false;
+	trampfd = vma->vm_private_data;
+
+	if (trampfd->type == TRAMPFD_USER)
+		return !trampfd_user_fault(trampfd, vma, pt_regs);
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ---------------------------- Miscellaneous ---------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8afb238ff335..6e5e3193919a 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
 #include <linux/perf_event.h>
 #include <linux/preempt.h>
 #include <linux/hugetlb.h>
+#include <linux/trampfd.h>
 
 #include <asm/acpi.h>
 #include <asm/bug.h>
@@ -404,7 +405,8 @@ static void do_bad_area(unsigned long addr, unsigned int esr, struct pt_regs *re
 #define VM_FAULT_BADACCESS	0x020000
 
 static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
-			   unsigned int mm_flags, unsigned long vm_flags)
+			   unsigned int mm_flags, unsigned long vm_flags,
+			   struct pt_regs *regs)
 {
 	struct vm_area_struct *vma = find_vma(mm, addr);
 
@@ -426,8 +428,15 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
 	 * Check that the permissions on the VMA allow for the fault which
 	 * occurred.
 	 */
-	if (!(vma->vm_flags & vm_flags))
+	if (!(vma->vm_flags & vm_flags)) {
+		/*
+		 * If it is an execute fault, it could be a trampoline
+		 * invocation.
+		 */
+		if ((vm_flags & VM_EXEC) && trampfd_fault(vma, regs))
+			return 0;
 		return VM_FAULT_BADACCESS;
+	}
 	return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
 }
 
@@ -516,7 +525,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 #endif
 	}
 
-	fault = __do_page_fault(mm, addr, mm_flags, vm_flags);
+	fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs);
 	major |= fault & VM_FAULT_MAJOR;
 
 	/* Quick path to respond to signals */
-- 
2.17.1



* [PATCH v1 4/4] [RFC] arm/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (2 preceding siblings ...)
  2020-07-28 13:10   ` [PATCH v1 3/4] [RFC] arm64/trampfd: " madvenka
@ 2020-07-28 13:10   ` madvenka
  2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 49+ messages in thread
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
  To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86,
	madvenka

From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

Implement 32-bit ARM support for the trampoline file descriptor.

	- Define architecture-specific register names
	- Handle the trampoline invocation page fault
	- Set up the user register context on trampoline invocation
	- Set up the user stack context on trampoline invocation
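
The stack-context step amounts to a small piece of pointer arithmetic,
sketched below as userspace C. The function name tramp_sp() and its
parameters are hypothetical; the computation follows push_data() in
this patch (subtract the data size and offset, then align down only
when the new stack pointer is to be installed). This patch aligns to
8 bytes; the arm64 patch in this series rounds down to 16.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the address at which trampoline stack data would be written.
 * A 32-bit model; `align` must be a power of two. */
static uint32_t tramp_sp(uint32_t sp, uint32_t size, uint32_t offset,
			 uint32_t align, int set_sp)
{
	uint32_t new_sp;

	/* Sketch-only guard against wrap-around; not taken from the patch. */
	assert(size <= sp && offset <= sp - size);

	new_sp = sp - size - offset;
	if (set_sp)
		new_sp &= ~(align - 1);
	return new_sp;
}
```

For example, with sp = 0x7fff0010, 12 bytes of data and no offset, the
unaligned result is 0x7fff0004; with the set-SP flag and 8-byte
alignment it becomes 0x7fff0000.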

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm/include/uapi/asm/ptrace.h |  20 +++
 arch/arm/kernel/Makefile           |   1 +
 arch/arm/kernel/trampfd.c          | 214 +++++++++++++++++++++++++++++
 arch/arm/mm/fault.c                |  12 +-
 arch/arm/tools/syscall.tbl         |   1 +
 5 files changed, 246 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm/kernel/trampfd.c

diff --git a/arch/arm/include/uapi/asm/ptrace.h b/arch/arm/include/uapi/asm/ptrace.h
index e61c65b4018d..47b1c5e2f32c 100644
--- a/arch/arm/include/uapi/asm/ptrace.h
+++ b/arch/arm/include/uapi/asm/ptrace.h
@@ -151,6 +151,26 @@ struct pt_regs {
 #define ARM_r0		uregs[0]
 #define ARM_ORIG_r0	uregs[17]
 
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+	arm_r0,
+	arm_r1,
+	arm_r2,
+	arm_r3,
+	arm_r4,
+	arm_r5,
+	arm_r6,
+	arm_r7,
+	arm_r8,
+	arm_r9,
+	arm_r10,
+	arm_ip,
+	arm_pc,
+	arm_max,
+};
+
 /*
  * The size of the user-visible VFP state as seen by PTRACE_GET/SETVFPREGS
  * and core dumps.
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 89e5d864e923..652c54c2f19a 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -105,5 +105,6 @@ obj-$(CONFIG_SMP)		+= psci_smp.o
 endif
 
 obj-$(CONFIG_HAVE_ARM_SMCCC)	+= smccc-call.o
+obj-$(CONFIG_TRAMPFD)		+= trampfd.o
 
 extra-y := $(head-y) vmlinux.lds
diff --git a/arch/arm/kernel/trampfd.c b/arch/arm/kernel/trampfd.c
new file mode 100644
index 000000000000..50fc5706e85b
--- /dev/null
+++ b/arch/arm/kernel/trampfd.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - ARM support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static void set_reg(long *uregs, u32 name, u64 value)
+{
+	switch (name) {
+	case arm_r0:
+	case arm_r1:
+	case arm_r2:
+	case arm_r3:
+	case arm_r4:
+	case arm_r5:
+	case arm_r6:
+	case arm_r7:
+	case arm_r8:
+	case arm_r9:
+	case arm_r10:
+		uregs[name] = (__u64)value;
+		break;
+	case arm_ip:
+		ARM_ip = (__u64)value;
+		break;
+	case arm_pc:
+		ARM_pc = (__u64)value;
+		break;
+	default:
+		WARN(1, "%s: Illegal register name %d\n", __func__, name);
+		break;
+	}
+}
+
+static void set_regs(long *uregs, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+
+	for (; reg < reg_end; reg++)
+		set_reg(uregs, reg->name, reg->value);
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	bool			pc_set = false;
+
+	for (; reg < reg_end; reg++) {
+		if (reg->name >= arm_max || reg->reserved)
+			return false;
+		if (reg->name == arm_pc && reg->value)
+			pc_set = true;
+	}
+	return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+	struct trampfd_reg	*reg = tregs->regs;
+	struct trampfd_reg	*reg_end = reg + tregs->nregs;
+	struct trampfd_values	*allowed_pcs = trampfd->allowed_pcs;
+	u64			*allowed_values, pc_value = 0;
+	u32			nvalues, pc_name;
+	int			i;
+
+	if (!allowed_pcs)
+		return true;
+
+	pc_name = arm_pc;
+
+	/*
+	 * Find the PC register and its value. If the PC register has been
+	 * specified multiple times, only the last one counts.
+	 */
+	for (; reg < reg_end; reg++) {
+		if (reg->name == pc_name)
+			pc_value = reg->value;
+	}
+
+	allowed_values = allowed_pcs->values;
+	nvalues = allowed_pcs->nvalues;
+
+	for (i = 0; i < nvalues; i++) {
+		if (pc_value == allowed_values[i])
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(long *uregs, struct trampfd_stack *tstack)
+{
+	unsigned long	sp;
+
+	sp = ARM_sp - tstack->size - tstack->offset;
+	if (tstack->flags & TRAMPFD_SET_SP)
+		sp &= ~7;
+
+	if (!access_ok(sp, ARM_sp - sp))
+		return -EFAULT;
+
+	if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+		return -EFAULT;
+
+	if (tstack->flags & TRAMPFD_SET_SP)
+		ARM_sp = sp;
+	return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+			      struct vm_area_struct *vma,
+			      long *uregs)
+{
+	char			buf[TRAMPFD_MAX_STACK_SIZE];
+	struct trampfd_regs	*tregs;
+	struct trampfd_stack	*tstack = NULL;
+	unsigned long		addr;
+	size_t			size;
+	int			rc = 0;
+
+	mutex_lock(&trampfd->lock);
+
+	/*
+	 * Execution of the trampoline must start at the offset specified by
+	 * the kernel.
+	 */
+	addr = vma->vm_start + trampfd->map.ioffset;
+	if (addr != ARM_pc) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * At a minimum, the user PC register must be specified for a
+	 * user trampoline.
+	 */
+	tregs = trampfd->regs;
+	if (!tregs) {
+		rc = -EINVAL;
+		goto unlock;
+	}
+
+	/*
+	 * Set the register context for the trampoline.
+	 */
+	set_regs(uregs, tregs);
+
+	if (trampfd->stack) {
+		/*
+		 * Copy the stack context into a local buffer and push stack
+		 * data after dropping the lock.
+		 */
+		size = sizeof(*trampfd->stack) + trampfd->stack->size;
+		tstack = (struct trampfd_stack *) buf;
+		memcpy(tstack, trampfd->stack, size);
+	}
+unlock:
+	mutex_unlock(&trampfd->lock);
+
+	if (!rc && tstack) {
+		mmap_read_unlock(vma->vm_mm);
+		rc = push_data(uregs, tstack);
+		mmap_read_lock(vma->vm_mm);
+	}
+	return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+	struct trampfd		*trampfd;
+	unsigned long		*uregs = pt_regs->uregs;
+
+	if (!is_trampfd_vma(vma))
+		return false;
+	trampfd = vma->vm_private_data;
+
+	if (trampfd->type == TRAMPFD_USER)
+		return !trampfd_user_fault(trampfd, vma, uregs);
+	return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ---------------------------- Miscellaneous ---------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c6550eddfce1..21a81d19336b 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
 #include <linux/perf_event.h>
+#include <linux/trampfd.h>
 
 #include <asm/system_misc.h>
 #include <asm/system_info.h>
@@ -202,7 +203,8 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
 
 static vm_fault_t __kprobes
 __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-		unsigned int flags, struct task_struct *tsk)
+		unsigned int flags, struct task_struct *tsk,
+		struct pt_regs *regs)
 {
 	struct vm_area_struct *vma;
 	vm_fault_t fault;
@@ -220,6 +222,12 @@ __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
 	 */
 good_area:
 	if (access_error(fsr, vma)) {
+		/*
+		 * If it is an execute fault, it could be a trampoline
+		 * invocation.
+		 */
+		if ((fsr & FSR_LNX_PF) && trampfd_fault(vma, regs))
+			return 0;
 		fault = VM_FAULT_BADACCESS;
 		goto out;
 	}
@@ -290,7 +298,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 #endif
 	}
 
-	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
+	fault = __do_page_fault(mm, addr, fsr, flags, tsk, regs);
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first. We do not need to release the mmap_lock because
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d5cae5ffede0..88cf4c45069a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
 437	common	openat2				sys_openat2
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
+440	common	trampfd_create			sys_trampfd_create
-- 
2.17.1



* Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
@ 2020-07-28 14:50     ` Oleg Nesterov
  2020-07-28 14:58       ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2020-07-28 14:50 UTC (permalink / raw)
  To: madvenka
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, x86

On 07/28, madvenka@linux.microsoft.com wrote:
>
> +bool is_trampfd_vma(struct vm_area_struct *vma)
> +{
> +	struct file	*file = vma->vm_file;
> +
> +	if (!file)
> +		return false;
> +	return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);

Hmm, this looks obviously wrong or I am totally confused. A user can
create a file named "[trampfd]", mmap it, and fool trampfd_fault() ?

Why not

	return file->f_op == trampfd_fops;

?
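
To illustrate the concern, here is a userspace model of the two
checks. It is only a sketch: struct file and struct file_ops below are
simplified stand-ins for the kernel types, other_fops is a hypothetical
unrelated ops table, and whether the real trampfd_fops is a struct or a
pointer is not visible in this excerpt (the model takes its address).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins for struct file_operations and struct file. */
struct file_ops { int id; };
struct file {
	const char		*name;
	const struct file_ops	*f_op;
};

static const struct file_ops trampfd_fops = { 1 };
static const struct file_ops other_fops   = { 2 };	/* some other filesystem */

/* Name-based check: anything that controls the file name can pass it. */
static bool is_trampfd_by_name(const struct file *f)
{
	return f && !strcmp(f->name, "[trampfd]");
}

/* Identity-based check: only files created with the trampfd ops pass. */
static bool is_trampfd_by_ops(const struct file *f)
{
	return f && f->f_op == &trampfd_fops;
}
```

In this model, a file named "[trampfd]" that is backed by other_fops
passes the name check but fails the ops check.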

> +EXPORT_SYMBOL_GPL(is_trampfd_vma);

why is it exported?

Oleg.



* Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 14:50     ` Oleg Nesterov
@ 2020-07-28 14:58       ` Madhavan T. Venkataraman
  2020-07-28 16:06         ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 14:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, x86

Thanks. See inline.

On 7/28/20 9:50 AM, Oleg Nesterov wrote:
> On 07/28, madvenka@linux.microsoft.com wrote:
>> +bool is_trampfd_vma(struct vm_area_struct *vma)
>> +{
>> +	struct file	*file = vma->vm_file;
>> +
>> +	if (!file)
>> +		return false;
>> +	return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
> Hmm, this looks obviously wrong or I am totally confused. A user can
> create a file named "[trampfd]", mmap it, and fool trampfd_fault() ?
>
> Why not
>
> 	return file->f_op == trampfd_fops;

This is definitely the correct check. I will fix it.
>
> ?
>
>> +EXPORT_SYMBOL_GPL(is_trampfd_vma);
> why is it exported?

This is in common code and is called by arch code. Should I not export it?
I guess since the symbol is not used by any modules, I don't need to
export it. Please confirm and I will fix this.

Madhavan



* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (3 preceding siblings ...)
  2020-07-28 13:10   ` [PATCH v1 4/4] [RFC] arm/trampfd: " madvenka
@ 2020-07-28 15:13   ` David Laight
  2020-07-28 16:32     ` Madhavan T. Venkataraman
  2020-07-28 16:05   ` Casey Schaufler
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2020-07-28 15:13 UTC (permalink / raw)
  To: 'madvenka@linux.microsoft.com',
	kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

From:  madvenka@linux.microsoft.com
> Sent: 28 July 2020 14:11
...
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

Isn't the performance of this going to be horrid?

If you don't care that much about performance the fixup can
all be done in userspace within the fault signal handler.

Since whatever you do needs the application changed, why
not change the implementation of nested functions so that it
does not need on-stack executable trampolines?

I can think of other alternatives that need little more than
creating an array of 'push constant; jump trampoline' instructions
(all of which jump to the same place).

You might want something to create an executable page of such
instructions.
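
One way to picture that alternative is the C model below. It is a
sketch under stated assumptions, not machine code and not anything
proposed in this thread: each precompiled stub plays the role of one
'push constant; jump' entry, dispatch() is the common landing point,
and all names (slots, bind_slot(), add_base(), NSLOTS) are
hypothetical. No code is generated at run time; only a data table is
written.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4

typedef long (*target_fn)(void *closure, long arg);
typedef long (*stub_fn)(long arg);

/* Per-slot context: the real target and the data it closes over. */
struct slot {
	target_fn	target;
	void		*closure;
};
static struct slot slots[NSLOTS];

/* Common landing point: every stub arrives here with its constant. */
static long dispatch(size_t idx, long arg)
{
	assert(idx < NSLOTS && slots[idx].target);
	return slots[idx].target(slots[idx].closure, arg);
}

/* Each stub is fixed at build time and differs only in its constant. */
#define STUB(n) static long stub_##n(long arg) { return dispatch(n, arg); }
STUB(0) STUB(1) STUB(2) STUB(3)

static const stub_fn stubs[NSLOTS] = { stub_0, stub_1, stub_2, stub_3 };

/* Bind a target to the first free slot; returns its stub or NULL. */
static stub_fn bind_slot(target_fn target, void *closure)
{
	for (size_t i = 0; i < NSLOTS; i++) {
		if (!slots[i].target) {
			slots[i].target = target;
			slots[i].closure = closure;
			return stubs[i];
		}
	}
	return NULL;
}

/* Example target: adds the closed-over base to its argument. */
static long add_base(void *closure, long arg)
{
	return *(long *)closure + arg;
}
```

Binding add_base() to a slot with a base of 40 and calling the
returned stub with 2 yields 42. Note that the slot table is still
writable data, so it would need the same kind of protection as any
other trampoline context.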

	David




* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (4 preceding siblings ...)
  2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
@ 2020-07-28 16:05   ` Casey Schaufler
  2020-07-28 16:49     ` Madhavan T. Venkataraman
  2020-07-28 17:05     ` James Morris
  2020-07-28 17:31   ` Andy Lutomirski
  2020-07-31 18:09   ` Mark Rutland
  7 siblings, 2 replies; 49+ messages in thread
From: Casey Schaufler @ 2020-07-28 16:05 UTC (permalink / raw)
  To: madvenka, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86

On 7/28/2020 6:10 AM, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ------------
>
> Trampolines are used in many different user applications. Trampoline
> code is often generated at runtime. Trampoline code can also just be a
> pre-defined sequence of machine instructions in a data buffer.
>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
>
> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
>
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.
>
> LSMs (such as the IPE proposal [1]) may allow only properly signed object
> files to be mapped with execute permissions. This will prevent trampoline
> files from being mapped. Again, exceptions have to be made for genuine
> applications.
>
> We need a way to execute trampolines without making security exceptions
> where possible and to reduce the attack surface even further.
>
> Examples of trampolines
> -----------------------
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
> 	- Python
> 	- Java
> 	- Javascript
> 	- Ruby FFI
> 	- Lisp
> 	- Objective C
>
> GCC nested functions:
>
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
>
> Currently available solution
> ----------------------------
>
> One solution that has been proposed to allow trampolines to be executed
> without making security exceptions is Trampoline Emulation. See:
>
> https://pax.grsecurity.net/docs/emutramp.txt
>
> In this solution, the kernel recognizes certain sequences of instructions
> as "well-known" trampolines. When such a trampoline is executed, a page
> fault happens because the trampoline page does not have execute permission.
> The kernel recognizes the trampoline and emulates it. Basically, the
> kernel does the work of the trampoline on behalf of the application.

What prevents a malicious process from using the "well-known" trampoline
for its own purposes? I expect it is obvious, but I'm not seeing it. Old
eyes, I suppose.

> Here, the attack surface is the buffer that contains the trampoline.
> The attack surface is narrower than before. A hacker may still be able to
> modify what gets loaded in the registers or modify the target PC to point
> to arbitrary locations.
>
> Currently, the emulated trampolines are the ones used in libffi and GCC
> nested functions. To my knowledge, only X86 is supported at this time.
>
> As noted in emutramp.txt, this is not a generic solution. For every new
> trampoline that needs to be supported, new instruction sequences need to
> be recognized by the kernel and emulated. And this has to be done for
> every architecture that needs to be supported.
>
> emutramp.txt notes the following:
>
> "... the real solution is not in emulation but by designing a kernel API
> for runtime code generation and modifying userland to make use of it."
>
> Trampoline File Descriptor (trampfd)
> --------------------------
>
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.
> The API is described in patch 1/4 of this patchset. I provide a
> summary here:
>
> Trampolines commonly execute the following sequence:
>
> 	- Load some values in some registers and/or
> 	- Push some values on the stack
> 	- Jump to a target PC
>
> libffi and GCC nested function trampolines fit into this model.
>
> Using the kernel API, applications and libraries can:
>
> 	- Create a trampoline object
> 	- Associate a register context with the trampoline (including
> 	  a target PC)
> 	- Associate a stack context with the trampoline
> 	- Map the trampoline into a process address space
> 	- Execute the trampoline by executing at the trampoline address
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.
>
> As for the target PC, trampfd implements a measure called the
> "Allowed PCs" context (see Advantages) to prevent a hacker from making
> the target PC point to arbitrary locations. So, the attack surface is
> narrower than Trampoline Emulation.
>
> Advantages of the Trampoline File Descriptor approach
> -----------------------------------------------------
>
> - trampfd is customizable. The user can specify any combination of
>   allowed register name-value pairs in the register context and the kernel
>   will set it up accordingly. This allows different user trampolines to be
>   converted to use trampfd.
>
> - trampfd allows a stack context to be set up so that trampolines that
>   need to push values on the user stack can do that.
>
> - The initial work is targeted for X86 and ARM. But the implementation
>   leverages small portions of existing signal delivery code. Specifically,
>   it uses pt_regs for setting up user registers and copy_to_user()
>   to push values on the stack. So, this can be very easily ported to other
>   architectures.
>
> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.
>
> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.
>
> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
>
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.
>
> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
>
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.
>
> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.
>
> - Trampfd can be used for other purposes to extend the kernel's
>   functionality.
>
> libffi
> ------
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.txt
>
> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.
>
> Work that is pending
> --------------------
>
> - I am working on implementing an SELinux setting called "exectramp"
>   similar to "execmem" to allow the use of trampfd on a per application
>   basis.

You could make a separate LSM to do these checks instead of limiting
it to SELinux. Your use case, your call, of course.

>
> - I have a comprehensive test program to test the kernel API. I am
>   working on adding it to selftests.
>
> References
> ----------
>
> [1] https://microsoft.github.io/ipe/
> ---
> Madhavan T. Venkataraman (4):
>   fs/trampfd: Implement the trampoline file descriptor API
>   x86/trampfd: Support for the trampoline file descriptor
>   arm64/trampfd: Support for the trampoline file descriptor
>   arm/trampfd: Support for the trampoline file descriptor
>
>  arch/arm/include/uapi/asm/ptrace.h     |  20 ++
>  arch/arm/kernel/Makefile               |   1 +
>  arch/arm/kernel/trampfd.c              | 214 +++++++++++++++++
>  arch/arm/mm/fault.c                    |  12 +-
>  arch/arm/tools/syscall.tbl             |   1 +
>  arch/arm64/include/asm/ptrace.h        |   9 +
>  arch/arm64/include/asm/unistd.h        |   2 +-
>  arch/arm64/include/asm/unistd32.h      |   2 +
>  arch/arm64/include/uapi/asm/ptrace.h   |  57 +++++
>  arch/arm64/kernel/Makefile             |   2 +
>  arch/arm64/kernel/trampfd.c            | 278 ++++++++++++++++++++++
>  arch/arm64/mm/fault.c                  |  15 +-
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  arch/x86/include/uapi/asm/ptrace.h     |  38 +++
>  arch/x86/kernel/Makefile               |   2 +
>  arch/x86/kernel/trampfd.c              | 313 +++++++++++++++++++++++++
>  arch/x86/mm/fault.c                    |  11 +
>  fs/Makefile                            |   1 +
>  fs/trampfd/Makefile                    |   6 +
>  fs/trampfd/trampfd_data.c              |  43 ++++
>  fs/trampfd/trampfd_fops.c              | 131 +++++++++++
>  fs/trampfd/trampfd_map.c               |  78 ++++++
>  fs/trampfd/trampfd_pcs.c               |  95 ++++++++
>  fs/trampfd/trampfd_regs.c              | 137 +++++++++++
>  fs/trampfd/trampfd_stack.c             | 131 +++++++++++
>  fs/trampfd/trampfd_stubs.c             |  41 ++++
>  fs/trampfd/trampfd_syscall.c           |  92 ++++++++
>  include/linux/syscalls.h               |   3 +
>  include/linux/trampfd.h                |  82 +++++++
>  include/uapi/asm-generic/unistd.h      |   4 +-
>  include/uapi/linux/trampfd.h           | 171 ++++++++++++++
>  init/Kconfig                           |   8 +
>  kernel/sys_ni.c                        |   3 +
>  34 files changed, 1998 insertions(+), 7 deletions(-)
>  create mode 100644 arch/arm/kernel/trampfd.c
>  create mode 100644 arch/arm64/kernel/trampfd.c
>  create mode 100644 arch/x86/kernel/trampfd.c
>  create mode 100644 fs/trampfd/Makefile
>  create mode 100644 fs/trampfd/trampfd_data.c
>  create mode 100644 fs/trampfd/trampfd_fops.c
>  create mode 100644 fs/trampfd/trampfd_map.c
>  create mode 100644 fs/trampfd/trampfd_pcs.c
>  create mode 100644 fs/trampfd/trampfd_regs.c
>  create mode 100644 fs/trampfd/trampfd_stack.c
>  create mode 100644 fs/trampfd/trampfd_stubs.c
>  create mode 100644 fs/trampfd/trampfd_syscall.c
>  create mode 100644 include/linux/trampfd.h
>  create mode 100644 include/uapi/linux/trampfd.h
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
  2020-07-28 14:58       ` Madhavan T. Venkataraman
@ 2020-07-28 16:06         ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2020-07-28 16:06 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, x86

On 07/28, Madhavan T. Venkataraman wrote:
>
> I guess since the symbol is not used by any modules, I don't need to
> export it.

Yes,

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
@ 2020-07-28 16:32     ` Madhavan T. Venkataraman
  2020-07-28 17:16       ` Andy Lutomirski
  0 siblings, 1 reply; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 16:32 UTC (permalink / raw)
  To: David Laight, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86

Thanks. See inline..

On 7/28/20 10:13 AM, David Laight wrote:
> From:  madvenka@linux.microsoft.com
>> Sent: 28 July 2020 14:11
> ...
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> Isn't the performance of this going to be horrid?

It takes about the same amount of time as getpid(). So, it is
one quick trip into the kernel. I expect that applications will
typically not care about this extra overhead as long as
they are able to run.

But I agree that if there is an application that cannot tolerate
this extra overhead, then it is an issue. See below for further
discussion.

In the libffi changes I have included in the cover letter, I have
done it in such a way that trampfd is chosen when current
security settings don't allow other methods such as
loading trampoline code into a file and mapping it. In this
case, the application can at least run with trampfd.

>
> If you don't care that much about performance the fixup can
> all be done in userspace within the fault signal handler.

I do care about performance.

This is a framework to address trampolines. In this initial
work, I want to establish one basic way for things to work.
In the future, trampfd can be enhanced for performance.
For instance, it is easy for an architecture to generate
the exact instructions required to load specified registers,
push specified values on the stack and jump to a target
PC. The kernel can map a page with the generated code
with execute permissions. In this case, the performance
issue goes away.
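
As a rough illustration of "generate the exact instructions", here is a
minimal sketch for x86-64 that emits "load a context value, load the
target PC, jump" into a buffer. The helper name emit_tramp_x86_64(), the
constant TRAMP_X86_64_LEN and the choice of %r10/%r11 as scratch
registers are assumptions made for this example; none of this is code
from the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRAMP_X86_64_LEN 23	/* 10 + 10 + 3 bytes, see below */

/* Store a 64-bit value in little-endian byte order. */
static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Hypothetical helper (not part of the patchset). Emits:
 *
 *	movabs $ctx,    %r10
 *	movabs $target, %r11
 *	jmp    *%r11
 *
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t emit_tramp_x86_64(uint8_t *buf, size_t buflen,
			 uint64_t ctx, uint64_t target)
{
	size_t n = 0;

	if (buflen < TRAMP_X86_64_LEN)
		return 0;

	buf[n++] = 0x49; buf[n++] = 0xba;	/* REX.W+B, mov imm64 -> %r10 */
	put_le64(buf + n, ctx);
	n += 8;
	buf[n++] = 0x49; buf[n++] = 0xbb;	/* REX.W+B, mov imm64 -> %r11 */
	put_le64(buf + n, target);
	n += 8;
	buf[n++] = 0x41; buf[n++] = 0xff; buf[n++] = 0xe3;	/* jmp *%r11 */

	assert(n == TRAMP_X86_64_LEN);
	return n;
}
```

In the scheme described above, the kernel rather than the application
would fill the page, so the application never needs a mapping that is
both writable and executable.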

> Since whatever you do needs the application changed why
> not change the implementation of nested functions to not
> need on-stack executable trampolines.

I kinda agree with your suggestion.

But it is up to the GCC folks to change its implementation.
I am trying to provide a way for their existing implementation
to work in a more secure way.

> I can think of other alternatives that don't need much more
> than an array of 'push constant; jump trampoline' instructions
> be created (all jump to the same place).
>
> You might want something to create an executable page of such
> instructions.

Agreed. And that can be done within this framework as
I have mentioned above.

But it is not just this trampoline type that I have implemented
in this patchset. In the future, other types can be implemented
and other contexts can be defined. Basically, the approach is
for the user to supply a recipe to the kernel and leave it up to
the kernel to do it in the best way possible. I am hoping that
other forms of dynamic code can be addressed in the future
using the same framework.

*Purely as a hypothetical example*, a user can supply
instructions in a language such as BPF that the kernel
understands and have the kernel arrange for that to
be executed in user context.

Madhavan



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 16:05   ` Casey Schaufler
@ 2020-07-28 16:49     ` Madhavan T. Venkataraman
  2020-07-28 17:05     ` James Morris
  1 sibling, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 16:49 UTC (permalink / raw)
  To: Casey Schaufler, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86

Thanks.

On 7/28/20 11:05 AM, Casey Schaufler wrote:
>> In this solution, the kernel recognizes certain sequences of instructions
>> as "well-known" trampolines. When such a trampoline is executed, a page
>> fault happens because the trampoline page does not have execute permission.
>> The kernel recognizes the trampoline and emulates it. Basically, the
>> kernel does the work of the trampoline on behalf of the application.
> What prevents a malicious process from using the "well-known" trampoline
> to its own purposes? I expect it is obvious, but I'm not seeing it. Old
> eyes, I suppose.

You are quite right. As I note below, the attack surface is the
buffer that contains the trampoline code. Since the kernel does
check the instruction sequence, the sequence cannot be
changed by a hacker. But the hacker can presumably change
the register values and redirect the PC to his desired location.

The assumption with trampoline emulation is that the
system will have security settings that will prevent pages from
having both write and execute permissions. So, a hacker
cannot load his own code in a page and redirect the PC to
it and execute his own code. But he can probably set the
PC to point to arbitrary locations. For instance, jump to
the middle of a C library function.
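
The "Allowed PCs" context described in the cover letter is aimed at this
last case. A minimal sketch of the idea, assuming the allowed list is a
plain array of addresses: the function name pc_is_allowed() and the array
layout are made up for illustration and are not the interface defined in
patch 1/4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check (not the patchset's code): accept a target PC only
 * if it appears in a list that was fixed ahead of time, e.g. a read-only
 * array of ABI handlers built into the library.
 */
bool pc_is_allowed(const uint64_t *allowed, size_t n, uint64_t pc)
{
	assert(allowed != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (allowed[i] == pc)
			return true;
	}
	return false;	/* e.g. an address in the middle of a libc function */
}
```

With such a check in place, a tampered register context can still pick
among the listed handlers, but it cannot point the PC anywhere else.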
>
>> Here, the attack surface is the buffer that contains the trampoline.
>> The attack surface is narrower than before. A hacker may still be able to
>> modify what gets loaded in the registers or modify the target PC to point
>> to arbitrary locations.
...
>> Work that is pending
>> --------------------
>>
>> - I am working on implementing an SELinux setting called "exectramp"
>>   similar to "execmem" to allow the use of trampfd on a per application
>>   basis.
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.

OK. I will research this.

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 16:05   ` Casey Schaufler
  2020-07-28 16:49     ` Madhavan T. Venkataraman
@ 2020-07-28 17:05     ` James Morris
  2020-07-28 17:08       ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 49+ messages in thread
From: James Morris @ 2020-07-28 17:05 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: madvenka, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86

On Tue, 28 Jul 2020, Casey Schaufler wrote:

> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.

It's not limited to SELinux. This is hooked via the LSM API and 
implementable by any LSM (similar to execmem, execstack etc.)


-- 
James Morris
<jmorris@namei.org>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:05     ` James Morris
@ 2020-07-28 17:08       ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 17:08 UTC (permalink / raw)
  To: James Morris, Casey Schaufler
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86



On 7/28/20 12:05 PM, James Morris wrote:
> On Tue, 28 Jul 2020, Casey Schaufler wrote:
>
>> You could make a separate LSM to do these checks instead of limiting
>> it to SELinux. Your use case, your call, of course.
> It's not limited to SELinux. This is hooked via the LSM API and 
> implementable by any LSM (similar to execmem, execstack etc.)

Yes. I have an implementation that I am testing right now that
defines the hook for exectramp and implements it for
SELinux. That is why I mentioned SELinux.

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 16:32     ` Madhavan T. Venkataraman
@ 2020-07-28 17:16       ` Andy Lutomirski
  2020-07-28 17:39         ` Madhavan T. Venkataraman
  2020-07-28 18:52         ` Madhavan T. Venkataraman
  0 siblings, 2 replies; 49+ messages in thread
From: Andy Lutomirski @ 2020-07-28 17:16 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: David Laight, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86

On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
> > From:  madvenka@linux.microsoft.com
> >> Sent: 28 July 2020 14:11
> > ...
> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> > Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.

What did you test this on?  A page fault on any modern x86_64 system
is much, much, much, much slower than a syscall.
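
For reference, the syscall side of such a comparison can be measured
with something like the sketch below. It only times raw getpid() system
calls; the helper names are made up for this example, and the trampfd
side would have to be timed separately around a call through the
trampoline address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average wall-clock cost, in nanoseconds, of one getpid() system call.
 * syscall(SYS_getpid) is used so that every iteration really enters the
 * kernel, whatever the C library does for getpid().
 */
double getpid_ns_per_call(unsigned long iters)
{
	uint64_t start, end;

	assert(iters > 0);

	start = now_ns();
	for (unsigned long i = 0; i < iters; i++)
		syscall(SYS_getpid);
	end = now_ns();

	return (double)(end - start) / (double)iters;
}
```

Numbers from a guest VM and from bare metal can differ a lot, so both
sides of the comparison should be taken on the same machine.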

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (5 preceding siblings ...)
  2020-07-28 16:05   ` Casey Schaufler
@ 2020-07-28 17:31   ` Andy Lutomirski
  2020-07-28 19:01     ` Madhavan T. Venkataraman
                       ` (5 more replies)
  2020-07-31 18:09   ` Mark Rutland
  7 siblings, 6 replies; 49+ messages in thread
From: Andy Lutomirski @ 2020-07-28 17:31 UTC (permalink / raw)
  To: madvenka
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML

> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>

> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.

This is quite clever, but now I’m wondering just how much kernel help
is really needed. In your series, the trampoline is a non-executable
page.  I can think of at least three alternative approaches, and I'd
like to know the pros and cons.

1. Entirely userspace: a return trampoline would be something like:

1:
pushq %rax
pushq %rbx
pushq %rcx
...
pushq %r15
movq %rsp, %rdi # pointer to saved regs
leaq 1b(%rip), %rsi # pointer to the trampoline itself
callq trampoline_handler # see below

You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed.  The
'callq trampoline_handler' part would need to be a bit clever to make
it continue to work despite this remapping.  This will be *much*
faster than trampfd. How much of your use case would it cover?  For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.

2. Use existing kernel functionality.  Raise a signal, modify the
state, and return from the signal.  This is very flexible and may not
be all that much slower than trampfd.

3. Use a syscall.  Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:

pushq %rax
movq $__NR_magic_trampoline, %rax
syscall

with some adjustment if the stack slot you're clobbering is important.
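
Approach 2 can be sketched entirely in userspace. The example below is
an illustration under several assumptions, not a proposal: it supports a
single trampoline, is not thread-safe, covers x86_64 and arm64 only, and
all names (struct sig_tramp, sig_tramp_init(), demo_add_one()) are
invented here. The trampoline is a PROT_NONE page; calling it raises
SIGSEGV, and the handler rewrites the saved PC and first argument
register before returning from the signal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

struct sig_tramp {
	void *addr;		/* PROT_NONE page; calling it faults */
	long (*target)(long);	/* where execution should continue */
	long arg;		/* value for the first argument register */
};

static struct sig_tramp *active_tramp;	/* single trampoline only */

static void tramp_fault(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;
	struct sig_tramp *t = active_tramp;

	(void)sig;
	if (t == NULL || si->si_addr != t->addr)
		_exit(127);	/* a genuine crash, not a trampoline call */

#if defined(__x86_64__)
	uc->uc_mcontext.gregs[REG_RDI] = (greg_t)t->arg;
	uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)t->target;
#elif defined(__aarch64__)
	uc->uc_mcontext.regs[0] = (uint64_t)t->arg;
	uc->uc_mcontext.pc = (uint64_t)(uintptr_t)t->target;
#else
#error "this sketch covers x86_64 and arm64 only"
#endif
	/* Returning from the handler resumes at the target with the new register. */
}

int sig_tramp_init(struct sig_tramp *t, long (*target)(long), long arg)
{
	struct sigaction sa;

	t->addr = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (t->addr == MAP_FAILED)
		return -1;
	t->target = target;
	t->arg = arg;
	active_tramp = t;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = tramp_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}

/* Demo target: ignores nothing, simply returns its argument plus one. */
long demo_add_one(long x)
{
	return x + 1;
}
```

A caller casts t.addr to long (*)(long) and calls it. The return address
set up by the call is left untouched, so the target returns straight to
the caller. Each invocation pays for a fault plus signal delivery and
sigreturn, which is the cost being weighed against trampfd here.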


Also, will using trampfd cause issues with various unwinders?  I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.

All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code.  This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc.  A cleverly designed JIT interface could function without
serialization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games.  At its very simplest, this could be:

void *jit_create_code(const void *source, size_t len);

and the result would be a new anonymous mapping that contains exactly
the code requested.  There could also be:

int jittfd_create(...);

that does something similar but creates a memfd.  A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region.  On x86, an appendable JIT region would
start filled with 0xCC, and I bet there's a way to materialize new
code into a previously 0xcc-filled virtual page without any
synchronization.  One approach would be to start with:

<some code>
0xcc
0xcc
...
0xcc

and to create a whole new page like:

<some code>
<some more code>
0xcc
...
0xcc

so that the only difference is that some code changed to some more
code.  Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs.  This may not work if <some more code> spans a page
boundary.  The #BP fixup would zap the TLB and retry.  Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that cross I$ fetch
boundaries.  I'm not sure to what extent I$ snooping helps.
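
For comparison, this is roughly what a process has to do today to get
the effect of the simplest form, jit_create_code(). It is a userspace
stand-in under assumptions, not the proposed kernel interface: the name
jit_create_code_sketch() is invented, and it relies on a writable
mapping later being made executable, which is exactly what an LSM policy
may refuse.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Userspace approximation of the hypothetical jit_create_code():
 * returns a new read+execute anonymous mapping holding a copy of
 * source, or NULL on failure. Unused bytes are filled with 0xcc
 * (int3 on x86).
 */
void *jit_create_code_sketch(const void *source, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t maplen;
	void *p;

	if (source == NULL || len == 0)
		return NULL;
	maplen = (len + page - 1) / page * page;	/* round up to pages */

	p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memset(p, 0xcc, maplen);
	memcpy(p, source, len);

	/* Drop write permission before the code becomes executable. */
	if (mprotect(p, maplen, PROT_READ | PROT_EXEC) != 0) {
		munmap(p, maplen);
		return NULL;
	}

	/* Needed on architectures without coherent I/D caches; no-op on x86. */
	__builtin___clear_cache((char *)p, (char *)p + len);
	return p;
}
```

A kernel-provided version would perform the copy itself, so the calling
process would never hold a writable view of the code at all.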

--Andy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:16       ` Andy Lutomirski
@ 2020-07-28 17:39         ` Madhavan T. Venkataraman
  2020-07-29  5:16           ` Andy Lutomirski
  2020-07-28 18:52         ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 17:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Laight, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86





On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> Thanks. See inline..
>>
>> On 7/28/20 10:13 AM, David Laight wrote:
>>> From:  madvenka@linux.microsoft.com
>>>> Sent: 28 July 2020 14:11
>>> ...
>>>> The kernel creates the trampoline mapping without any permissions. When
>>>> the trampoline is executed by user code, a page fault happens and the
>>>> kernel gets control. The kernel recognizes that this is a trampoline
>>>> invocation. It sets up the user registers based on the specified
>>>> register context, and/or pushes values on the user stack based on the
>>>> specified stack context, and sets the user PC to the requested target
>>>> PC. When the kernel returns, execution continues at the target PC.
>>>> So, the kernel does the work of the trampoline on behalf of the
>>>> application.
>>> Isn't the performance of this going to be horrid?
>> It takes about the same amount of time as getpid(). So, it is
>> one quick trip into the kernel. I expect that applications will
>> typically not care about this extra overhead as long as
>> they are able to run.
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.

I tested it on a KVM guest running Ubuntu. So, when you say
that a page fault is much slower, do you mean a regular page
fault that is handled through the VM layer? Here is the relevant code
in do_user_addr_fault():

            if (unlikely(access_error(hw_error_code, vma))) {
                    /*
                     * If it is a user execute fault, it could be a trampoline
                     * invocation.
                     */
                    if ((hw_error_code & tflags) == tflags &&
                        trampfd_fault(vma, regs)) {
                            up_read(&mm->mmap_sem);
                            return;
                    }
                    bad_area_access_error(regs, hw_error_code, address, vma);
                    return;
            }

            /*
             * If for any reason at all we couldn't handle the fault,
             * make sure we exit gracefully rather than endlessly redo
             * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
             * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
             *
             * Note that handle_userfault() may also release and reacquire mmap_sem
             * (and not return with VM_FAULT_RETRY), when returning to userland to
             * repeat the page fault later with a VM_FAULT_NOPAGE retval
             * (potentially after handling any pending signal during the return to
             * userland). The return to userland is identified whenever
             * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
             */
            fault = handle_mm_fault(vma, address, flags);

trampfd faults are instruction faults that go through a different code
path than the one that calls handle_mm_fault().

Could you clarify?

Thanks.

Madhavan



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:16       ` Andy Lutomirski
  2020-07-28 17:39         ` Madhavan T. Venkataraman
@ 2020-07-28 18:52         ` Madhavan T. Venkataraman
  2020-07-29  8:36           ` David Laight
  1 sibling, 1 reply; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 18:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Laight, kernel-hardening, linux-api, linux-arm-kernel,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86



On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> Thanks. See inline..
>>
>> On 7/28/20 10:13 AM, David Laight wrote:
>>> From:  madvenka@linux.microsoft.com
>>>> Sent: 28 July 2020 14:11
>>> ...
>>>> The kernel creates the trampoline mapping without any permissions. When
>>>> the trampoline is executed by user code, a page fault happens and the
>>>> kernel gets control. The kernel recognizes that this is a trampoline
>>>> invocation. It sets up the user registers based on the specified
>>>> register context, and/or pushes values on the user stack based on the
>>>> specified stack context, and sets the user PC to the requested target
>>>> PC. When the kernel returns, execution continues at the target PC.
>>>> So, the kernel does the work of the trampoline on behalf of the
>>>> application.
>>> Isn't the performance of this going to be horrid?
>> It takes about the same amount of time as getpid(). So, it is
>> one quick trip into the kernel. I expect that applications will
>> typically not care about this extra overhead as long as
>> they are able to run.
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.

I sent a response to this. But the mail was returned to me.
I am resending.

I tested it on a KVM guest running Ubuntu. So, when you say that a
page fault is much slower, do you mean a regular page fault that is handled
through the VM layer? Here is the relevant code in do_user_addr_fault():

        if (unlikely(access_error(hw_error_code, vma))) {
                /*
                 * If it is a user execute fault, it could be a trampoline
                 * invocation.
                 */
                if ((hw_error_code & tflags) == tflags &&
                    trampfd_fault(vma, regs)) {
                        up_read(&mm->mmap_sem);
                        return;
                }
                bad_area_access_error(regs, hw_error_code, address, vma);
                return;
        }
        ...
        fault = handle_mm_fault(vma, address, flags);

trampfd faults are instruction faults that go through a different code path than
the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
is time consuming. Could you clarify?

Thanks.

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
@ 2020-07-28 19:01     ` Madhavan T. Venkataraman
  2020-07-29 13:29     ` Florian Weimer
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-28 19:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML

I am working on a response to this. I will send it soon.

Thanks.

Madhavan

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least three alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.
>
> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq $__NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.
>
>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.
>
> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> serialization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page without any
> synchronization.  One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if <some more code> spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that cross I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.
>
> --Andy
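
For reference, the jit_create_code() idea quoted above can be approximated
in userspace today wherever the security policy permits it. The following is
only a minimal sketch under stated assumptions: jit_create_code_user() is a
hypothetical name, it uses mmap()/mprotect() rather than any new kernel
interface, and it pads the mapping with 0xCC as the quoted text suggests for
x86. It does not avoid any IPIs; it only shows the shape of the interface.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* MAP_ANONYMOUS on older libcs */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical userspace stand-in for the proposed jit_create_code().
 * Returns a read+execute anonymous mapping whose first len bytes are a
 * copy of source and whose remaining bytes are 0xCC (int3 on x86), or
 * NULL on failure. If mapped_len is not NULL, it receives the
 * page-rounded size of the mapping.
 */
static void *jit_create_code_user(const void *source, size_t len,
				  size_t *mapped_len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t psz, total;
	void *p;

	if (page <= 0 || source == NULL || len == 0)
		return NULL;
	psz = (size_t)page;
	if (len > (size_t)-1 - psz)	/* rounding would overflow */
		return NULL;
	total = (len + psz - 1) / psz * psz;

	p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memset(p, 0xCC, total);		/* pad with breakpoints */
	memcpy(p, source, len);

	/* The mapping is never writable and executable at the same time. */
	if (mprotect(p, total, PROT_READ | PROT_EXEC) != 0) {
		munmap(p, total);
		return NULL;
	}
	/* Non-x86 targets would also need an I-cache flush before use. */
	if (mapped_len)
		*mapped_len = total;
	return p;
}
```

An LSM that denies the mprotect() to PROT_EXEC makes this return NULL, which
is exactly the situation the kernel-side interface is meant to handle.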



* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:39         ` Madhavan T. Venkataraman
@ 2020-07-29  5:16           ` Andy Lutomirski
  0 siblings, 0 replies; 49+ messages in thread
From: Andy Lutomirski @ 2020-07-29  5:16 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, David Laight, kernel-hardening, linux-api,
	linux-arm-kernel, linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, oleg, x86

On Tue, Jul 28, 2020 at 10:40 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
>
>
> On 7/28/20 12:16 PM, Andy Lutomirski wrote:
>
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
>
> From:  madvenka@linux.microsoft.com
>
> Sent: 28 July 2020 14:11
>
> ...
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.
>
> What did you test this on?  A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.
>
>
> I tested it on a KVM guest running Ubuntu. So, when you say
> that a page fault is much slower, do you mean a regular page
> fault that is handled through the VM layer? Here is the relevant code
> in do_user_addr_fault():

I mean that x86 CPUs have reasonably fast SYSCALL and SYSRET instructions
(the former is used for 64-bit system calls on Linux and the latter is
mostly used to return from system calls), but hardware page fault
delivery and IRET (used to return from page faults) are very slow.
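
The syscall side of this comparison is easy to measure. Below is a rough
sketch, not a rigorous benchmark: ns_per_syscall() is a hypothetical helper
that times a loop of getppid() calls, chosen here on the assumption that the
C library does not cache its result. Measuring the page-fault side would need
a trampfd mapping (or a SIGSEGV handler) and is not attempted here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <time.h>
#include <unistd.h>

/*
 * Average wall-clock cost, in nanoseconds, of one getppid() call over
 * iters iterations. Returns a negative value on error.
 */
static double ns_per_syscall(long iters)
{
	struct timespec t0, t1;
	volatile pid_t sink;	/* keeps the calls from being optimized out */
	long i;

	if (iters <= 0)
		return -1.0;
	if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
		return -1.0;
	for (i = 0; i < iters; i++)
		sink = getppid();
	if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
		return -1.0;
	(void)sink;

	return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		(double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

Results from a KVM guest should be read with care, since virtualization
changes the relative cost of syscalls and faults.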


* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 18:52         ` Madhavan T. Venkataraman
@ 2020-07-29  8:36           ` David Laight
  2020-07-29 17:55             ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2020-07-29  8:36 UTC (permalink / raw)
  To: 'Madhavan T. Venkataraman', Andy Lutomirski
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

From: Madhavan T. Venkataraman
> Sent: 28 July 2020 19:52
...
> trampfd faults are instruction faults that go through a different code path than
> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
> is time consuming. Could you clarify?

Given that the expectation is a few instructions in userspace
(eg to pick up the original arguments for a nested call)
the (probable) thousands of clocks taken by entering the
kernel (especially with page table separation) is a massive
delta.

If entering the kernel were cheap no one would have added
the vDSO functions for getting the time of day.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
  2020-07-28 19:01     ` Madhavan T. Venkataraman
@ 2020-07-29 13:29     ` Florian Weimer
  2020-07-30 13:09     ` David Laight
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 49+ messages in thread
From: Florian Weimer @ 2020-07-29 13:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: madvenka, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

* Andy Lutomirski:

> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.

libffi does something like this for iOS, I believe.

The only thing you really need is a PC-relative indirect call, with the
target address loaded from a different page.  The trampoline handler can
do all the rest because it can identify the trampoline from the stack.
Having a closure parameter loaded into a register will speed things up,
of course.

I still hope to transition libffi to this model for most Linux targets.
It really simplifies things because you don't have to deal with cache
flushes (on both the data and code aliases for SELinux support).

But the key observation is that efficient trampolines do not need
run-time code generation at all because their code is so regular.

Thanks,
Florian



* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-29  8:36           ` David Laight
@ 2020-07-29 17:55             ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-29 17:55 UTC (permalink / raw)
  To: David Laight, Andy Lutomirski
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86



On 7/29/20 3:36 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 28 July 2020 19:52
> ...
>> trampfd faults are instruction faults that go through a different code path than
>> the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
>> is time consuming. Could you clarify?
> Given that the expectation is a few instructions in userspace
> (eg to pick up the original arguments for a nested call)
> the (probable) thousands of clocks taken by entering the
> kernel (especially with page table separation) is a massive
> delta.
>
> If entering the kernel were cheap no one would have added
> the vDSO functions for getting the time of day.

I hear you. BTW, I did not say that the overhead was trivial.
I only said that in most cases, applications may not mind that
extra overhead.

However, since multiple people have raised that as an issue,
I will address it. I mentioned before that the kernel can actually
supply the code page that sets the context and jumps to
a PC and map it so the performance issue can be addressed.
I was planning to do that as a future enhancement.

If there is a consensus that I must address it immediately, I
could do that.

I will continue this discussion in my reply to Andy's email. Let
us pick it up from there.

Thanks.

Madhavan
>
> 	David
>



* Re: [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
  2020-07-28 13:10   ` [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor madvenka
@ 2020-07-30  9:06     ` Greg KH
  2020-07-30 14:25       ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2020-07-30  9:06 UTC (permalink / raw)
  To: madvenka
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

On Tue, Jul 28, 2020 at 08:10:48AM -0500, madvenka@linux.microsoft.com wrote:
> +EXPORT_SYMBOL_GPL(trampfd_valid_regs);

Why are all of these exported?  I don't see a module user in this
series, or did I miss it somehow?

EXPORT_SYMBOL* is only needed for symbols to be used by modules, not by
code that is built into the kernel.

thanks,

greg k-h


* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
  2020-07-28 19:01     ` Madhavan T. Venkataraman
  2020-07-29 13:29     ` Florian Weimer
@ 2020-07-30 13:09     ` David Laight
  2020-08-02 11:56       ` Pavel Machek
  2020-07-30 14:24     ` Madhavan T. Venkataraman
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2020-07-30 13:09 UTC (permalink / raw)
  To: 'Andy Lutomirski', madvenka
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML

> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
> 
> 1. Entirely userspace: a return trampoline would be something like:
> 
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below

For nested calls (where the trampoline needs to pass the
original stack frame to the nested function) I think you
just need a page full of:
	mov	$0, scratch_reg; jmp trampoline_handler
	mov	$1, scratch_reg; jmp trampoline_handler
You need an unused register, on x86-64 I think both
r10 and r11 are available.
On i386 I think eax can be used.
It might even be that the first argument register is
available - if that is used to pass in the stack frame.

The trampoline_handler then uses the passed in value
to index an array of stack frame and function pointers
and jumps to the real function.
You need to hold everything in __thread data.
And maybe be able to allocate an extra page for deeply
nested code paths (eg recursive nested functions).
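
The handler side of this scheme can be sketched in C. This is only an
illustration under assumptions: TRAMP_SLOTS, struct tramp_slot, tramp_bind(),
tramp_release() and add_captured() are hypothetical names, and the index that
a real stub would pass in a scratch register is passed as an ordinary
argument so that the sketch stays portable.

```c
#include <stddef.h>

#define TRAMP_SLOTS 64	/* assumed: one slot per stub in the code page */

typedef long (*nested_fn)(void *frame, long arg);

struct tramp_slot {
	void *frame;	/* stack frame of the enclosing function */
	nested_fn fn;	/* the real nested function */
};

/* Per-thread table, as suggested ("hold everything in __thread data"). */
static __thread struct tramp_slot tramp_table[TRAMP_SLOTS];

/* Claim a free slot; returns its index, or -1 if the table is full. */
static int tramp_bind(void *frame, nested_fn fn)
{
	int i;

	if (fn == NULL)
		return -1;
	for (i = 0; i < TRAMP_SLOTS; i++) {
		if (tramp_table[i].fn == NULL) {
			tramp_table[i].frame = frame;
			tramp_table[i].fn = fn;
			return i;
		}
	}
	return -1;
}

/* Give a slot back once the nested function can no longer be called. */
static void tramp_release(int idx)
{
	if (idx >= 0 && idx < TRAMP_SLOTS) {
		tramp_table[idx].frame = NULL;
		tramp_table[idx].fn = NULL;
	}
}

/*
 * What each stub jumps to. idx is the constant the stub loaded into the
 * scratch register; the handler forwards to the bound function.
 */
static long trampoline_handler(int idx, long arg)
{
	if (idx < 0 || idx >= TRAMP_SLOTS || tramp_table[idx].fn == NULL)
		return -1;
	return tramp_table[idx].fn(tramp_table[idx].frame, arg);
}

/* Example "nested" function: reads one captured variable from frame. */
static long add_captured(void *frame, long arg)
{
	return *(long *)frame + arg;
}
```

Because the stubs only differ in the constant they load, the code page can
be built once and never written again; all per-instance state lives in the
table.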

You might then need a driver to create you a suitable
executable page. Somehow you need to pass in the address
of the trampoline_handler and the number for the first fault.
It needs to pass back the 'stride' of the array and the number
of elements created.

But if you can take the cost of the page fault, then
you can interpret the existing trampoline in userspace
within the signal handler.
This is two kernel entry/exits.

Arbitrary JIT is a different problem entirely.

	David



* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
                       ` (2 preceding siblings ...)
  2020-07-30 13:09     ` David Laight
@ 2020-07-30 14:24     ` Madhavan T. Venkataraman
  2020-07-30 20:54       ` Andy Lutomirski
  2020-07-30 14:42     ` Madhavan T. Venkataraman
  2020-08-02 18:54     ` Madhavan T. Venkataraman
  5 siblings, 1 reply; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-30 14:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML



Sorry for the delay. I just wanted to think about this a little.
In this email, I will respond to your first suggestion. I will
respond to the rest in separate emails if that is alright with
you.

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.

Let me state what I have understood about this suggestion. Correct me if
I get anything wrong. If you don't mind, I will also take the liberty
of generalizing and paraphrasing your suggestion.

The goal is to create two page mappings that are adjacent to each other:

	- a code page that contains template code for a trampoline. Since the
	  template code would tend to be small in size, pack as many of them
	  as possible within a page to conserve memory. In other words, create
	  an array of the template code fragments. Each element in the array
	  would be used for one trampoline instance.

	- a data page that contains an array of data elements. Corresponding
	  to each code element in the code page, there would be a data element
	  in the data page that would contain data that is specific to a
	  trampoline instance.

	- Code will access data using PC-relative addressing.

The management of the code pages and allocation for each trampoline
instance would all be done in user space.

Is this the general idea?
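
To make the layout concrete, here is a small sketch of the address
arithmetic it implies. The constants are assumptions for illustration only
(4096-byte pages, and a 16-byte stride shared by code stubs and data slots),
and the tramp_* helper names are hypothetical. The point is that a stub can
find its data at a fixed distance from its own address, which is what
PC-relative addressing provides.

```c
#include <stddef.h>
#include <stdint.h>

#define TRAMP_PAGE_SIZE	4096u	/* assumed page size */
#define TRAMP_STRIDE	16u	/* assumed size of one stub and one data slot */
#define TRAMP_PER_PAGE	(TRAMP_PAGE_SIZE / TRAMP_STRIDE)

/* Address of stub number idx within the code page at code_base. */
static uintptr_t tramp_code_addr(uintptr_t code_base, unsigned int idx)
{
	return code_base + (uintptr_t)idx * TRAMP_STRIDE;
}

/*
 * Address of the data slot for the stub at code_addr. The data page is
 * mapped directly after the code page, so the slot sits exactly one
 * page above the stub, at the same offset within its page.
 */
static uintptr_t tramp_data_addr(uintptr_t code_addr)
{
	return code_addr + TRAMP_PAGE_SIZE;
}

/* Recover the stub number from a stub address. */
static unsigned int tramp_index(uintptr_t code_base, uintptr_t code_addr)
{
	return (unsigned int)((code_addr - code_base) / TRAMP_STRIDE);
}
```

With these numbers a single code/data page pair serves 256 trampoline
instances.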

Creating a code page
--------------------

We can do this in one of the following ways:

- Allocate a writable page at run time, write the template code into
  the page and have execute permissions on the page.

- Allocate a writable page at run time, write the template code into
  the page and remap the page with just execute permissions.

- Allocate a writable page at run time, write the template code into
  the page, write the page into a temporary file and map the file with
  execute permissions.

- Include the template code in a code page at build time itself and
  just remap the code page each time you need a code page.

Pros and Cons
-------------

As long as the OS provides the functionality to do this and the security
subsystem in the OS allows the actions, this is totally feasible. If not,
we need something like trampfd.

As Florian mentioned, libffi does implement something like this for MACH.

In fact, in my libffi changes, I use trampfd only after all the other methods
have failed because of security settings.

But the above approach only solves the problem for this simple type of
trampoline. It does not provide a framework for addressing more complex types
or even other forms of dynamic code.

Also, each application would need to implement this solution for itself
as opposed to relying on one implementation provided by the kernel.

Trampfd-based solution
----------------------

I outlined an enhancement to trampfd in a response to David Laight. In this
enhancement, the kernel is the one that would set up the code page.

The kernel would call an arch-specific support function to generate the
code required to load registers, push values on the stack and jump to a PC
for a trampoline instance based on its current context. The trampoline
instance data could be baked into the code.

My initial idea was to only have one trampoline instance per page. But I
think I can implement multiple instances per page. I just have to manage
the trampfd file private data and VMA private data accordingly to map an
element in a code page to its trampoline object.

The two approaches are similar except for the detail about who sets up
and manages the trampoline pages. In both approaches, the performance problem
is addressed. But trampfd can be used even when security settings are
restrictive.

Is my solution acceptable?

A couple of things
------------------

- In the current trampfd implementation, no physical pages are actually
  allocated. It is just a virtual mapping. From a memory footprint
  perspective, this is good. Maybe we can let the user specify whether
  they want a fast trampoline that consumes memory or a slow one that
  doesn't?

- In the future, we may define additional types that need the kernel to do
  the job. Examples:

	- The kernel may have a trampoline type for which it is not willing
	  or able to generate code
	- The kernel could emulate dynamic code for the user
	- The kernel could interpret dynamic code for the user
	- The kernel could allow the user to access some kernel
	  functionality using the framework

  In such cases, there isn't any physical code page that gets mapped into
  the user address space. We need the kernel to handle the address fault
  cases and provide the functionality.

One question for the reviewers
------------------------------

Do you think that the file descriptor based approach is fine? Or, does this
need a regular system call based implementation? There are some advantages
with a regular system call:

	- We don't consume file descriptors. E.g., in libffi, we have to
	  keep the file descriptor open for a closure until the closure
	  is freed.

	- Trampoline operations can be performed based on the trampoline
	  address instead of an fd.

	- Sharing of objects across processes can be implemented through
	  a regular ID based method rather than sending the file descriptor
	  over a unix domain socket.

	- Shared objects can be persistent.

	- An fd based API does structure parsing in read()/write() calls
	  to obtain arguments. With a regular system call, that is not
	  necessary.

Please let me know your thoughts.

Madhavan




* Re: [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
  2020-07-30  9:06     ` Greg KH
@ 2020-07-30 14:25       ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-30 14:25 UTC (permalink / raw)
  To: Greg KH
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

Yes. I will fix this.

Thanks.

Madhavan

On 7/30/20 4:06 AM, Greg KH wrote:
> On Tue, Jul 28, 2020 at 08:10:48AM -0500, madvenka@linux.microsoft.com wrote:
>> +EXPORT_SYMBOL_GPL(trampfd_valid_regs);
> Why are all of these exported?  I don't see a module user in this
> series, or did I miss it somehow?
>
> EXPORT_SYMBOL* is only needed for symbols to be used by modules, not by
> code that is built into the kernel.
>
> thanks,
>
> greg k-h



* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
                       ` (3 preceding siblings ...)
  2020-07-30 14:24     ` Madhavan T. Venkataraman
@ 2020-07-30 14:42     ` Madhavan T. Venkataraman
  2020-08-02 18:54     ` Madhavan T. Venkataraman
  5 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-30 14:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML

For some reason my email program is not delivering to all the
recipients because of some formatting issues. I am resending.
I apologize. I will try to get this fixed.

Sorry for the delay. I just needed to think about it a little.
I will respond to your first suggestion in this email. I will
respond to the others in separate emails if that is alright
with you.

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.

Let me state my understanding of what you are suggesting. Correct me if
I get anything wrong. If you don't mind, I will also take the liberty
of generalizing and paraphrasing your suggestion.

The goal is to create two page mappings that are adjacent to each other:

- a code page that contains template code for a trampoline. Since the
  template code would tend to be small in size, pack as many of them
  as possible within a page to conserve memory. In other words, create
  an array of the template code fragments. Each element in the array
  would be used for one trampoline instance.

- a data page that contains an array of data elements. Corresponding
  to each code element in the code page, there would be a data element
  in the data page that would contain data that is specific to a
  trampoline instance.

- Code will access data using PC-relative addressing.

The management of the code pages and allocation for each trampoline
instance would all be done in user space.

Is this the general idea?

Creating a code page
--------------------

We can do this in one of the following ways:
- Allocate a writable page at run time, write the template code into
  the page and have execute permissions on the page.

- Allocate a writable page at run time, write the template code into
  the page and remap the page with just execute permissions.

- Allocate a writable page at run time, write the template code into
  the page, write the page into a temporary file and map the file with
  execute permissions.

- Include the template code in a code page at build time itself and
  just remap the code page each time you need a code page.

Pros and Cons
-------------

As long as the OS provides the functionality to do this and the security
subsystem in the OS allows the actions, this is totally feasible. If not,
we need something like trampfd.

As Florian mentioned, libffi does implement something like this for MACH.

In fact, in my libffi changes, I use trampfd only after all the other methods
have failed because of security settings.

But the above approach only solves the problem for this simple type of
trampoline. It does not provide a framework for addressing more complex types
or even other forms of dynamic code.

Also, each application would need to implement this solution for itself
as opposed to relying on one implementation provided by the kernel.

Trampfd-based solution
----------------------

I outlined an enhancement to trampfd in a response to David Laight. In this
enhancement, the kernel is the one that would set up the code page.

The kernel would call an arch-specific support function to generate the
code required to load registers, push values on the stack and jump to a PC
for a trampoline instance based on its current context. The trampoline
instance data could be baked into the code.

My initial idea was to only have one trampoline instance per page. But I
think I can implement multiple instances per page. I just have to manage
the trampfd file private data and VMA private data accordingly to map an
element in a code page to its trampoline object.

The two approaches are similar except for the detail about who sets up
and manages the trampoline pages. In both approaches, the performance problem
is addressed. But trampfd can be used even when security settings are
restrictive.

Is my solution acceptable?

A couple of things
------------------

- In the current trampfd implementation, no physical pages are actually
  allocated. It is just a virtual mapping. From a memory footprint
  perspective, this is good. Maybe we can let the user specify whether
  they want a fast trampoline that consumes memory or a slow one that doesn't?

- In the future, we may define additional types that need the kernel to do
  the job. Examples:

    - The kernel may have a trampoline type for which it is not willing
      or able to generate code

    - The kernel could emulate dynamic code for the user

    - The kernel could interpret dynamic code for the user

    - The kernel could allow the user to access some kernel functionality
      using the framework

  In such cases, there isn't any physical code page that gets mapped into
  the user address space. We need the kernel to handle the address fault
  and provide the functionality.

One question for the reviewers
------------------------------

Do you think that the file descriptor based approach is fine? Or, does this
need a regular system call based implementation? There are some advantages
with a regular system call:

- We don't consume file descriptors. E.g., in libffi, we have to
  keep the file descriptor open for a closure until the closure
  is freed.

- Trampoline operations can be performed based on the trampoline
  address instead of an fd.

- Sharing of objects across processes can be implemented through
  a regular ID based method rather than sending the file descriptor
  over a unix domain socket.

- Shared objects can be persistent.

- An fd based API does structure parsing in read()/write() calls
  to obtain arguments. With a regular system call, that is not
  necessary.

Please let me know your thoughts.

Madhavan


* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-30 14:24     ` Madhavan T. Venkataraman
@ 2020-07-30 20:54       ` Andy Lutomirski
  2020-07-31 17:13         ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 49+ messages in thread
From: Andy Lutomirski @ 2020-07-30 20:54 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Sorry for the delay. I just wanted to think about this a little.
> In this email, I will respond to your first suggestion. I will
> respond to the rest in separate emails if that is alright with
> you.
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
>
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page.  I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed.  The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping.  This will be *much*
> faster than trampfd. How much of your use case would it cover?  For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> Let me state what I have understood about this suggestion. Correct me if
> I get anything wrong. If you don't mind, I will also take the liberty
> of generalizing and paraphrasing your suggestion.
>
> The goal is to create two page mappings that are adjacent to each other:
>
> - a code page that contains template code for a trampoline. Since the
>  template code would tend to be small in size, pack as many of them
>  as possible within a page to conserve memory. In other words, create
>  an array of the template code fragments. Each element in the array
>  would be used for one trampoline instance.
>
> - a data page that contains an array of data elements. Corresponding
>  to each code element in the code page, there would be a data element
>  in the data page that would contain data that is specific to a
>  trampoline instance.
>
> - Code will access data using PC-relative addressing.
>
> The management of the code pages and allocation for each trampoline
> instance would all be done in user space.
>
> Is this the general idea?

Yes.
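
For concreteness, the layout described above might be sketched in C
roughly as follows. This is only an illustration under assumptions:
SLOT_SIZE, struct tramp_data and the tramp_* helpers are made-up names,
the code page is left as a PROT_NONE placeholder (a real implementation
would map the build-time template text there r-x), and each data element
sits exactly one page after its code element so that one PC-relative
offset works for every slot.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLOT_SIZE 32u   /* assumed size of one template fragment, in bytes */

/* Hypothetical per-trampoline data; one element per code slot. */
struct tramp_data {
    void *target;       /* where the trampoline should jump */
    void *closure;      /* argument handed to the target */
};

/* Data slots use the same stride as code slots, so they must fit. */
_Static_assert(sizeof(struct tramp_data) <= SLOT_SIZE, "data slot too big");

struct tramp_table {
    unsigned char *base;    /* start of the code page */
    size_t page;            /* page size; also the code-to-data offset */
    size_t nslots;          /* trampolines per page */
};

/* Reserve two adjacent pages: [code page][data page]. */
static int tramp_table_map(struct tramp_table *t)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    void *p = mmap(NULL, 2 * (size_t)page, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* The data page is read/write and never executable. */
    if (mprotect((unsigned char *)p + page, (size_t)page,
                 PROT_READ | PROT_WRITE) != 0) {
        munmap(p, 2 * (size_t)page);
        return -1;
    }

    t->base = p;
    t->page = (size_t)page;
    t->nslots = (size_t)page / SLOT_SIZE;
    return 0;
}

/* Address of the i-th code fragment in the code page. */
static unsigned char *tramp_code_slot(const struct tramp_table *t, size_t i)
{
    assert(i < t->nslots);
    return t->base + i * SLOT_SIZE;
}

/* Matching data element: same index, one page past the code slot. */
static struct tramp_data *tramp_data_slot(const struct tramp_table *t, size_t i)
{
    return (struct tramp_data *)(tramp_code_slot(t, i) + t->page);
}
```

User code would fill in tramp_data_slot(&t, i) for a new trampoline and
hand out tramp_code_slot(&t, i) as its callable address.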

>
> Creating a code page
> --------------------
>
> We can do this in one of the following ways:
>
> - Allocate a writable page at run time, write the template code into
>   the page and have execute permissions on the page.
>
> - Allocate a writable page at run time, write the template code into
>   the page and remap the page with just execute permissions.
>
> - Allocate a writable page at run time, write the template code into
>   the page, write the page into a temporary file and map the file with
>   execute permissions.
>
> - Include the template code in a code page at build time itself and
>   just remap the code page each time you need a code page.

This latter part shouldn't need any special permissions as far as I know.

>
> Pros and Cons
> -------------
>
> As long as the OS provides the functionality to do this and the security
> subsystem in the OS allows the actions, this is totally feasible. If not,
> we need something like trampfd.
>
> As Florian mentioned, libffi does implement something like this for MACH.
>
> In fact, in my libffi changes, I use trampfd only after all the other methods
> have failed because of security settings.
>
> But the above approach only solves the problem for this simple type of
> trampoline. It does not provide a framework for addressing more complex types
> or even other forms of dynamic code.
>
> Also, each application would need to implement this solution for itself
> as opposed to relying on one implementation provided by the kernel.

I would argue this is a benefit.  If the whole implementation is in
userspace, there is no ABI compatibility issue.  The user program
contains the trampoline code and the code that uses it.

>
> Trampfd-based solution
> ----------------------
>
> I outlined an enhancement to trampfd in a response to David Laight. In this
> enhancement, the kernel is the one that would set up the code page.
>
> The kernel would call an arch-specific support function to generate the
> code required to load registers, push values on the stack and jump to a PC
> for a trampoline instance based on its current context. The trampoline
> instance data could be baked into the code.
>
> My initial idea was to only have one trampoline instance per page. But I
> think I can implement multiple instances per page. I just have to manage
> the trampfd file private data and VMA private data accordingly to map an
> element in a code page to its trampoline object.
>
> The two approaches are similar except for the detail about who sets up
> and manages the trampoline pages. In both approaches, the performance problem
> is addressed. But trampfd can be used even when security settings are
> restrictive.
>
> Is my solution acceptable?

Perhaps.  In general, before adding a new ABI to the kernel, it's nice
to understand how it's better than doing the same thing in userspace.
Saying that it's easier for user code to work with if it's in the
kernel isn't necessarily an adequate justification.

Why would remapping two pages of actual application text ever fail?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-30 20:54       ` Andy Lutomirski
@ 2020-07-31 17:13         ` Madhavan T. Venkataraman
  2020-07-31 18:31           ` Mark Rutland
  2020-08-02 13:57           ` Florian Weimer
  0 siblings, 2 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-31 17:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML



On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> ...
>> Creating a code page
>> --------------------
>>
>> We can do this in one of the following ways:
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page and have execute permissions on the page.
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page and remap the page with just execute permissions.
>>
>> - Allocate a writable page at run time, write the template code into
>>   the page, write the page into a temporary file and map the file with
>>   execute permissions.
>>
>> - Include the template code in a code page at build time itself and
>>   just remap the code page each time you need a code page.
> This latter part shouldn't need any special permissions as far as I know.

Agreed.
>
>> Pros and Cons
>> -------------
>>
>> As long as the OS provides the functionality to do this and the security
>> subsystem in the OS allows the actions, this is totally feasible. If not,
>> we need something like trampfd.
>>
>> As Florian mentioned, libffi does implement something like this for MACH.
>>
>> In fact, in my libffi changes, I use trampfd only after all the other methods
>> have failed because of security settings.
>>
>> But the above approach only solves the problem for this simple type of
>> trampoline. It does not provide a framework for addressing more complex types
>> or even other forms of dynamic code.
>>
>> Also, each application would need to implement this solution for itself
>> as opposed to relying on one implementation provided by the kernel.
> I would argue this is a benefit.  If the whole implementation is in
> userspace, there is no ABI compatibility issue.  The user program
> contains the trampoline code and the code that uses it.

The current trampfd implementation also does not have an ABI issue.
ABI details are to be handled in user land. In the case of libffi, they
are. Trampfd only addresses the trampoline required to jump to the
ABI handler.

>
>> Trampfd-based solution
>> ----------------------
>>
>> I outlined an enhancement to trampfd in a response to David Laight. In this
>> enhancement, the kernel is the one that would set up the code page.
>>
>> The kernel would call an arch-specific support function to generate the
>> code required to load registers, push values on the stack and jump to a PC
>> for a trampoline instance based on its current context. The trampoline
>> instance data could be baked into the code.
>>
>> My initial idea was to only have one trampoline instance per page. But I
>> think I can implement multiple instances per page. I just have to manage
>> the trampfd file private data and VMA private data accordingly to map an
>> element in a code page to its trampoline object.
>>
>> The two approaches are similar except for the detail about who sets up
>> and manages the trampoline pages. In both approaches, the performance problem
>> is addressed. But trampfd can be used even when security settings are
>> restrictive.
>>
>> Is my solution acceptable?
> Perhaps.  In general, before adding a new ABI to the kernel, it's nice
> to understand how it's better than doing the same thing in userspace.
> Saying that it's easier for user code to work with if it's in the
> kernel isn't necessarily an adequate justification.

Fair enough.

Dealing with multiple architectures
-----------------------------------------------

One good reason to use trampfd is multiple architecture support. The
trampoline table in a code page approach is neat. I don't deny that at
all. But my question is - can it be used in all cases?

It requires PC-relative data references. I have not worked on all architectures.
So, I need to study this. But do all ISAs support PC-relative data references?

Even in an ISA that supports it, there would be a maximum supported offset
from the current PC that can be reached for a data reference. That maximum
needs to be at least the size of a base page in the architecture. This is because
the code page and the data page need to be separate for security reasons.
Do all ISAs support a sufficiently large offset?

When the kernel generates the code for a trampoline, it can hard code data values
in the generated code itself so it does not need PC-relative data referencing.

And, for ISAs that do support the large offset, we do have to implement and
maintain the code page stuff for different ISAs for each application and library
if we did not use trampfd.

If you look at the libffi reference patch that I have linked in the cover letter,
I have added functions in common code that wrap trampfd calls. From architecture
specific code, there is just one function call to one of those wrapper functions
to set the register context for the trampoline. This is a very small C code change
in each architecture. So, support can be extended to all architectures without
exception easily.

Runtime generated trampolines
-------------------------------------------

libffi trampolines are simple. But there may be many cases out there where the
trampoline code cannot be statically defined at build time. It may have to be
generated at runtime. For this, we will need trampfd.

Security
-----------

With the user level trampoline table approach, the data part of the trampoline table
can be hacked by an attacker if an application has a vulnerability. Specifically, the
target PC can be altered to some arbitrary location. Trampfd implements an
"Allowed PCs" context. In the libffi changes, I have created a read-only array of
all ABI handlers used in closures for each architecture. This read-only array
can be used to restrict the PC values for libffi trampolines to prevent hacking.

To generalize, we can implement security rules/features if the trampoline
object is in the kernel.
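
To illustrate the allow-list idea only, here is a small userspace C
sketch. It is not the trampfd interface: the handler functions, the
allowed_pcs array and pc_is_allowed() are hypothetical names, and the
check shown is just the comparison that would be applied to a requested
target PC wherever the rule is enforced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*handler_fn)(void);

/* Hypothetical ABI handlers, standing in for libffi closure entry points. */
static int handler_a(void) { return 1; }
static int handler_b(void) { return 2; }
/* A function that is deliberately NOT on the list. */
static int not_a_handler(void) { return 3; }

/*
 * Build-time table of permitted targets. Being const, it is placed in
 * read-only data (read-only after relocation in a PIE, given RELRO).
 */
static const handler_fn allowed_pcs[] = { handler_a, handler_b };

/* Return true only if pc exactly matches one of the listed handlers. */
static bool pc_is_allowed(uintptr_t pc)
{
    for (size_t i = 0; i < sizeof allowed_pcs / sizeof allowed_pcs[0]; i++)
        if ((uintptr_t)allowed_pcs[i] == pc)
            return true;
    return false;
}
```

With this, pc_is_allowed((uintptr_t)handler_a) holds while
pc_is_allowed((uintptr_t)not_a_handler) does not, so a target PC that has
been redirected to an unlisted address would be rejected.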

Standardization
---------------------

Trampfd is a framework that can be used to implement multiple things. Maybe,
a few of those things can also be implemented in user land itself. But I think having
just one mechanism to execute dynamic code objects is preferable to having
multiple mechanisms not standardized across all applications.

As an example, let us say that I am able to implement support for JIT code. Let us
say that an interpreter uses libffi to execute a generated function. The interpreter
would use trampfd for the JIT code object and get an address. Then, it would pass
that to libffi which would then use trampfd for the trampoline. So, trampfd based
code objects can be chained.

> Why would remapping two pages of actual application text ever fail?

Remapping a page may not be available on all OSes. However, that is not a problem
for the code page approach. One can always memory map the code page from the
binary file directly. So, yes, this would not fail.
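
As a rough sketch of what mapping the code page from the binary could
look like on Linux: map_code_page() is a made-up helper, and the file
offset of the template code page is assumed to be known to the caller
(finding it is not shown here).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map one page of `path`, starting at the page-aligned `file_off`,
 * as read + execute. Returns NULL on failure.
 */
static void *map_code_page(const char *path, off_t file_off)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || file_off % page != 0)
        return NULL;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    /* File-backed, private, never writable: no W+X page is involved. */
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_EXEC,
                   MAP_PRIVATE, fd, file_off);
    close(fd);  /* the mapping stays valid after the fd is closed */

    return p == MAP_FAILED ? NULL : p;
}
```

Each call returns a fresh mapping of the same file page, for example
map_code_page("/proc/self/exe", off) with off being the assumed offset of
the template page, so it can be repeated whenever another code page is
needed.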

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
                     ` (6 preceding siblings ...)
  2020-07-28 17:31   ` Andy Lutomirski
@ 2020-07-31 18:09   ` Mark Rutland
  2020-07-31 20:08     ` Madhavan T. Venkataraman
  2020-08-03 16:57     ` Madhavan T. Venkataraman
  7 siblings, 2 replies; 49+ messages in thread
From: Mark Rutland @ 2020-07-31 18:09 UTC (permalink / raw)
  To: madvenka
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

Hi,

On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.

For the purpose of below, IIUC this assumes the adversary has an
arbitrary write.

> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
> 
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.

It's not clear to me what power the adversary is assumed to have here,
and consequently it's not clear to me how the proposal mitigates this.

For example, if the attacker can control the arguments to syscalls, and
has an arbitrary write as above, what prevents them from creating a
trampfd of their own?

[...]

> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.

IIUC generally nested functions are avoided these days, specifically to
prevent the creation of gadgets on the stack. So I don't think those are
relevant as a case to care about. Applications using them should move
to not using them, and would be more secure generally for doing so.

[...]

> Trampoline File Descriptor (trampfd)
> --------------------------
> 
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.

What's the rationale for the kernel emulating the trampoline here?

In the case of EMUTRAMP this was necessary to work with existing
application binaries and kernel ABIs which placed instructions onto the
stack, and the stack needed to remain RW for other reasons. That
restriction doesn't apply here.

Assuming trampfd creation is somehow authenticated, the code could be
placed in a r-x page (which the kernel could refuse to add write
permission), in order to prevent modification. If that's sufficient,
it's not much of a leap to allow userspace to generate the code.

> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
> 
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.

Can you elaborate on this: what sort of checks would be applied, and
how?

Why is this not possible in a r-x user page?

[...]

> - trampfd provides a basic framework. In the future, new trampoline types
>   can be implemented, new contexts can be defined, and additional rules
>   can be implemented for security purposes.

From a kernel developer perspective, this reads as "this ABI will become
more complex", which I think is worrisome.

I'm also worried that this is liable to have nasty interaction with HW
CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
we bake incompatibility into ABI.

> - For instance, trampfd defines an "Allowed PCs" context in this initial
>   work. As an example, libffi can create a read-only array of all ABI
>   handlers for an architecture at build time. This array can be used to
>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>   cannot hack the PC part of the register context and make it point to
>   arbitrary locations.

I'm not exactly sure what's meant here. Do you mean that this prevents
userspace from branching into the middle of a trampoline, or that the
trampfd code restricts where the trampoline itself can branch to?

Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
former, and I believe the latter can also be implemented in userspace
with defensive checks in the trampolines, provided that they are
protected read-only.

> - An SELinux setting called "exectramp" can be implemented along the
>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>   use of trampolines on a per application basis.
> 
> - User code can add defensive checks in the code before invoking a
>   trampoline to make sure that a hacker has not modified the context data.
>   It can do this by getting the trampoline context from the kernel and
>   double checking it.

As above, without examples it's not clear to me what sort of checks are
possible nor where they would need to be made. So it's difficult to see
whether that's actually possible or subject to TOCTTOU races and
similar.

> - In the future, if the kernel can be enhanced to use a safe code
>   generation component, that code can be placed in the trampoline mapping
>   pages. Then, the trampoline invocation does not have to incur a trip
>   into the kernel.
> 
> - Also, if the kernel can be enhanced to use a safe code generation
>   component, other forms of dynamic code such as JIT code can be
>   addressed by the trampfd framework.

I don't see why it's necessary for the kernel to generate code at all.
If the trampfd creation requests can be trusted, what prevents trusting
a sealed set of instructions generated in userspace?

> - Trampolines can be shared across processes which can give rise to
>   interesting uses in the future.

This sounds like the use-case of a sealed memfd. Is a sealed executable
memfd not sufficient?
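
To make the question concrete, a sealed memfd could be set up along these
lines. memfd_create() and the F_ADD_SEALS fcntl are existing Linux
interfaces; make_sealed_code_fd() and map_code_rx() are hypothetical
helper names, and whether a given LSM policy would allow the executable
mapping is a separate matter that this sketch does not address.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy `len` bytes of `code` into a new memfd, then seal it so the
 * contents and size can no longer change. Returns the fd, or -1.
 */
static int make_sealed_code_fd(const void *code, size_t len)
{
    int fd = memfd_create("trampoline", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (write(fd, code, len) != (ssize_t)len)
        goto fail;

    /* F_SEAL_SEAL also prevents any further change to the seal set. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_WRITE | F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL) < 0)
        goto fail;

    return fd;
fail:
    close(fd);
    return -1;
}

/* Map the sealed contents read + execute. Returns NULL on failure. */
static void *map_code_rx(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Once sealed, the fd could be passed to another process (for example over
a unix domain socket), and a later write() on it would be expected to
fail.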

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 17:13         ` Madhavan T. Venkataraman
@ 2020-07-31 18:31           ` Mark Rutland
  2020-08-03  8:27             ` David Laight
  2020-08-03 17:58             ` Madhavan T. Venkataraman
  2020-08-02 13:57           ` Florian Weimer
  1 sibling, 2 replies; 49+ messages in thread
From: Mark Rutland @ 2020-07-31 18:31 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
> > On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
> > <madvenka@linux.microsoft.com> wrote:
> Dealing with multiple architectures
> -----------------------------------------------
> 
> One good reason to use trampfd is multiple architecture support. The
> trampoline table in a code page approach is neat. I don't deny that at
> all. But my question is - can it be used in all cases?
> 
> It requires PC-relative data references. I have not worked on all architectures.
> So, I need to study this. But do all ISAs support PC-relative data references?

Not all do, but pretty much any recent ISA will as it's a practical
necessity for fast position-independent code.

> Even in an ISA that supports it, there would be a maximum supported offset
> from the current PC that can be reached for a data reference. That maximum
> needs to be at least the size of a base page in the architecture. This is because
> the code page and the data page need to be separate for security reasons.
> Do all ISAs support a sufficiently large offset?

ISAs with PC-relative addressing can usually generate PC-relative
addresses into a GPR, from which they can apply an arbitrarily large
offset.

> When the kernel generates the code for a trampoline, it can hard code data values
> in the generated code itself so it does not need PC-relative data referencing.
> 
> And, for ISAs that do support the large offset, we do have to implement and
> maintain the code page stuff for different ISAs for each application and library
> if we did not use trampfd.

Trampoline code is architecture specific today, so I don't see that as a
major issue. Common structural bits can probably be shared even if the
specific machine code cannot.

[...]

> Security
> -----------
> 
> With the user level trampoline table approach, the data part of the trampoline table
> can be hacked by an attacker if an application has a vulnerability. Specifically, the
> target PC can be altered to some arbitrary location. Trampfd implements an
> "Allowed PCs" context. In the libffi changes, I have created a read-only array of
> all ABI handlers used in closures for each architecture. This read-only array
> can be used to restrict the PC values for libffi trampolines to prevent hacking.
> 
> To generalize, we can implement security rules/features if the trampoline
> object is in the kernel.

I don't follow this argument. If it's possible to statically define that
in the kernel, it's also possible to do that in userspace without any
new kernel support.

[...]

> Trampfd is a framework that can be used to implement multiple things. Maybe,
> a few of those things can also be implemented in user land itself. But I think having
> just one mechanism to execute dynamic code objects is preferable to having
> multiple mechanisms not standardized across all applications.

In abstract, having a common interface sounds nice, but in practice
elements of this are always architecture-specific (e.g. interactions
with HW CFI), and that common interface can result in more pain as it
doesn't fit naturally into the context that ISAs were designed for (e.g. 
where control-flow instructions are extended with new semantics).

It also means that you can't share the rough approach across OSs which
do not implement an identical mechanism, so for code abstracting by ISA
first, then by platform/ABI, there isn't much saving.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:09   ` Mark Rutland
@ 2020-07-31 20:08     ` Madhavan T. Venkataraman
  2020-08-03 16:57     ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-07-31 20:08 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

Thanks for the comments. I will respond to these and your next
email on Monday.

Madhavan

On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attacker can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?
>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a case to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In the case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.
>
> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.
>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?
>
> Why is this not possible in a r-x user page?
>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>>   can be implemented, new contexts can be defined, and additional rules
>>   can be implemented for security purposes.
> From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.
>
> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.
>
>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>>   work. As an example, libffi can create a read-only array of all ABI
>>   handlers for an architecture at build time. This array can be used to
>>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>>   cannot hack the PC part of the register context and make it point to
>>   arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code restricts where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.
>
>> - An SELinux setting called "exectramp" can be implemented along the
>>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>>   use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>>   trampoline to make sure that a hacker has not modified the context data.
>>   It can do this by getting the trampoline context from the kernel and
>>   double checking it.
> As above, without examples it's not clear to me what sort of checks are
> possible nor where they would need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.
>
>> - In the future, if the kernel can be enhanced to use a safe code
>>   generation component, that code can be placed in the trampoline mapping
>>   pages. Then, the trampoline invocation does not have to incur a trip
>>   into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>>   component, other forms of dynamic code such as JIT code can be
>>   addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?
>
>> - Trampolines can be shared across processes which can give rise to
>>   interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?
>
> Thanks,
> Mark.
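
To make the sealed-memfd question concrete, the following is a minimal
userspace sketch of what "a sealed executable memfd" could look like. It
is not code from this series. The helper names sealed_code_fd() and
map_sealed_code() are made up for illustration; memfd_create(),
fcntl(F_ADD_SEALS) and mmap() are the existing Linux interfaces. Whether
an LSM policy would permit the PROT_EXEC mapping is a separate question
that this sketch does not address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy `len` bytes of code into a new memfd and seal it against any
 * further modification. Returns the fd, or -1 on error.
 * (Hypothetical helper, for illustration only.) */
static int sealed_code_fd(const void *code, size_t len)
{
    int fd;

    assert(code != NULL && len > 0);

    fd = memfd_create("sealed-code", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (write(fd, code, len) != (ssize_t)len)
        goto fail;

    /* From here on the contents and the size are fixed, and
     * F_SEAL_SEAL prevents the set of seals from being changed. */
    if (fcntl(fd, F_ADD_SEALS,
              F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) < 0)
        goto fail;

    return fd;

fail:
    close(fd);
    return -1;
}

/* Map the sealed contents read+execute. The mapping is never writable.
 * Returns MAP_FAILED on error, like mmap(). */
static void *map_sealed_code(int fd, size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
}
```

Once the seals are in place, write() and ftruncate() on the fd fail, so
the page contents cannot change after they have been mapped executable.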


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-30 13:09     ` David Laight
@ 2020-08-02 11:56       ` Pavel Machek
  2020-08-03  8:08         ` David Laight
  0 siblings, 1 reply; 49+ messages in thread
From: Pavel Machek @ 2020-08-02 11:56 UTC (permalink / raw)
  To: David Laight
  Cc: 'Andy Lutomirski',
	madvenka, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

Hi!

> > This is quite clever, but now I'm wondering just how much kernel help
> > is really needed. In your series, the trampoline is a non-executable
> > page.  I can think of at least two alternative approaches, and I'd
> > like to know the pros and cons.
> > 
> > 1. Entirely userspace: a return trampoline would be something like:
> > 
> > 1:
> > pushq %rax
> > pushq %rbx
> > pushq %rcx
> > ...
> > pushq %r15
> > movq %rsp, %rdi # pointer to saved regs
> > leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > callq trampoline_handler # see below
> 
> For nested calls (where the trampoline needs to pass the
> original stack frame to the nested function) I think you
> just need a page full of:
> 	mov	$0, scratch_reg; jmp trampoline_handler

I believe you could do with mov %pc, scratch_reg; jmp ...

That has the advantage of being able to share a single physical page
across multiple virtual pages...
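
The page-sharing part can be sketched from userspace as below. This is
only an illustration of the mapping side: one page of a memfd is mapped
at several virtual addresses, so position-independent trampoline code in
it would see a different PC through each view. The helper names
one_page_fd() and map_page_views() are hypothetical, and the sketch maps
plain data rather than real trampoline code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a one-page memfd whose first `len` bytes are `content`.
 * Returns the fd, or -1 on error. (Hypothetical helper.) */
static int one_page_fd(const void *content, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd;

    assert(page > 0 && len <= (size_t)page);

    fd = memfd_create("tramp-page", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, page) < 0 ||
        write(fd, content, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the first page of `fd` at `n` distinct virtual addresses with
 * protection `prot`. Every views[i] is backed by the same physical
 * page. Returns 0 on success, -1 on failure. */
static int map_page_views(int fd, void **views, size_t n, int prot)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t i;

    for (i = 0; i < n; i++) {
        views[i] = mmap(NULL, (size_t)page, prot, MAP_SHARED, fd, 0);
        if (views[i] == MAP_FAILED)
            return -1;
    }
    return 0;
}
```

A real user would pass PROT_READ | PROT_EXEC and let the handler tell
the trampolines apart by the PC value each view reports.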


									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 17:13         ` Madhavan T. Venkataraman
  2020-07-31 18:31           ` Mark Rutland
@ 2020-08-02 13:57           ` Florian Weimer
  1 sibling, 0 replies; 49+ messages in thread
From: Florian Weimer @ 2020-08-02 13:57 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

* Madhavan T. Venkataraman:

> Standardization
> ---------------------
>
> Trampfd is a framework that can be used to implement multiple
> things. Maybe a few of those things can also be implemented in
> user land itself. But I think having just one mechanism to execute
> dynamic code objects is preferable to having multiple mechanisms not
> standardized across all applications.
>
> As an example, let us say that I am able to implement support for
> JIT code. Let us say that an interpreter uses libffi to execute a
> generated function. The interpreter would use trampfd for the JIT
> code object and get an address. Then, it would pass that to libffi
> which would then use trampfd for the trampoline. So, trampfd based
> code objects can be chained.

There is certainly value in coordination.  For example, it would be
nice if unwinders could recognize the trampolines during all phases
and unwind correctly through them (including when interrupted by an
asynchronous signal).  That requires some level of coordination with
the unwinder and dynamic linker.

A kernel solution could hide the intermediate state in a kernel-side
trap handler, but I think it wouldn't reduce the overall complexity.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-28 17:31   ` Andy Lutomirski
                       ` (4 preceding siblings ...)
  2020-07-30 14:42     ` Madhavan T. Venkataraman
@ 2020-08-02 18:54     ` Madhavan T. Venkataraman
  2020-08-02 20:00       ` Andy Lutomirski
  2020-08-03  8:23       ` David Laight
  5 siblings, 2 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-02 18:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML

More responses inline..

On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>
> 2. Use existing kernel functionality.  Raise a signal, modify the
> state, and return from the signal.  This is very flexible and may not
> be all that much slower than trampfd.

Let me understand this. You are saying that the trampoline code
would raise a signal and, in the signal handler, set up the context
so that when the signal handler returns, we end up in the target
function with the context correctly set up. And, this trampoline code
can be generated statically at build time so that there are no
security issues using it.

Have I understood your suggestion correctly?

So, my argument would be that this would always incur the overhead
of a trip to the kernel. I think twice the overhead if I am not mistaken.
With trampfd, we can have the kernel generate the code so that there
is no performance penalty at all.

Signals have many problems. Which signal number should we use for this
purpose? If we use an existing one, that might conflict with what the application
is already handling. Getting a new signal number for this could meet
with resistance from the community.

Also, signals are asynchronous. So, they are vulnerable to race conditions.
To prevent other signals from coming in while handling the raised signal,
we would need to block and unblock signals. This will cause more
overhead.

> 3. Use a syscall.  Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq $__NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.

How is this better than the kernel handling an address fault?
The system call still needs to do the same work as the fault handler.
We do need to specify the register and stack contexts beforehand
so the system call can do its job.

Also, this always incurs a trip to the kernel. With trampfd, the kernel
could generate the code to avoid the performance penalty.


>
> Also, will using trampfd cause issues with various unwinders?  I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.

I need to study unwinders a little before I respond to this question.
So, bear with me.

> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code.  This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc.  A cleverly designed JIT interface could function without
> serialization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games.  At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested.  There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd.  A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region.  On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page without any
> synchronization.  One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code.  Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs.  This may not work if <some more code> spans a page
> boundary.  The #BP fixup would zap the TLB and retry.  Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that cross I$ fetch
> boundaries.  I'm not sure to what extent I$ snooping helps.

I am thinking that the trampfd API can be used for addressing JIT
code as well. I have not yet started thinking about the details. But I
think the API is sufficient. E.g.,

    struct trampfd_jit {
        void    *source;
        size_t    len;
    };

    struct trampfd_jit    jit;
    struct trampfd_map    map;
    void    *addr;
    int    fd;

    jit.source = blah;
    jit.len = blah;

    fd = syscall(440, TRAMPFD_JIT, &jit, flags);
    pread(fd, &map, sizeof(map), TRAMPFD_MAP_OFFSET);
    addr = mmap(NULL, map.size, map.prot, map.flags, fd, map.offset);

And addr would be used to invoke the generated JIT code.

Madhavan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 18:54     ` Madhavan T. Venkataraman
@ 2020-08-02 20:00       ` Andy Lutomirski
  2020-08-02 22:58         ` Madhavan T. Venkataraman
  2020-08-03 18:36         ` Madhavan T. Venkataraman
  2020-08-03  8:23       ` David Laight
  1 sibling, 2 replies; 49+ messages in thread
From: Andy Lutomirski @ 2020-08-02 20:00 UTC (permalink / raw)
  To: Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> More responses inline..
>
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality.  Raise a signal, modify the
> > state, and return from the signal.  This is very flexible and may not
> > be all that much slower than trampfd.
>
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
>
> Have I understood your suggestion correctly?

yes.

>
> So, my argument would be that this would always incur the overhead
> of a trip to the kernel. I think twice the overhead if I am not mistaken.
> With trampfd, we can have the kernel generate the code so that there
> is no performance penalty at all.

I feel like trampfd is too poorly defined at this point to evaluate.
There are three general things it could do.  It could generate actual
code that varies by instance.  It could have static code that does not
vary.  And it could actually involve a kernel entry.

If it involves a kernel entry, then it's slow.  Maybe this is okay for
some use cases.

If it involves only static code, I see no good reason that it should
be in the kernel.

If it involves dynamic code, then I think it needs a clearly defined
use case that actually requires dynamic code.

> Also, signals are asynchronous. So, they are vulnerable to race conditions.
> To prevent other signals from coming in while handling the raised signal,
> we would need to block and unblock signals. This will cause more
> overhead.

If you're worried about raise() racing against signals from out of
thread, you have bigger problems to deal with.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 20:00       ` Andy Lutomirski
@ 2020-08-02 22:58         ` Madhavan T. Venkataraman
  2020-08-03 18:36         ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-02 22:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML



On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> On Sun, Aug 2, 2020 at 11:54 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> yes.
>
>> So, my argument would be that this would always incur the overhead
>> of a trip to the kernel. I think twice the overhead if I am not mistaken.
>> With trampfd, we can have the kernel generate the code so that there
>> is no performance penalty at all.
> I feel like trampfd is too poorly defined at this point to evaluate.
> There are three general things it could do.  It could generate actual
> code that varies by instance.  It could have static code that does not
> vary.  And it could actually involve a kernel entry.
>
> If it involves a kernel entry, then it's slow.  Maybe this is okay for
> some use cases.

Yes. IMO, it is OK for most cases except where dynamic code
is used specifically for enhancing performance such as interpreters
using JIT code for frequently executed sequences and dynamic
binary translation.

> If it involves only static code, I see no good reason that it should
> be in the kernel.

It does not involve only static code. This is meant for dynamic code.
However, see below.

> If it involves dynamic code, then I think it needs a clearly defined
> use case that actually requires dynamic code.

Fair enough. I will work on this and get back to you. This might take
a little time. So, bear with me.

But I would like to make one point here. There are many applications
and libraries out there that use trampolines. They may all require the
same sort of things:

    - set register context
    - push stuff on stack
    - jump to a target PC

But in each case, the context would be different:

    - only register context
    - only stack context
    - both register and stack contexts
    - different registers
    - different values pushed on the stack
    - different target PCs

If we had to do this purely at user level, each application/library would
need to roll its own solution, and that solution would have to be implemented
and maintained for each supported architecture. While the code is static
in each separate case, it is dynamic across all of them.

That is, the kernel will generate the code on the fly for each trampoline
instance based on its current context. It will not maintain any static
trampoline code at all.

Basically, it will supply the context to an arch-specific function and say:

    - generate instructions for loading these regs with these values
    - generate instructions to push these values on the stack
    - generate an instruction to jump to this target PC

It will place all of those generated instructions on a page and return the address.

So, even with the static case, there is a lot of value in the kernel providing
this. Plus, it has the framework to handle dynamic code.
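
To illustrate the kind of arch-specific generator described above, here
is a small userspace sketch for x86-64 only. It is not the kernel code
from this series. struct tramp_ctx, struct reg_load, gen_trampoline()
and the choice of %r11 as the scratch register are all assumptions made
for the example. It only emits bytes into a buffer and does not map or
execute them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one trampoline instance. */
struct reg_load { unsigned reg; uint64_t value; };  /* reg: 0..15 */

struct tramp_ctx {
    const struct reg_load *regs;  size_t nregs;   /* registers to load */
    const uint64_t *stack;        size_t nstack;  /* values to push    */
    uint64_t target;                              /* PC to jump to     */
};

#define SCRATCH_REG 11u  /* %r11, assumed free for the trampoline */

/* MOV r64, imm64: REX.W (+REX.B for r8..r15), B8+r, 8-byte immediate. */
static size_t emit_mov_imm64(uint8_t *out, unsigned reg, uint64_t imm)
{
    int i;

    out[0] = (uint8_t)(0x48 | (reg >> 3));
    out[1] = (uint8_t)(0xB8 | (reg & 7));
    for (i = 0; i < 8; i++)
        out[2 + i] = (uint8_t)(imm >> (8 * i));  /* little-endian */
    return 10;
}

/* PUSH r64: optional REX.B, then 50+r. */
static size_t emit_push(uint8_t *out, unsigned reg)
{
    size_t n = 0;

    if (reg >= 8)
        out[n++] = 0x41;
    out[n++] = (uint8_t)(0x50 | (reg & 7));
    return n;
}

/* JMP *r64: optional REX.B, FF /4 with a register operand. */
static size_t emit_jmp_reg(uint8_t *out, unsigned reg)
{
    size_t n = 0;

    if (reg >= 8)
        out[n++] = 0x41;
    out[n++] = 0xFF;
    out[n++] = (uint8_t)(0xE0 | (reg & 7));
    return n;
}

/* Emit: pushes, then register loads, then the jump to the target.
 * Returns the number of bytes written, or 0 if `cap` is too small. */
static size_t gen_trampoline(const struct tramp_ctx *c,
                             uint8_t *out, size_t cap)
{
    size_t need = c->nstack * 12 + c->nregs * 10 + 13;
    size_t n = 0, i;

    if (cap < need)
        return 0;
    for (i = 0; i < c->nstack; i++) {
        n += emit_mov_imm64(out + n, SCRATCH_REG, c->stack[i]);
        n += emit_push(out + n, SCRATCH_REG);
    }
    for (i = 0; i < c->nregs; i++) {
        assert(c->regs[i].reg < 16 && c->regs[i].reg != SCRATCH_REG);
        n += emit_mov_imm64(out + n, c->regs[i].reg, c->regs[i].value);
    }
    n += emit_mov_imm64(out + n, SCRATCH_REG, c->target);
    n += emit_jmp_reg(out + n, SCRATCH_REG);
    return n;
}
```

The point of the sketch is that the instruction sequence is the same for
every instance and only the immediates differ, which is the sense in
which the code is "static in each separate case".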

>> Also, signals are asynchronous. So, they are vulnerable to race conditions.
>> To prevent other signals from coming in while handling the raised signal,
>> we would need to block and unblock signals. This will cause more
>> overhead.
> If you're worried about raise() racing against signals from out of
> thread, you have bigger problems to deal with.

Agreed. The signal blocking is just one example of problems related
to signals. There are other bigger problems as well. So, let us remove
the signal-based approach from our discussions.

Thanks.

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 11:56       ` Pavel Machek
@ 2020-08-03  8:08         ` David Laight
  2020-08-03 15:57           ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2020-08-03  8:08 UTC (permalink / raw)
  To: 'Pavel Machek'
  Cc: 'Andy Lutomirski',
	madvenka, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

From: Pavel Machek <pavel@ucw.cz>
> Sent: 02 August 2020 12:56
> Hi!
> 
> > > This is quite clever, but now I'm wondering just how much kernel help
> > > is really needed. In your series, the trampoline is a non-executable
> > > page.  I can think of at least two alternative approaches, and I'd
> > > like to know the pros and cons.
> > >
> > > 1. Entirely userspace: a return trampoline would be something like:
> > >
> > > 1:
> > > pushq %rax
> > > pushq %rbx
> > > pushq %rcx
> > > ...
> > > pushq %r15
> > > movq %rsp, %rdi # pointer to saved regs
> > > leaq 1b(%rip), %rsi # pointer to the trampoline itself
> > > callq trampoline_handler # see below
> >
> > For nested calls (where the trampoline needs to pass the
> > original stack frame to the nested function) I think you
> > just need a page full of:
> > 	mov	$0, scratch_reg; jmp trampoline_handler
> 
> I believe you could do with mov %pc, scratch_reg; jmp ...
> 
> That has the advantage of being able to share a single physical
> page across multiple virtual pages...

A lot of architectures don't let you copy %pc that way so you would
have to use 'call' - but that trashes the return address cache.
It also needs the trampoline handler to know the addresses
of the trampolines.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 18:54     ` Madhavan T. Venkataraman
  2020-08-02 20:00       ` Andy Lutomirski
@ 2020-08-03  8:23       ` David Laight
  2020-08-03 15:59         ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 49+ messages in thread
From: David Laight @ 2020-08-03  8:23 UTC (permalink / raw)
  To: 'Madhavan T. Venkataraman', Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML

From: Madhavan T. Venkataraman
> Sent: 02 August 2020 19:55
> To: Andy Lutomirski <luto@kernel.org>
> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>; Linux API <linux-api@vger.kernel.org>;
> linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Linux FS Devel <linux-
> fsdevel@vger.kernel.org>; linux-integrity <linux-integrity@vger.kernel.org>; LKML <linux-
> kernel@vger.kernel.org>; LSM List <linux-security-module@vger.kernel.org>; Oleg Nesterov
> <oleg@redhat.com>; X86 ML <x86@kernel.org>
> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
> 
> More responses inline..
> 
> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
> >> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
> >>
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >
> > 2. Use existing kernel functionality.  Raise a signal, modify the
> > state, and return from the signal.  This is very flexible and may not
> > be all that much slower than trampfd.
> 
> Let me understand this. You are saying that the trampoline code
> would raise a signal and, in the signal handler, set up the context
> so that when the signal handler returns, we end up in the target
> function with the context correctly set up. And, this trampoline code
> can be generated statically at build time so that there are no
> security issues using it.
> 
> Have I understood your suggestion correctly?

I was thinking that you'd just let the 'not executable' page fault
signal happen (SIGSEGV?) when the jump to the on-stack trampoline
is executed.

The user signal handler can then decode the faulting instruction
and, if it matches the expected on-stack trampoline, modify the
saved registers before returning from the signal.

No kernel changes and all you need to add to the program is
an architecture-dependent signal handler.
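
A rough sketch of that idea follows, under several assumptions: Linux
with glibc, x86-64 or arm64, and a single non-executable anonymous page
standing in for the on-stack trampoline. A real handler would decode
the trampoline contents to find the target and static chain; here the
target is simply kept in a variable. All names (tramp_setup(),
call_via_tramp(), add_one()) are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

static void *tramp_page;               /* mapped without PROT_EXEC */
static long (*tramp_target)(long);     /* where the call should land */

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    ucontext_t *uc = ctx;

    (void)sig;
    if (si->si_addr != tramp_page) {
        /* Not our trampoline: let the fault recur with the
         * default action. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    /* Rewrite the saved PC; argument registers and the return
     * address set up by the caller are left untouched. */
#if defined(__x86_64__)
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)tramp_target;
#elif defined(__aarch64__)
    uc->uc_mcontext.pc = (uintptr_t)tramp_target;
#else
#error "add the saved-PC field for this architecture"
#endif
}

/* Map the fake trampoline page and install the handler.
 * Returns 0 on success, -1 on error. */
static int tramp_setup(long (*target)(long))
{
    struct sigaction sa = {0};
    long page = sysconf(_SC_PAGESIZE);

    assert(target != NULL && page > 0);
    tramp_page = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tramp_page == MAP_FAILED)
        return -1;
    tramp_target = target;

    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Call "through" the non-executable page; the fault handler
 * redirects the call to tramp_target. */
static long call_via_tramp(long arg)
{
    long (*fn)(long) = (long (*)(long))tramp_page;

    return fn(arg);
}

/* Example target used to exercise the mechanism. */
static long add_one(long x)
{
    return x + 1;
}
```

After tramp_setup(add_one), call_via_tramp(41) faults on the page, the
handler rewrites the saved PC, and add_one() runs with the original
argument. Every call costs a fault plus a signal delivery and return.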

	David


^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:31           ` Mark Rutland
@ 2020-08-03  8:27             ` David Laight
  2020-08-03 16:03               ` Madhavan T. Venkataraman
  2020-08-03 17:58             ` Madhavan T. Venkataraman
  1 sibling, 1 reply; 49+ messages in thread
From: David Laight @ 2020-08-03  8:27 UTC (permalink / raw)
  To: 'Mark Rutland', Madhavan T. Venkataraman
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

From: Mark Rutland
> Sent: 31 July 2020 19:32
...
> > It requires PC-relative data references. I have not worked on all architectures.
> > So, I need to study this. But do all ISAs support PC-relative data references?
> 
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

i386 has neither PC-relative addressing nor moves from %pc.
The cpu architecture knows that the sequence:
	call	1f  
1:	pop	%reg  
is used to get the %pc value so is treated specially so that
it doesn't 'trash' the return stack.

So PIC code isn't too bad, but you have to use the correct
sequence.

	David



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03  8:08         ` David Laight
@ 2020-08-03 15:57           ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 15:57 UTC (permalink / raw)
  To: David Laight, 'Pavel Machek'
  Cc: 'Andy Lutomirski',
	Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML



On 8/3/20 3:08 AM, David Laight wrote:
> From: Pavel Machek <pavel@ucw.cz>
>> Sent: 02 August 2020 12:56
>> Hi!
>>
>>>> This is quite clever, but now I'm wondering just how much kernel help
>>>> is really needed. In your series, the trampoline is a non-executable
>>>> page.  I can think of at least two alternative approaches, and I'd
>>>> like to know the pros and cons.
>>>>
>>>> 1. Entirely userspace: a return trampoline would be something like:
>>>>
>>>> 1:
>>>> pushq %rax
>>>> pushq %rbx
>>>> pushq %rcx
>>>> ...
>>>> pushq %r15
>>>> movq %rsp, %rdi # pointer to saved regs
>>>> leaq 1b(%rip), %rsi # pointer to the trampoline itself
>>>> callq trampoline_handler # see below
>>> For nested calls (where the trampoline needs to pass the
>>> original stack frame to the nested function) I think you
>>> just need a page full of:
>>> 	mov	$0, scratch_reg; jmp trampoline_handler
>> I believe you could do with mov %pc, scratch_reg; jmp ...
>>
>> That has the advantage of being able to share a single physical
>> page across multiple virtual pages...
> A lot of architectures don't let you copy %pc that way so you would
> have to use 'call' - but that trashes the return address cache.
> It also needs the trampoline handler to know the addresses
> of the trampolines.

Do you know which ones don't allow you to copy %pc?

Some of the architectures do not have PC-relative data references.
If they do not allow you to copy the PC into a general purpose
register, then there is no way to implement the statically defined
trampoline that has been discussed so far. In these cases, the
trampoline has to be generated at runtime.

Thanks.

Madhavan


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03  8:23       ` David Laight
@ 2020-08-03 15:59         ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 15:59 UTC (permalink / raw)
  To: David Laight, Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML



On 8/3/20 3:23 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 02 August 2020 19:55
>> To: Andy Lutomirski <luto@kernel.org>
>> Cc: Kernel Hardening <kernel-hardening@lists.openwall.com>; Linux API <linux-api@vger.kernel.org>;
>> linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Linux FS Devel <linux-
>> fsdevel@vger.kernel.org>; linux-integrity <linux-integrity@vger.kernel.org>; LKML <linux-
>> kernel@vger.kernel.org>; LSM List <linux-security-module@vger.kernel.org>; Oleg Nesterov
>> <oleg@redhat.com>; X86 ML <x86@kernel.org>
>> Subject: Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
>>
>> More responses inline..
>>
>> On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>>>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>>>
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>> 2. Use existing kernel functionality.  Raise a signal, modify the
>>> state, and return from the signal.  This is very flexible and may not
>>> be all that much slower than trampfd.
>> Let me understand this. You are saying that the trampoline code
>> would raise a signal and, in the signal handler, set up the context
>> so that when the signal handler returns, we end up in the target
>> function with the context correctly set up. And, this trampoline code
>> can be generated statically at build time so that there are no
>> security issues using it.
>>
>> Have I understood your suggestion correctly?
> I was thinking that you'd just let the 'not executable' page fault
> signal happen (SIGSEGV?) when the jump to the on-stack trampoline
> is executed.
>
> The user signal handler can then decode the faulting instruction
> and, if it matches the expected on-stack trampoline, modify the
> saved registers before returning from the signal.
>
> No kernel changes and all you need to add to the program is
> an architecture-dependent signal handler.

Understood.

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03  8:27             ` David Laight
@ 2020-08-03 16:03               ` Madhavan T. Venkataraman
  2020-08-03 16:57                 ` David Laight
  0 siblings, 1 reply; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 16:03 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML



On 8/3/20 3:27 AM, David Laight wrote:
> From: Mark Rutland
>> Sent: 31 July 2020 19:32
> ...
>>> It requires PC-relative data references. I have not worked on all architectures.
>>> So, I need to study this. But do all ISAs support PC-relative data references?
>> Not all do, but pretty much any recent ISA will as it's a practical
>> necessity for fast position-independent code.
> i386 has neither PC-relative addressing nor moves from %pc.
> The cpu architecture knows that the sequence:
> 	call	1f  
> 1:	pop	%reg  
> is used to get the %pc value so is treated specially so that
> it doesn't 'trash' the return stack.
>
> So PIC code isn't too bad, but you have to use the correct
> sequence.

Is that true for 32-bit systems only? I thought RIP-relative addressing was
introduced in 64-bit mode. Please confirm.

Madhavan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03 16:03               ` Madhavan T. Venkataraman
@ 2020-08-03 16:57                 ` David Laight
  2020-08-03 17:00                   ` Madhavan T. Venkataraman
  0 siblings, 1 reply; 49+ messages in thread
From: David Laight @ 2020-08-03 16:57 UTC (permalink / raw)
  To: 'Madhavan T. Venkataraman', 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML

From: Madhavan T. Venkataraman
> Sent: 03 August 2020 17:03
> 
> On 8/3/20 3:27 AM, David Laight wrote:
> > From: Mark Rutland
> >> Sent: 31 July 2020 19:32
> > ...
> >>> It requires PC-relative data references. I have not worked on all architectures.
> >>> So, I need to study this. But do all ISAs support PC-relative data references?
> >> Not all do, but pretty much any recent ISA will as it's a practical
> >> necessity for fast position-independent code.
> > i386 has neither PC-relative addressing nor moves from %pc.
> > The cpu architecture knows that the sequence:
> > 	call	1f
> > 1:	pop	%reg
> > is used to get the %pc value so is treated specially so that
> > it doesn't 'trash' the return stack.
> >
> > So PIC code isn't too bad, but you have to use the correct
> > sequence.
> 
> Is that true for 32-bit systems only? I thought RIP-relative addressing was
> introduced in 64-bit mode. Please confirm.

I said i386 not amd64 or x86-64.

So yes, 64bit code has PC-relative addressing.
But I'm pretty sure it has no other way to get the PC itself
except using call - certainly nothing in the 'usual' instructions.

	David


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:09   ` Mark Rutland
  2020-07-31 20:08     ` Madhavan T. Venkataraman
@ 2020-08-03 16:57     ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 16:57 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86

Responses inline..

On 7/31/20 1:09 PM, Mark Rutland wrote:
> Hi,
>
> On Tue, Jul 28, 2020 at 08:10:46AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>> Trampoline code is placed either in a data page or in a stack page. In
>> order to execute a trampoline, the page it resides in needs to be mapped
>> with execute permissions. Writable pages with execute permissions provide
>> an attack surface for hackers. Attackers can use this to inject malicious
>> code, modify existing code or do other harm.
> For the purpose of below, IIUC this assumes the adversary has an
> arbitrary write.
>
>> To mitigate this, LSMs such as SELinux may not allow pages to have both
>> write and execute permissions. This prevents trampolines from executing
>> and blocks applications that use trampolines. To allow genuine applications
>> to run, exceptions have to be made for them (by setting execmem, etc).
>> In this case, the attack surface is just the pages of such applications.
>>
>> An application that is not allowed to have writable executable pages
>> may try to load trampoline code into a file and map the file with execute
>> permissions. In this case, the attack surface is just the buffer that
>> contains trampoline code. However, a successful exploit may provide the
>> hacker with means to load his own code in a file, map it and execute it.
> It's not clear to me what power the adversary is assumed to have here,
> and consequently it's not clear to me how the proposal mitigates this.
>
> For example, if the attack can control the arguments to syscalls, and
> has an arbitrary write as above, what prevents them from creating a
> trampfd of their own?

That is the point. If a process is allowed to have pages that are both
writable and executable, a hacker can exploit some vulnerability such
as buffer overflow to write his own code into a page and somehow
contrive to execute that.

So, the context is - if security settings in a system disallow a page to have
both write and execute permissions, how do you allow the execution of
genuine trampolines that are runtime generated and placed in a data
page or a stack page?
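
Whether a given system permits such a mapping can be probed directly.
The following is a minimal sketch, not part of the patch set. The
function name wx_mapping_allowed() is made up for illustration; it only
uses the standard mmap()/munmap() calls and asks for an anonymous page
that is both writable and executable.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: returns 1 if an anonymous mapping with both
 * write and execute permissions could be created, 0 if the request
 * was refused (for instance by a security policy).
 */
int wx_mapping_allowed(void)
{
    size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return 0;

    /* The page was only a probe; release it again. */
    munmap(p, len);
    return 1;
}
```

On a system where the process is denied writable executable memory,
the probe is expected to return 0, and that is the situation in which
a runtime generated trampoline in a data or stack page cannot run.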

trampfd tries to address that. So, trampfd is not a measure that increases
the security of a system or mitigates a security problem. It is a framework
to allow safe forms of dynamic code to execute when security settings
will block them otherwise.

>
> [...]
>
>> GCC has traditionally used trampolines for implementing nested
>> functions. The trampoline is placed on the user stack. So, the stack
>> needs to be executable.
> IIUC generally nested functions are avoided these days, specifically to
> prevent the creation of gadgets on the stack. So I don't think those are
> relevant as a case to care about. Applications using them should move
> to not using them, and would be more secure generally for doing so.

Could not agree with you more.
>
> [...]
>
>> Trampoline File Descriptor (trampfd)
>> --------------------------
>>
>> I am proposing a kernel API using anonymous file descriptors that
>> can be used to create and execute trampolines with the help of the
>> kernel. In this solution also, the kernel does the work of the trampoline.
> What's the rationale for the kernel emulating the trampoline here?
>
> In the case of EMUTRAMP this was necessary to work with existing
> application binaries and kernel ABIs which placed instructions onto the
> stack, and the stack needed to remain RW for other reasons. That
> restriction doesn't apply here.

In addition to the stack, EMUTRAMP also allows the emulation
of the same well-known trampolines placed in a non-stack data page.
For instance, libffi closures embed a trampoline in a closure structure.
That gets executed when the caller of libffi invokes it.

The goal of EMUTRAMP is to allow safe trampolines to execute when
security settings disallow their execution. Mainly, it permits applications
that use libffi to run. A lot of applications use libffi.

They chose the emulation method so that no changes need to be made
to application code to use them. But the EMUTRAMP implementors note
in their description that the real solution to the problem is a kernel
API that is backed by a safe code generator.

trampfd is an attempt to define such an API. This is just a starting point.
I realize that we need to have a lot of discussion to refine the approach.

> Assuming trampfd creation is somehow authenticated, the code could be
> placed in a r-x page (which the kernel could refuse to add write
> permission), in order to prevent modification. If that's sufficient,
> it's not much of a leap to allow userspace to generate the code.

IIUC, you are suggesting that the user hands the kernel a code fragment
and requests it to be placed in an r-x page, correct? However, the
kernel cannot trust any code given to it by the user. Nor can it scan any
piece of code and reliably decide if it is safe or not.

So, the problem of executing dynamic code when security settings are
restrictive cannot be solved in userland. The only option I can think of is
to have the kernel provide support for dynamic code. It must have one
or more safe, trusted code generation components and an API to use
the components.

My goal is to introduce an API and start off by supporting simple, regular
trampolines that are widely used. Then, evolve the feature over a period
of time to include other forms of dynamic code such as JIT code.

>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
>>
>> In this case, the attack surface is the context buffer. A hacker may
>> attack an application with a vulnerability and may be able to modify the
>> context buffer. So, when the register or stack context is set for
>> a trampoline, the values may have been tampered with. From an attack
>> surface perspective, this is similar to Trampoline Emulation. But
>> with trampfd, user code can retrieve a trampoline's context from the
>> kernel and add defensive checks to see if the context has been
>> tampered with.
> Can you elaborate on this: what sort of checks would be applied, and
> how?

So, an application that uses trampfd would do the following steps:

1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4. Invoke the trampoline using the address

Let us say that the application has a vulnerability such as buffer overflow
that allows a hacker to modify the data that is used to do step 2.

Potentially, a hacker could modify the following things:
    - register values specified in the register context
    - values specified in the stack context
    - the target PC specified in the register context

When the trampoline is invoked in step 4, the kernel will gain control,
load the registers, push stuff on the stack and transfer control to the target
PC. Whatever the hacker had modified in step 2 will take effect in step 4.
His values will get loaded and his PC is the one that will get control.
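
What the kernel does at invocation time can be modelled in a few lines.
This is only a toy sketch under stated assumptions: the toy_cpu and
toy_tramp structures and toy_invoke() are invented for illustration and
are not the data structures of the patch. The toy stack grows upward
to keep the index arithmetic simple.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NREGS 4    /* hypothetical: a tiny register file */
#define TOY_STACK 16   /* hypothetical: stack depth in 64-bit slots */

/* Toy model of the user register state at the time of the fault. */
struct toy_cpu {
    uint64_t regs[TOY_NREGS];
    uint64_t stack[TOY_STACK];
    size_t   sp;               /* index of the next free stack slot */
    uint64_t pc;
};

/* Toy model of the context stored in the trampoline object. */
struct toy_tramp {
    uint64_t reg_values[TOY_NREGS];
    uint8_t  reg_set[TOY_NREGS];   /* 1 if this register is in the context */
    uint64_t stack_values[TOY_STACK];
    size_t   nstack;               /* number of values to push */
    uint64_t target_pc;
};

/*
 * Apply the trampoline context to the user state: load the registers
 * named in the register context, push the stack context, then set the
 * PC to the target PC. Returns 0 on success, -1 if the stack context
 * does not fit on the toy stack.
 */
int toy_invoke(struct toy_cpu *cpu, const struct toy_tramp *t)
{
    size_t i;

    if (cpu->sp > TOY_STACK || t->nstack > TOY_STACK - cpu->sp)
        return -1;

    for (i = 0; i < TOY_NREGS; i++)
        if (t->reg_set[i])
            cpu->regs[i] = t->reg_values[i];

    for (i = 0; i < t->nstack; i++)
        cpu->stack[cpu->sp++] = t->stack_values[i];

    cpu->pc = t->target_pc;
    return 0;
}
```

Nothing in toy_invoke() questions the values it is given, which is the
point made above: whatever was stored in step 2 is what gets loaded.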

A paranoid application could add a step to this sequence. So, the steps
would be:

1. Create a trampoline by calling trampfd_create()
2. Set the register and/or stack contexts for the trampoline.
3. mmap() the trampoline to get an address
4a. Retrieve the register and stack context for the trampoline from the
      kernel and check if anything has been altered. If yes, abort.
4b. Invoke the trampoline using the address

The check that I mentioned will be in step 4a. Now, the hacker has to
hack both step 2 and step 4a to let his stuff take effect. That is far
less likely to succeed because there needs to exist a vulnerability in
both places.
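
Step 4a can be sketched as follows. This is a toy model, not the
trampfd API: the kernel-held trampoline object is stood in for by an
ordinary structure, and all toy_* names are invented for illustration.
The sketch also assumes that the application has a trusted reference
copy of the intended context (for example in read-only data) to
compare against.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NREGS 4   /* hypothetical: a tiny register context */

struct toy_context {
    uint64_t regs[TOY_NREGS];
    uint64_t target_pc;
};

/* Stand-in for the trampoline object that the kernel would hold. */
struct toy_trampoline {
    struct toy_context ctx;
};

/* Step 2: hand a context to the "kernel", which keeps its own copy. */
void toy_set_context(struct toy_trampoline *t, const struct toy_context *c)
{
    t->ctx = *c;
}

/* Step 4a, first half: read back the copy the "kernel" holds. */
void toy_get_context(const struct toy_trampoline *t, struct toy_context *out)
{
    *out = t->ctx;
}

/*
 * Step 4a, second half: compare the copy read back with the trusted
 * reference. Returns 1 if they match, 0 if anything was altered.
 */
int toy_context_intact(const struct toy_trampoline *t,
                       const struct toy_context *expected)
{
    struct toy_context actual;
    size_t i;

    toy_get_context(t, &actual);
    for (i = 0; i < TOY_NREGS; i++)
        if (actual.regs[i] != expected->regs[i])
            return 0;
    return actual.target_pc == expected->target_pc;
}
```

A caller would invoke the trampoline (step 4b) only when
toy_context_intact() returns 1, and abort otherwise.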

> Why is this not possible in a r-x user page?

This is answered above.
>
> [...]
>
>> - trampfd provides a basic framework. In the future, new trampoline types
>>   can be implemented, new contexts can be defined, and additional rules
>>   can be implemented for security purposes.
> From a kernel developer perspective, this reads as "this ABI will become
> more complex", which I think is worrisome.

I hear you. My goal from the beginning is to not have the kernel deal
with ABI issues. ABI handling is best left to userland (except in cases
like signal handlers where the kernel does have to deal with it).

In the libffi changes, this is certainly true. The kernel only helps with
the trampoline that passes control to the ABI handler. The ABI handler
itself is part of libffi.

> I'm also worried that this is liable to have nasty interaction with HW
> CFI mechanisms (e.g. PAC+BTI on arm64) either now or in future, and that
> we bake incompatibility into ABI.

I will study CFI and then answer this question. So, bear with me.

>> - For instance, trampfd defines an "Allowed PCs" context in this initial
>>   work. As an example, libffi can create a read-only array of all ABI
>>   handlers for an architecture at build time. This array can be used to
>>   set the list of allowed PCs for a trampoline. This will mean that a hacker
>>   cannot hack the PC part of the register context and make it point to
>>   arbitrary locations.
> I'm not exactly sure what's meant here. Do you mean that this prevents
> userspace from branching into the middle of a trampoline, or that the
> trampfd code prevents where the trampoline itself can branch to?
>
> Both x86 and arm64 have upcoming HW CFI (CET and BTI) to deal with the
> former, and I believe the latter can also be implemented in userspace
> with defensive checks in the trampolines, provided that they are
> protected read-only.

So, I mentioned before that a hacker can potentially alter the target
PC that a trampoline finally jumps to.

If a process were allowed to have pages with both write and execute
permissions, a hacker could load his own code in one of those pages and
point the PC to that.

In the context of trampfd, we are talking about the case where a process is
not permitted to have both write and execute permissions. In this case,
the hacker cannot load his own code anywhere and hope to execute it.
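But a hacker can point the PC to some arbitrary place such as return
from glibc. The "Allowed PCs" context is meant to close that hole by
restricting the target PC to a fixed list.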

>
>> - An SELinux setting called "exectramp" can be implemented along the
>>   lines of "execmem", "execstack" and "execheap" to selectively allow the
>>   use of trampolines on a per application basis.
>>
>> - User code can add defensive checks in the code before invoking a
>>   trampoline to make sure that a hacker has not modified the context data.
>>   It can do this by getting the trampoline context from the kernel and
>>   double checking it.
> As above, without examples it's not clear to me what sort of checks are
> possible nor where they would need to be made. So it's difficult to see
> whether that's actually possible or subject to TOCTTOU races and
> similar.

I have explained this above. If there are any further questions on that,
please let me know.

>
>> - In the future, if the kernel can be enhanced to use a safe code
>>   generation component, that code can be placed in the trampoline mapping
>>   pages. Then, the trampoline invocation does not have to incur a trip
>>   into the kernel.
>>
>> - Also, if the kernel can be enhanced to use a safe code generation
>>   component, other forms of dynamic code such as JIT code can be
>>   addressed by the trampfd framework.
> I don't see why it's necessary for the kernel to generate code at all.
> If the trampfd creation requests can be trusted, what prevents trusting
> a sealed set of instructions generated in userspace?

Let us consider a system in which:
    - a process is not permitted to have pages with both write and execute
    - a process is not permitted to map any file as executable unless it
      is properly signed. In other words, cryptographically verified.

Then, the process cannot execute any code that is runtime generated.
That includes trampolines. Only trampoline code that is part of program
text at build time would be permitted to execute.

In this scenario, trampfd requests are coming from signed code. So, they
are trusted by the kernel. But trampoline code could be dynamically generated.
The kernel will not trust it.

>> - Trampolines can be shared across processes which can give rise to
>>   interesting uses in the future.
> This sounds like the use-case of a sealed memfd. Is a sealed executable
> memfd not sufficient?

I will answer this in a separate email.

Thanks.

Madhavan


* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-03 16:57                 ` David Laight
@ 2020-08-03 17:00                   ` Madhavan T. Venkataraman
  0 siblings, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 17:00 UTC (permalink / raw)
  To: David Laight, 'Mark Rutland'
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML



On 8/3/20 11:57 AM, David Laight wrote:
> From: Madhavan T. Venkataraman
>> Sent: 03 August 2020 17:03
>>
>> On 8/3/20 3:27 AM, David Laight wrote:
>>> From: Mark Rutland
>>>> Sent: 31 July 2020 19:32
>>> ...
>>>>> It requires PC-relative data references. I have not worked on all architectures.
>>>>> So, I need to study this. But do all ISAs support PC-relative data references?
>>>> Not all do, but pretty much any recent ISA will as it's a practical
>>>> necessity for fast position-independent code.
>>> i386 has neither PC-relative addressing nor moves from %pc.
>>> The cpu architecture knows that the sequence:
>>> 	call	1f
>>> 1:	pop	%reg
>>> is used to get the %pc value so is treated specially so that
>>> it doesn't 'trash' the return stack.
>>>
>>> So PIC code isn't too bad, but you have to use the correct
>>> sequence.
>> Is that true for 32-bit systems only? I thought RIP-relative addressing was
>> introduced in 64-bit mode. Please confirm.
> I said i386 not amd64 or x86-64.

I am sorry. My bad.

>
> So yes, 64bit code has PC-relative addressing.
> But I'm pretty sure it has no other way to get the PC itself
> except using call - certainly nothing in the 'usual' instructions.

OK.

Madhavan


* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-07-31 18:31           ` Mark Rutland
  2020-08-03  8:27             ` David Laight
@ 2020-08-03 17:58             ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 17:58 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Andy Lutomirski, Kernel Hardening, Linux API, linux-arm-kernel,
	Linux FS Devel, linux-integrity, LKML, LSM List, Oleg Nesterov,
	X86 ML



On 7/31/20 1:31 PM, Mark Rutland wrote:
> On Fri, Jul 31, 2020 at 12:13:49PM -0500, Madhavan T. Venkataraman wrote:
>> On 7/30/20 3:54 PM, Andy Lutomirski wrote:
>>> On Thu, Jul 30, 2020 at 7:24 AM Madhavan T. Venkataraman
>>> <madvenka@linux.microsoft.com> wrote:
>> Dealing with multiple architectures
>> -----------------------------------------------
>>
>> One good reason to use trampfd is multiple architecture support. The
>> trampoline table in a code page approach is neat. I don't deny that at
>> all. But my question is - can it be used in all cases?
>>
>> It requires PC-relative data references. I have not worked on all architectures.
>> So, I need to study this. But do all ISAs support PC-relative data references?
> Not all do, but pretty much any recent ISA will as it's a practical
> necessity for fast position-independent code.

So, two questions:

1. IIUC, for position independent code, we need PC-relative control transfers. I know that
    PC-relative control transfers are kinda fundamental. So, I expect most architectures
    support it. But to implement the trampoline table suggestion, we need PC-relative
    data references. Like:

    movq    X(%rip), %rax

2. Do you know which architectures do not support PC-relative data references? I am
    going to study this. But if you have some information, I would appreciate it.

In any case, I think we should support all of the architectures on which Linux currently
runs even if they are legacy.

>
>> Even in an ISA that supports it, there would be a maximum supported offset
>> from the current PC that can be reached for a data reference. That maximum
>> needs to be at least the size of a base page in the architecture. This is because
>> the code page and the data page need to be separate for security reasons.
>> Do all ISAs support a sufficiently large offset?
> ISAs with PC-relative addressing can usually generate PC-relative
> addresses into a GPR, from which they can apply an arbitrarily large
> offset.

I will study this. I need to nail down the list of architectures that cannot do this.
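
For reference, the offset question can be written down as plain
arithmetic. The sketch below assumes one common layout for a
trampoline table, in which the data page directly follows the code
page and each trampoline's data sits at the same offset within its
page. That layout and the toy_* names are assumptions made for
illustration, not something defined by this thread.

```c
#include <stdint.h>

/*
 * Address of the data slot that belongs to the trampoline at
 * code_addr, assuming the data page directly follows the code page.
 */
uint64_t toy_data_slot(uint64_t code_addr, uint64_t page_size)
{
    return code_addr + page_size;
}

/*
 * A single PC-relative data reference can reach the slot only if the
 * largest displacement the instruction can encode (max_disp, in
 * bytes) is at least one base page. Returns 1 if it can, 0 if not.
 */
int toy_offset_reachable(uint64_t page_size, uint64_t max_disp)
{
    return max_disp >= page_size;
}
```

As one data point, an x86-64 RIP-relative operand such as X(%rip)
carries a signed 32-bit displacement, so toy_offset_reachable(4096,
INT32_MAX) is true there. An ISA that can only form the address in a
GPR first would add the page offset in a separate instruction, as
described in the quoted reply.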

>
>> When the kernel generates the code for a trampoline, it can hard code data values
>> in the generated code itself so it does not need PC-relative data referencing.
>>
>> And, for ISAs that do support the large offset, we do have to implement and
>> maintain the code page stuff for different ISAs for each application and library
>> if we did not use trampfd.
> Trampoline code is architecture specific today, so I don't see that as a
> major issue. Common structural bits can probably be shared even if the
> specific machine code cannot.

True. But an implementor may prefer a standard mechanism provided by
the kernel so all of his architectures can be supported easily with less
effort.

If you look at the libffi reference patch I have included, the architecture
specific changes to use trampfd just involve a single C function call to
a common code function.

So, from the point of view of adoption, IMHO, the kernel provided method
is preferable.

>
> [...]
>
>> Security
>> -----------
>>
>> With the user level trampoline table approach, the data part of the trampoline table
>> can be hacked by an attacker if an application has a vulnerability. Specifically, the
>> target PC can be altered to some arbitrary location. Trampfd implements an
>> "Allowed PCS" context. In the libffi changes, I have created a read-only array of
>> all ABI handlers used in closures for each architecture. This read-only array
>> can be used to restrict the PC values for libffi trampolines to prevent hacking.
>>
>> To generalize, we can implement security rules/features if the trampoline
>> object is in the kernel.
> I don't follow this argument. If it's possible to statically define that
> in the kernel, it's also possible to do that in userspace without any
> new kernel support.

It is not statically defined in the kernel.

Let us take the libffi example. In the 64-bit X86 arch code, there are 3
ABI handlers:

    ffi_closure_unix64_sse
    ffi_closure_unix64
    ffi_closure_win64

I could create an "Allowed PCs" context like this:

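/* Header followed by the three allowed target PCs. */
struct my_allowed_pcs {
    struct trampfd_values    pcs;
    __u64                    pc_values[3];
};

/* const places the table in read-only data, so it cannot be overwritten. */
const struct my_allowed_pcs    my_allowed_pcs = {
    { 3, 0 },
    {
        (uintptr_t) ffi_closure_unix64_sse,
        (uintptr_t) ffi_closure_unix64,
        (uintptr_t) ffi_closure_win64,
    },
};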

I have created a read-only array of allowed ABI handlers that closures use.

When I set up the context for a closure trampoline, I could do this:

    pwrite(trampfd, &my_allowed_pcs, sizeof(my_allowed_pcs), TRAMPFD_ALLOWED_PCS_OFFSET);
   
This copies the array into the trampoline object in the kernel.
When the register context is set for the trampoline, the kernel checks
the PC register value against allowed PCs.

Because my_allowed_pcs is read-only, a hacker cannot modify it. So, the only
permitted target PCs enforced by the kernel are the ABI handlers.
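
The check itself is simple. The following is a minimal sketch of the
idea, assuming the allowed PCs are held as a flat array of 64-bit
values; toy_pc_allowed() and toy_set_pc() are invented names and do
not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if pc is one of the n allowed values, 0 otherwise. */
int toy_pc_allowed(const uint64_t *allowed, size_t n, uint64_t pc)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (allowed[i] == pc)
            return 1;
    return 0;
}

/*
 * Model of setting the PC in a register context: the request is
 * refused with -1 and *target_pc is left untouched unless the
 * requested PC is on the allowed list.
 */
int toy_set_pc(const uint64_t *allowed, size_t n,
               uint64_t requested_pc, uint64_t *target_pc)
{
    if (!toy_pc_allowed(allowed, n, requested_pc))
        return -1;
    *target_pc = requested_pc;
    return 0;
}
```

With the three ABI handlers as the allowed list, a register context
whose PC has been redirected elsewhere would be rejected at the time
it is set, rather than taking effect when the trampoline is invoked.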
>
> [...]
>
>> Trampfd is a framework that can be used to implement multiple things. Maybe
>> a few of those things can also be implemented in user land itself. But I think having
>> just one mechanism to execute dynamic code objects is preferable to having
>> multiple mechanisms not standardized across all applications.
> In abstract, having a common interface sounds nice, but in practice
> elements of this are always architecture-specific (e.g. interactions
> with HW CFI), and that common interface can result in more pain as it
> doesn't fit naturally into the context that ISAs were designed for (e.g. 
> where control-flow instructions are extended with new semantics).

In the case of trampfd, the code generation is indeed architecture
specific. But that is in the kernel. The application is not affected by it.

Again, referring to the libffi reference patch, I have defined wrapper
functions for trampfd in common code. The architecture specific code
in libffi only calls the set_context function defined in common code.
Even this is required only because register names are specific to each
architecture and the target PC (to the ABI handler) is specific to
each architecture-ABI combo.

> It also means that you can't share the rough approach across OSs which
> do not implement an identical mechanism, so for code abstracting by ISA
> first, then by platform/ABI, there isn't much saving.

Why can you not share the same approach across OSes? In fact,
I have tried to design it so that other OSes can use the same
mechanism.

The only thing is that I have defined the API to be based on a file
descriptor since that is what is generally preferred by the Linux community
for a new API. If I were to implement it as a regular system call, the same
system call can be implemented in other OSes as well.

Thanks.

Madhavan


* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
  2020-08-02 20:00       ` Andy Lutomirski
  2020-08-02 22:58         ` Madhavan T. Venkataraman
@ 2020-08-03 18:36         ` Madhavan T. Venkataraman
  1 sibling, 0 replies; 49+ messages in thread
From: Madhavan T. Venkataraman @ 2020-08-03 18:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
	linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML



On 8/2/20 3:00 PM, Andy Lutomirski wrote:
> I feel like trampfd is too poorly defined at this point to evaluate.

Point taken. It is because I wanted to start with something small
and specific and expand it in the future. So, I did not really describe the big
picture - the overall vision, future work, that sort of thing. In retrospect,
maybe I should have done that.

I will take all of the input I have received so far and all of the responses
I have given, refine the definition of trampfd and send it out. Please
review that and let me know if anything is still missing from the
definition.

Thanks.

Madhavan



Thread overview: 49+ messages
     [not found] <aefc85852ea518982e74b233e11e16d2e707bc32>
2020-07-28 13:10 ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor madvenka
2020-07-28 13:10   ` [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API madvenka
2020-07-28 14:50     ` Oleg Nesterov
2020-07-28 14:58       ` Madhavan T. Venkataraman
2020-07-28 16:06         ` Oleg Nesterov
2020-07-28 13:10   ` [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor madvenka
2020-07-30  9:06     ` Greg KH
2020-07-30 14:25       ` Madhavan T. Venkataraman
2020-07-28 13:10   ` [PATCH v1 3/4] [RFC] arm64/trampfd: " madvenka
2020-07-28 13:10   ` [PATCH v1 4/4] [RFC] arm/trampfd: " madvenka
2020-07-28 15:13   ` [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor David Laight
2020-07-28 16:32     ` Madhavan T. Venkataraman
2020-07-28 17:16       ` Andy Lutomirski
2020-07-28 17:39         ` Madhavan T. Venkataraman
2020-07-29  5:16           ` Andy Lutomirski
2020-07-28 18:52         ` Madhavan T. Venkataraman
2020-07-29  8:36           ` David Laight
2020-07-29 17:55             ` Madhavan T. Venkataraman
2020-07-28 16:05   ` Casey Schaufler
2020-07-28 16:49     ` Madhavan T. Venkataraman
2020-07-28 17:05     ` James Morris
2020-07-28 17:08       ` Madhavan T. Venkataraman
2020-07-28 17:31   ` Andy Lutomirski
2020-07-28 19:01     ` Madhavan T. Venkataraman
2020-07-29 13:29     ` Florian Weimer
2020-07-30 13:09     ` David Laight
2020-08-02 11:56       ` Pavel Machek
2020-08-03  8:08         ` David Laight
2020-08-03 15:57           ` Madhavan T. Venkataraman
2020-07-30 14:24     ` Madhavan T. Venkataraman
2020-07-30 20:54       ` Andy Lutomirski
2020-07-31 17:13         ` Madhavan T. Venkataraman
2020-07-31 18:31           ` Mark Rutland
2020-08-03  8:27             ` David Laight
2020-08-03 16:03               ` Madhavan T. Venkataraman
2020-08-03 16:57                 ` David Laight
2020-08-03 17:00                   ` Madhavan T. Venkataraman
2020-08-03 17:58             ` Madhavan T. Venkataraman
2020-08-02 13:57           ` Florian Weimer
2020-07-30 14:42     ` Madhavan T. Venkataraman
2020-08-02 18:54     ` Madhavan T. Venkataraman
2020-08-02 20:00       ` Andy Lutomirski
2020-08-02 22:58         ` Madhavan T. Venkataraman
2020-08-03 18:36         ` Madhavan T. Venkataraman
2020-08-03  8:23       ` David Laight
2020-08-03 15:59         ` Madhavan T. Venkataraman
2020-07-31 18:09   ` Mark Rutland
2020-07-31 20:08     ` Madhavan T. Venkataraman
2020-08-03 16:57     ` Madhavan T. Venkataraman
