* [RFC PATCH v1 0/8] Introduce mseal() syscall
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

This patchset proposes a new mseal() syscall for the Linux kernel. 

Modern CPUs support memory permissions such as the RW and NX bits. Linux
has supported NX since kernel version 2.6.8, released in August 2004 [1].
Memory permissions improve the security stance against memory corruption
bugs: an attacker can't simply write to arbitrary memory and point the
code to it; the memory has to be marked with the X bit, or else an
exception will happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.

The new mseal() is an architecture-independent syscall with the
following signature:

mseal(void *addr, size_t len, unsigned int types, unsigned int flags)

addr/len: the memory range. It must be contiguous, allocated memory, or
else mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to the comments in mseal.c; they are also fully
covered by the selftest.

types: a bit mask specifying which syscalls to seal; currently:
MM_SEAL_MSEAL 0x1
MM_SEAL_MPROTECT 0x2
MM_SEAL_MUNMAP 0x4
MM_SEAL_MMAP 0x8
MM_SEAL_MREMAP 0x10

Each bit represents sealing against one specific syscall, e.g.
MM_SEAL_MPROTECT will deny the mprotect syscall. A bit mask was chosen
so that the API is extensible, i.e. when needed, sealing can be extended
to madvise, mlock, etc. Backward compatibility is also easy.

The kernel will remember which seal types have been applied, so the
application doesn't need to repeat all existing seal types in the next
mseal(). Once a seal type is applied, it can't be unsealed. Calling
mseal() with an already-applied seal type is a no-op, not a failure.

MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
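
For illustration, here is a minimal userspace sketch of the intended
usage (not part of this series; it assumes __NR_mseal and the MM_SEAL_*
constants are available, and the helper name is hypothetical):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Seal a mapping step by step, then lock the seal set itself. */
static int seal_region(void *addr, size_t len)
{
	/* deny future mprotect()/munmap() on [addr, addr + len) */
	if (syscall(__NR_mseal, addr, len,
		    MM_SEAL_MPROTECT | MM_SEAL_MUNMAP, 0))
		return -1;

	/* a second call adds to, rather than replaces, the seals */
	return syscall(__NR_mseal, addr, len, MM_SEAL_MSEAL, 0);
}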

Internally, vm_area_struct gains a new field, vm_seals, to store the
bit mask.

For the affected syscalls, such as mprotect, a sealing check
(can_modify_mm) is added, usually at an early point in the syscall,
before any update is made to the VMAs. The effect is: if any of the VMAs
in the given address range fails the sealing check, none of the VMAs
will be updated. It might be worth noting that this differs from the
rest of mprotect(), where some updates can happen even when mprotect()
returns failure. Since can_modify_mm only checks vm_seals in
vm_area_struct, and doesn't go deeper into the page tables or update any
hardware state, success-or-none behavior might fit better here. I would
like to hear the community's feedback on this.
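
To make the success-or-none behavior concrete, here is a small sketch
(page0/page1 are hypothetical adjacent mapped pages; only page1 is
sealed against mprotect):

/*
 * The single mprotect() spanning both pages fails with EACCES, and
 * page0's protection is also left unchanged, because the sealing
 * check runs before any VMA is updated.
 */
syscall(__NR_mseal, page1, page_size, MM_SEAL_MPROTECT, 0);
if (mprotect(page0, 2 * page_size, PROT_READ | PROT_WRITE))
	assert(errno == EACCES);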

The idea that inspired this patch comes from Stephen Röttger's work on
V8 CFI [5]. The Chrome browser on ChromeOS will be the first user of
this API.

In addition, Stephen is working on a glibc change to add sealing support
to the dynamic linker, which will seal all non-writable segments at
startup. When that work is completed, all applications will
automatically benefit from these new protections.

[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc

Jeff Xu (8):
  Add mseal syscall
  Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: seal mprotect
  mseal munmap
  mseal mremap
  mseal mmap
  selftest mm/mseal mprotect/munmap/mremap/mmap

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 fs/aio.c                                    |    5 +-
 include/linux/mm.h                          |   55 +-
 include/linux/mm_types.h                    |    7 +
 include/linux/syscalls.h                    |    2 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 include/uapi/linux/mman.h                   |    6 +
 ipc/shm.c                                   |    3 +-
 kernel/sys_ni.c                             |    1 +
 mm/Kconfig                                  |    8 +
 mm/Makefile                                 |    1 +
 mm/internal.h                               |    4 +-
 mm/mmap.c                                   |   49 +-
 mm/mprotect.c                               |    6 +
 mm/mremap.c                                 |   19 +-
 mm/mseal.c                                  |  328 +++++
 mm/nommu.c                                  |    6 +-
 mm/util.c                                   |    8 +-
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1934 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 1/8] Add mseal syscall
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

mseal() prevents system calls from modifying the metadata of virtual
address ranges.

Five syscalls can be sealed, as specified by bitmasks:
MM_SEAL_MPROTECT: Deny mprotect(2)/pkey_mprotect(2).
MM_SEAL_MUNMAP: Deny munmap(2).
MM_SEAL_MMAP: Deny mmap(2).
MM_SEAL_MREMAP: Deny mremap(2).
MM_SEAL_MSEAL: Deny adding a new seal type.

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 include/linux/mm.h        |  14 ++
 include/linux/mm_types.h  |   7 +
 include/linux/syscalls.h  |   2 +
 include/uapi/linux/mman.h |   6 +
 kernel/sys_ni.c           |   1 +
 mm/Kconfig                |   8 ++
 mm/Makefile               |   1 +
 mm/mmap.c                 |  14 ++
 mm/mseal.c                | 268 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 321 insertions(+)
 create mode 100644 mm/mseal.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 53efddc4d178..e790b91a0cd4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -257,6 +257,20 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif
 
+/*
+ * vm_seals in vm_area_struct, see mm_types.h.
+ */
+#define VM_SEAL_NONE		0x00000000
+#define VM_SEAL_MSEAL		0x00000001
+#define VM_SEAL_MPROTECT	0x00000002
+#define VM_SEAL_MUNMAP		0x00000004
+#define VM_SEAL_MREMAP		0x00000008
+#define VM_SEAL_MMAP		0x00000010
+
+#define VM_SEAL_ALL                                                            \
+	(VM_SEAL_MSEAL | VM_SEAL_MPROTECT | VM_SEAL_MUNMAP | VM_SEAL_MMAP |    \
+	 VM_SEAL_MREMAP)
+
 /*
  * vm_flags in vm_area_struct, see mm_types.h.
  * When changing, update also include/trace/events/mmflags.h
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36c5b43999e6..17d80f5a73dc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -660,6 +660,13 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_MSEAL
+	/*
+	 * bit masks for seal.
+	 * need this since vm_flags is full.
+	 */
+	unsigned long vm_seals;		/* seal flags, see mm.h. */
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_SCHED_MM_CID
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index c0cb22cd607d..f574c7dbee76 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
+			  unsigned int flags);
 asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				unsigned long mode,
 				const unsigned long __user *nmask,
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index a246e11988d5..d7882b5984ce 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -55,4 +55,10 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+#define MM_SEAL_MSEAL		0x1
+#define MM_SEAL_MPROTECT	0x2
+#define MM_SEAL_MUNMAP		0x4
+#define MM_SEAL_MMAP		0x8
+#define MM_SEAL_MREMAP		0x10
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 781de7cc6a4e..06fabf379e33 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -192,6 +192,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(mseal);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..db8a567cb4d3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1258,6 +1258,14 @@ config LOCK_MM_AND_FIND_VMA
 	bool
 	depends on !STACK_GROWSUP
 
+config MSEAL
+	default n
+	bool "Enable mseal() system call"
+	depends on MMU
+	help
+	  Enable the mseal() system call. It makes a memory area's metadata
+	  immutable to selected system calls, e.g. mprotect(), munmap(), mremap(), mmap().
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ec65984e2ade..643d8518dac0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,6 +120,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_MSEAL) += mseal.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/mmap.c b/mm/mmap.c
index 514ced13c65c..9b6c477e713e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -730,6 +730,20 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma,
 		return false;
 	if (!anon_vma_name_eq(anon_vma_name(vma), anon_name))
 		return false;
+#ifdef CONFIG_MSEAL
+	/*
+	 * If a VMA is sealed, it won't be merged with another VMA.
+	 * This might be useful for diagnosis, i.e. the boundary used
+	 * in the mseal() call will be preserved.
+	 * Many mseal() calls could create many small VMAs (more
+	 * fragmentation). Considering mseal() usually comes with a
+	 * careful memory layout design by the application, this
+	 * might not be an issue in the real world.
+	 * Though, we could add merging support later if needed.
+	 */
+	if (vma->vm_seals & VM_SEAL_ALL)
+		return false;
+#endif
 	return true;
 }
 
diff --git a/mm/mseal.c b/mm/mseal.c
new file mode 100644
index 000000000000..615b6e06ab44
--- /dev/null
+++ b/mm/mseal.c
@@ -0,0 +1,268 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  Implement mseal() syscall.
+ *
+ *  Copyright (c) 2023 Google, Inc.
+ *
+ *  Author: Jeff Xu <jeffxu@google.com>
+ */
+
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+/*
+ * MM_SEAL_ALL is all supported flags in mseal().
+ */
+#define MM_SEAL_ALL ( \
+	MM_SEAL_MSEAL | \
+	MM_SEAL_MPROTECT | \
+	MM_SEAL_MUNMAP | \
+	MM_SEAL_MMAP | \
+	MM_SEAL_MREMAP)
+
+static bool can_do_mseal(unsigned int types, unsigned int flags)
+{
+	/* check types is a valid bitmap */
+	if (types & ~MM_SEAL_ALL)
+		return false;
+
+	/* flags isn't used for now */
+	if (flags)
+		return false;
+
+	return true;
+}
+
+/*
+ * Check if a seal type can be added to VMA.
+ */
+static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned int new_seals)
+{
+	/* When SEAL_MSEAL is set, reject if a new type of seal is added */
+	if ((vma->vm_seals & VM_SEAL_MSEAL) &&
+	    (new_seals & ~(vma->vm_seals & VM_SEAL_ALL)))
+		return false;
+
+	return true;
+}
+
+static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, unsigned int addtypes)
+{
+	int ret = 0;
+
+	if (addtypes & ~(vma->vm_seals & VM_SEAL_ALL)) {
+		/*
+		 * Handle split at start and end.
+		 * Note: sealed VMA doesn't merge with other VMAs.
+		 */
+		if (start != vma->vm_start) {
+			ret = split_vma(vmi, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (end != vma->vm_end) {
+			ret = split_vma(vmi, vma, end, 0);
+			if (ret)
+				goto out;
+		}
+
+		vma->vm_seals |= addtypes;
+	}
+
+out:
+	*prev = vma;
+	return ret;
+}
+
+/*
+ * convert user input to internal type for seal type.
+ */
+static unsigned int convert_user_seal_type(unsigned int types)
+{
+	unsigned int newtypes = VM_SEAL_NONE;
+
+	if (types & MM_SEAL_MSEAL)
+		newtypes |= VM_SEAL_MSEAL;
+
+	if (types & MM_SEAL_MPROTECT)
+		newtypes |= VM_SEAL_MPROTECT;
+
+	if (types & MM_SEAL_MUNMAP)
+		newtypes |= VM_SEAL_MUNMAP;
+
+	if (types & MM_SEAL_MMAP)
+		newtypes |= VM_SEAL_MMAP;
+
+	if (types & MM_SEAL_MREMAP)
+		newtypes |= VM_SEAL_MREMAP;
+
+	return newtypes;
+}
+
+/*
+ * Check for do_mseal:
+ * 1> start is part of a valid vma.
+ * 2> end is part of a valid vma.
+ * 3> No gap (unallocated address) between start and end.
+ * 4> requested seal type can be added in given address range.
+ */
+static int check_mm_seal(unsigned long start, unsigned long end,
+			 unsigned int newtypes)
+{
+	struct vm_area_struct *vma;
+	unsigned long nstart = start;
+
+	VMA_ITERATOR(vmi, current->mm, start);
+
+	/* going through each vma to check */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_start > nstart)
+			/* unallocated memory found */
+			return -ENOMEM;
+
+		if (!can_add_vma_seals(vma, newtypes))
+			return -EACCES;
+
+		if (vma->vm_end >= end)
+			return 0;
+
+		nstart = vma->vm_end;
+	}
+
+	return -ENOMEM;
+}
+
+/*
+ * Apply sealing.
+ */
+static int apply_mm_seal(unsigned long start, unsigned long end,
+			 unsigned int newtypes)
+{
+	unsigned long nstart, nend;
+	struct vm_area_struct *vma, *prev = NULL;
+	struct vma_iterator vmi;
+	int error = 0;
+
+	vma_iter_init(&vmi, current->mm, start);
+	vma = vma_find(&vmi, end);
+
+	prev = vma_prev(&vmi);
+	if (start > vma->vm_start)
+		prev = vma;
+
+	nstart = start;
+
+	/* going through each vma to update */
+	for_each_vma_range(vmi, vma, end) {
+		nend = vma->vm_end;
+		if (nend > end)
+			nend = end;
+
+		error = mseal_fixup(&vmi, vma, &prev, nstart, nend, newtypes);
+		if (error)
+			break;
+
+		nstart = vma->vm_end;
+	}
+
+	return error;
+}
+
+/*
+ * mseal(2) seals the VMA's metadata against
+ * selected syscalls.
+ *
+ * addr/len: VM address range.
+ *
+ *  The address range by addr/len must meet:
+ *   start (addr) must be in a valid VMA.
+ *   end (addr + len) must be in a valid VMA.
+ *   no gap (unallocated memory) between start and end.
+ *   start (addr) must be page aligned.
+ *
+ *  len: len will be page aligned implicitly.
+ *
+ *  types: bit mask for sealed syscalls.
+ *   MM_SEAL_MPROTECT: seal mprotect(2)/pkey_mprotect(2).
+ *   MM_SEAL_MUNMAP: seal munmap(2).
+ *   MM_SEAL_MMAP: seal mmap(2).
+ *   MM_SEAL_MREMAP: seal mremap(2).
+ *   MM_SEAL_MSEAL: adding new seal type will be rejected.
+ *
+ *  flags: reserved.
+ *
+ * return values:
+ *  zero: success
+ *  -EINVAL:
+ *   invalid seal type.
+ *   invalid input flags.
+ *   addr is not page aligned.
+ *   addr + len overflow.
+ *  -ENOMEM:
+ *   addr is not a valid address (not allocated).
+ *   end (addr + len) is not a valid address.
+ *   a gap (unallocated memory) between start and end.
+ *  -EACCES:
+ *   MM_SEAL_MSEAL is set, adding a new seal is rejected.
+ *
+ * Note:
+ *  The user can call mseal(2) multiple times to add new seal types.
+ *  Adding an already-added seal type is a no-op (no error).
+ *  Adding a new seal type after MM_SEAL_MSEAL will be rejected.
+ *  unseal() or removing a seal type is not supported.
+ */
+static int do_mseal(unsigned long start, size_t len_in, unsigned int types,
+		    unsigned int flags)
+{
+	int ret = 0;
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	unsigned int newtypes;
+	size_t len;
+
+	if (!can_do_mseal(types, flags))
+		return -EINVAL;
+
+	newtypes = convert_user_seal_type(types);
+
+	start = untagged_addr(start);
+	if (!PAGE_ALIGNED(start))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len_in);
+	/* Check to see whether len was rounded up from small -ve to zero */
+	if (len_in && !len)
+		return -EINVAL;
+
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	ret = check_mm_seal(start, end, newtypes);
+	if (ret)
+		goto out;
+
+	ret = apply_mm_seal(start, end, newtypes);
+
+out:
+	mmap_write_unlock(mm);
+	return ret;
+}
+
+SYSCALL_DEFINE4(mseal, unsigned long, start, size_t, len, unsigned int, types, unsigned int,
+		flags)
+{
+	return do_mseal(start, len, types, flags);
+}
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 2/8] Wire up mseal syscall
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

Wire up mseal syscall for all architectures.

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 19 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index ad37569d0507..b5847d53102a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -492,3 +492,4 @@
 560	common	set_mempolicy_home_node		sys_ni_syscall
 561	common	cachestat			sys_cachestat
 562	common	fchmodat2			sys_fchmodat2
+563	common  mseal				sys_mseal
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index c572d6c3dee0..b50c5ca5047d 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -466,3 +466,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal				sys_mseal
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index bd77253b62e0..6a28fb91b85d 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		453
+#define __NR_compat_syscalls		454
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 78b68311ec81..1e9b3c098a8e 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -911,6 +911,8 @@ __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
+#define __NR_mseal 453
+__SYSCALL(__NR_mseal, sys_mseal)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 83d8609aec03..babe34d221ee 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -373,3 +373,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal 				sys_mseal
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 259ceb125367..27cd3f7dbd5e 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -452,3 +452,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal				sys_mseal
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index a3798c2637fd..e49861f7c61f 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -458,3 +458,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 152034b8e0a0..78d15010cd77 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -391,3 +391,4 @@
 450	n32	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	n32	cachestat			sys_cachestat
 452	n32	fchmodat2			sys_fchmodat2
+453	n32	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index cb5e757f6621..813614fedb72 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -367,3 +367,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	n64	cachestat			sys_cachestat
 452	n64	fchmodat2			sys_fchmodat2
+453	n64	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 1a646813afdc..01d88d3a6f3e 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -440,3 +440,4 @@
 450	o32	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	o32	cachestat			sys_cachestat
 452	o32	fchmodat2			sys_fchmodat2
+453	o32	mseal				sys_mseal
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e97c175b56f9..d52d08f0a1ea 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -451,3 +451,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal				sys_mseal
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 20e50586e8a2..d38deba73a7b 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -539,3 +539,4 @@
 450 	nospu	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal				sys_mseal
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 0122cc156952..cf3243c2978b 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -455,3 +455,4 @@
 450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
 451  common	cachestat		sys_cachestat			sys_cachestat
 452  common	fchmodat2		sys_fchmodat2			sys_fchmodat2
+453  common	mseal			sys_mseal			sys_mseal
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index e90d585c4d3e..76f1cd33adaa 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -455,3 +455,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal				sys_mseal
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4ed06c71c43f..d7728695d780 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -498,3 +498,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal 				sys_mseal
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2d0b1bd866ea..6d4cc386df22 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -457,3 +457,4 @@
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	i386	cachestat		sys_cachestat
 452	i386	fchmodat2		sys_fchmodat2
+453	i386	mseal 			sys_mseal
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 814768249eae..73dcfc43d921 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -374,6 +374,7 @@
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
 451	common	cachestat		sys_cachestat
 452	common	fchmodat2		sys_fchmodat2
+453 	common  mseal			sys_mseal
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index fc1a4f3c81d9..e8fd3bf35d73 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -423,3 +423,4 @@
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
 451	common	cachestat			sys_cachestat
 452	common	fchmodat2			sys_fchmodat2
+453	common	mseal 				sys_mseal
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index abe087c53b4b..0c945a798208 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -823,8 +823,11 @@ __SYSCALL(__NR_cachestat, sys_cachestat)
 #define __NR_fchmodat2 452
 __SYSCALL(__NR_fchmodat2, sys_fchmodat2)
 
+#define __NR_mseal 453
+__SYSCALL(__NR_mseal, sys_mseal)
+
 #undef __NR_syscalls
-#define __NR_syscalls 453
+#define __NR_syscalls 454
 
 /*
  * 32 bit systems traditionally used different
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 3/8] mseal: add can_modify_mm and can_modify_vma
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

can_modify_mm:
  checks sealing flags for the given memory range.

can_modify_vma:
  checks sealing flags for the given vma.
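
A sketch of the intended call pattern, mirroring what the later patches
in this series do for the affected syscalls (the mprotect case lands in
patch 4/8): deny the operation early, before any VMA has been updated.

	if (!can_modify_mm(current->mm, start, end, MM_ACTION_MPROTECT,
			   ON_BEHALF_OF_USERSPACE))
		return -EACCES;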

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 include/linux/mm.h | 34 ++++++++++++++++++++++++++
 mm/mseal.c         | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e790b91a0cd4..aafdb68950f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -257,6 +257,18 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif
 
+enum caller_origin {
+	ON_BEHALF_OF_KERNEL = 0,
+	ON_BEHALF_OF_USERSPACE,
+};
+
+enum mm_action {
+	MM_ACTION_MPROTECT,
+	MM_ACTION_MUNMAP,
+	MM_ACTION_MREMAP,
+	MM_ACTION_MMAP,
+};
+
 /*
  * vm_seals in vm_area_struct, see mm_types.h.
  */
@@ -3302,6 +3314,28 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
 static inline void mm_populate(unsigned long addr, unsigned long len) {}
 #endif
 
+#ifdef CONFIG_MSEAL
+extern bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+			  unsigned long end, enum mm_action action,
+			  enum caller_origin called);
+
+extern bool can_modify_vma(struct vm_area_struct *vma, enum mm_action action,
+		    enum caller_origin called);
+#else
+static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+			  unsigned long end, enum mm_action action,
+			  enum caller_origin called)
+{
+	return true;
+}
+
+static inline bool can_modify_vma(struct vm_area_struct *vma, enum mm_action action,
+		    enum caller_origin called)
+{
+	return true;
+}
+#endif
+
 /* These take the mm semaphore themselves */
 extern int __must_check vm_brk(unsigned long, unsigned long);
 extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
diff --git a/mm/mseal.c b/mm/mseal.c
index 615b6e06ab44..3285ef6b95a6 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -36,6 +36,66 @@ static bool can_do_mseal(unsigned int types, unsigned int flags)
 	return true;
 }
 
+/*
+ * Check if a vma is sealed for modification.
+ * Return true if modification is allowed.
+ */
+bool can_modify_vma(struct vm_area_struct *vma, enum mm_action action,
+		    enum caller_origin called)
+{
+	if (called == ON_BEHALF_OF_KERNEL)
+		return true;
+
+	switch (action) {
+	case MM_ACTION_MPROTECT:
+		if (vma->vm_seals & VM_SEAL_MPROTECT)
+			return false;
+		break;
+
+	case MM_ACTION_MUNMAP:
+		if (vma->vm_seals & VM_SEAL_MUNMAP)
+			return false;
+		break;
+
+	case MM_ACTION_MREMAP:
+		if (vma->vm_seals & VM_SEAL_MREMAP)
+			return false;
+		break;
+
+	case MM_ACTION_MMAP:
+		if (vma->vm_seals & VM_SEAL_MMAP)
+			return false;
+		break;
+	}
+
+	return true;
+}
+
+/*
+ * Check if the vmas of a memory range are allowed to be modified.
+ * The memory range can have a gap (unallocated memory).
+ * Return true if it is allowed.
+ */
+bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end,
+		   enum mm_action action, enum caller_origin called)
+{
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, mm, start);
+
+	if (called == ON_BEHALF_OF_KERNEL)
+		return true;
+
+	/* going through each vma to check */
+	for_each_vma_range(vmi, vma, end) {
+		if (!can_modify_vma(vma, action, called))
+			return false;
+	}
+
+	/* Allow by default. */
+	return true;
+}
+
 /*
  * Check if a seal type can be added to VMA.
  */
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 4/8] mseal: seal mprotect
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

Check sealing for mprotect(2).

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 mm/mprotect.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 130db91d3a8c..5b67c66d55f7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -753,6 +753,12 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		}
 	}
 
+	if (!can_modify_mm(current->mm, start, end, MM_ACTION_MPROTECT,
+			   ON_BEHALF_OF_USERSPACE)) {
+		error = -EACCES;
+		goto out;
+	}
+
 	prev = vma_prev(&vmi);
 	if (start > vma->vm_start)
 		prev = vma;
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 5/8] mseal munmap
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

Check sealing for munmap(2).

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 22 ++++++++++++++--------
 mm/mremap.c        |  5 +++--
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index aafdb68950f8..95b793eb3a80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3294,7 +3294,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
-			 bool unlock);
+			 bool unlock, enum caller_origin called);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9b6c477e713e..f4bfcc5d2c10 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2601,6 +2601,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  * @len: The length of the range to munmap
  * @uf: The userfaultfd list_head
  * @unlock: set to true if the user wants to drop the mmap_lock on success
+ * @called: caller origin
  *
  * This function takes a @mas that is either pointing to the previous VMA or set
  * to MA_START and sets it up to remove the mapping(s).  The @len will be
@@ -2611,7 +2612,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  */
 int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
-		  bool unlock)
+		  bool unlock, enum caller_origin called)
 {
 	unsigned long end;
 	struct vm_area_struct *vma;
@@ -2623,6 +2624,9 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
+	if (!can_modify_mm(mm, start, end, MM_ACTION_MUNMAP, called))
+		return -EACCES;
+
 	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
@@ -2650,7 +2654,8 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 {
 	VMA_ITERATOR(vmi, mm, start);
 
-	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
+	return do_vmi_munmap(&vmi, mm, start, len, uf, false,
+				ON_BEHALF_OF_KERNEL);
 }
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
@@ -2684,7 +2689,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Unmap any existing mapping in the area */
-	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false))
+	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false, ON_BEHALF_OF_KERNEL))
 		return -ENOMEM;
 
 	/*
@@ -2909,7 +2914,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return error;
 }
 
-static int __vm_munmap(unsigned long start, size_t len, bool unlock)
+static int __vm_munmap(unsigned long start, size_t len, bool unlock,
+			enum caller_origin called)
 {
 	int ret;
 	struct mm_struct *mm = current->mm;
@@ -2919,7 +2925,7 @@ static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock);
+	ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock, called);
 	if (ret || !unlock)
 		mmap_write_unlock(mm);
 
@@ -2929,14 +2935,14 @@ static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 
 int vm_munmap(unsigned long start, size_t len)
 {
-	return __vm_munmap(start, len, false);
+	return __vm_munmap(start, len, false, ON_BEHALF_OF_KERNEL);
 }
 EXPORT_SYMBOL(vm_munmap);
 
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	addr = untagged_addr(addr);
-	return __vm_munmap(addr, len, true);
+	return __vm_munmap(addr, len, true, ON_BEHALF_OF_USERSPACE);
 }
 
 
@@ -3168,7 +3174,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 	if (ret)
 		goto limits_failed;
 
-	ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0);
+	ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0, ON_BEHALF_OF_KERNEL);
 	if (ret)
 		goto munmap_failed;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 056478c106ee..e43f9ceaa29d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -715,7 +715,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	}
 
 	vma_iter_init(&vmi, mm, old_addr);
-	if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
+	if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false,
+				ON_BEHALF_OF_KERNEL)) {
 		/* OOM: unable to split vma, just get accounts right */
 		if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
 			vm_acct_memory(old_len >> PAGE_SHIFT);
@@ -1009,7 +1010,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		}
 
 		ret = do_vmi_munmap(&vmi, mm, addr + new_len, old_len - new_len,
-				    &uf_unmap, true);
+				    &uf_unmap, true, ON_BEHALF_OF_KERNEL);
 		if (ret)
 			goto out;
 
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 6/8] mseal mremap
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

Check sealing for mremap(2).

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 mm/mremap.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index e43f9ceaa29d..2288f9d0b126 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -836,7 +836,15 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
 		return -ENOMEM;
 
+	if (!can_modify_mm(mm, addr, addr + old_len, MM_ACTION_MREMAP,
+			   ON_BEHALF_OF_USERSPACE))
+		return -EACCES;
+
 	if (flags & MREMAP_FIXED) {
+		if (!can_modify_mm(mm, new_addr, new_addr + new_len,
+				   MM_ACTION_MREMAP, ON_BEHALF_OF_USERSPACE))
+			return -EACCES;
+
 		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
 		if (ret)
 			goto out;
@@ -995,6 +1003,12 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		goto out;
 	}
 
+	if (!can_modify_mm(mm, addr, addr + old_len, MM_ACTION_MREMAP,
+			   ON_BEHALF_OF_USERSPACE)) {
+		ret = -EACCES;
+		goto out;
+	}
+
 	/*
 	 * Always allow a shrinking remap: that just unmaps
 	 * the unnecessary pages..
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 7/8] mseal mmap
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

Check sealing for mmap(2).

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 fs/aio.c           |  5 +++--
 include/linux/mm.h |  5 ++++-
 ipc/shm.c          |  3 ++-
 mm/internal.h      |  4 ++--
 mm/mmap.c          | 13 +++++++++----
 mm/nommu.c         |  6 ++++--
 mm/util.c          |  8 +++++---
 7 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index b3174da80ff6..81040126dd45 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -557,8 +557,9 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	}
 
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
-				 PROT_READ | PROT_WRITE,
-				 MAP_SHARED, 0, &unused, NULL);
+				PROT_READ | PROT_WRITE, MAP_SHARED, 0, &unused,
+				NULL, ON_BEHALF_OF_KERNEL);
+
 	mmap_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 95b793eb3a80..ffa2eb9bd475 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3289,9 +3289,12 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
 	struct list_head *uf);
+
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
+	unsigned long pgoff, unsigned long *populate, struct list_head *uf,
+	enum caller_origin called);
+
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool unlock, enum caller_origin called);
diff --git a/ipc/shm.c b/ipc/shm.c
index 60e45e7045d4..14aebeefe155 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1662,7 +1662,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, &populate, NULL,
+			ON_BEHALF_OF_KERNEL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git a/mm/internal.h b/mm/internal.h
index d1d4bf4e63c0..4361eaf3d1c6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -800,8 +800,8 @@ extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
 
 extern unsigned long  __must_check vm_mmap_pgoff(struct file *, unsigned long,
-        unsigned long, unsigned long,
-        unsigned long, unsigned long);
+	unsigned long, unsigned long, unsigned long, unsigned long,
+	enum caller_origin called);
 
 extern void set_pageblock_order(void);
 unsigned long reclaim_pages(struct list_head *folio_list);
diff --git a/mm/mmap.c b/mm/mmap.c
index f4bfcc5d2c10..a42fe27a7d04 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1197,7 +1197,8 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, unsigned long pgoff,
-			unsigned long *populate, struct list_head *uf)
+			unsigned long *populate, struct list_head *uf,
+			enum caller_origin called)
 {
 	struct mm_struct *mm = current->mm;
 	vm_flags_t vm_flags;
@@ -1365,6 +1366,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (!can_modify_mm(mm, addr, addr + len, MM_ACTION_MMAP, called))
+		return -EACCES;
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
@@ -1411,7 +1415,8 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 			return PTR_ERR(file);
 	}
 
-	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff,
+				ON_BEHALF_OF_USERSPACE);
 out_fput:
 	if (file)
 		fput(file);
@@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 		flags |= MAP_LOCKED;
 
 	file = get_file(vma->vm_file);
-	ret = do_mmap(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+	ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
+			&populate, NULL, ON_BEHALF_OF_KERNEL);
 	fput(file);
 out:
 	mmap_write_unlock(mm);
diff --git a/mm/nommu.c b/mm/nommu.c
index 8dba41cfc44d..8c11de5dd8e6 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1018,7 +1018,8 @@ unsigned long do_mmap(struct file *file,
 			unsigned long flags,
 			unsigned long pgoff,
 			unsigned long *populate,
-			struct list_head *uf)
+			struct list_head *uf,
+			enum caller_origin called)
 {
 	struct vm_area_struct *vma;
 	struct vm_region *region;
@@ -1262,7 +1263,8 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 			goto out;
 	}
 
-	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff,
+				ON_BEHALF_OF_USERSPACE);
 
 	if (file)
 		fput(file);
diff --git a/mm/util.c b/mm/util.c
index 4ed8b9b5273c..aaf37c3af517 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -532,7 +532,8 @@ EXPORT_SYMBOL_GPL(account_locked_vm);
 
 unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot,
-	unsigned long flag, unsigned long pgoff)
+	unsigned long flag, unsigned long pgoff,
+	enum caller_origin called)
 {
 	unsigned long ret;
 	struct mm_struct *mm = current->mm;
@@ -544,7 +545,7 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 		if (mmap_write_lock_killable(mm))
 			return -EINTR;
 		ret = do_mmap(file, addr, len, prot, flag, pgoff, &populate,
-			      &uf);
+			      &uf, called);
 		mmap_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
@@ -562,7 +563,8 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 	if (unlikely(offset_in_page(offset)))
 		return -EINVAL;
 
-	return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
+	return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT,
+				ON_BEHALF_OF_KERNEL);
 }
 EXPORT_SYMBOL(vm_mmap);
 
-- 
2.42.0.609.gbb76f46606-goog



* [RFC PATCH v1 8/8] selftest mm/mseal mprotect/munmap/mremap/mmap
From: jeffxu @ 2023-10-16 14:38 UTC
  To: akpm, keescook, sroettger
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	jannh, surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen,
	ben, catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

From: Jeff Xu <jeffxu@google.com>

Add selftests for sealing mprotect/munmap/mremap/mmap.

Signed-off-by: Jeff Xu <jeffxu@google.com>
---
 tools/testing/selftests/mm/Makefile     |    1 +
 tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++++++
 2 files changed, 1429 insertions(+)
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 6a9fc5693145..0c086cecc093 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mrelease_test
 TEST_GEN_FILES += mremap_dontunmap
 TEST_GEN_FILES += mremap_test
+TEST_GEN_FILES += mseal_test
 TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
new file mode 100644
index 000000000000..d6ae09729394
--- /dev/null
+++ b/tools/testing/selftests/mm/mseal_test.c
@@ -0,0 +1,1428 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/mman.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <stdbool.h>
+#include "../kselftest.h"
+#include <syscall.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+
+#ifndef MM_SEAL_MSEAL
+#define MM_SEAL_MSEAL 0x1
+#endif
+
+#ifndef MM_SEAL_MPROTECT
+#define MM_SEAL_MPROTECT 0x2
+#endif
+
+#ifndef MM_SEAL_MUNMAP
+#define MM_SEAL_MUNMAP 0x4
+#endif
+
+#ifndef MM_SEAL_MMAP
+#define MM_SEAL_MMAP 0x8
+#endif
+
+#ifndef MM_SEAL_MREMAP
+#define MM_SEAL_MREMAP 0x10
+#endif
+
+#ifndef DEBUG
+#define LOG_TEST_ENTER()	{}
+#else
+#define LOG_TEST_ENTER()	{ printf("%s\n", __func__); }
+#endif
+
+static int sys_mseal(void *start, size_t len, int types)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_mseal, start, len, types, 0);
+	return sret;
+}
+
+static int sys_mprotect(void *ptr, size_t size, unsigned long prot)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(SYS_mprotect, ptr, size, prot);
+	return sret;
+}
+
+static int sys_munmap(void *ptr, size_t size)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(SYS_munmap, ptr, size);
+	return sret;
+}
+
+static int sys_madvise(void *start, size_t len, int types)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_madvise, start, len, types);
+	return sret;
+}
+
+void *addr1 = (void *)0x50000000;
+void *addr2 = (void *)0x50004000;
+void *addr3 = (void *)0x50008000;
+void setup_single_address(int size, void **ptrOut)
+{
+	void *ptr;
+
+	ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+	*ptrOut = ptr;
+}
+
+void setup_single_fixed_address(int size, void **ptrOut)
+{
+	void *ptr;
+
+	ptr = mmap(addr1, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr == (void *)addr1);
+
+	*ptrOut = ptr;
+}
+
+void clean_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+}
+
+void seal_mprotect_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT);
+	assert(!ret);
+}
+
+static void test_seal_addseals(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* adding seal one by one */
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MMAP);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_addseals_combined(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	/* adding multiple seals */
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MREMAP |
+			MM_SEAL_MSEAL);
+	assert(!ret);
+
+	/* not adding more seal type, so ok. */
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_MMAP | MM_SEAL_MREMAP | MM_SEAL_MSEAL);
+	assert(!ret);
+
+	/* not adding more seal type, so ok. */
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_addseals_reject(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT | MM_SEAL_MSEAL);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	/* MM_SEAL_MSEAL is set, so adding a new seal type is not allowed. */
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MMAP);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MMAP | MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_unmapped_start(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// munmap 2 pages from ptr.
+	ret = sys_munmap(ptr, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail because 2 pages from ptr are unmapped.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	// mseal will fail because 2 pages from ptr are unmapped.
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_unmapped_middle(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// munmap 2 pages starting at ptr + page_size.
+	ret = sys_munmap(ptr + page_size, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail, since the 4-page range contains unmapped pages.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	// mseal will fail as well.
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	/* we can still add seals to the first page and the last page */
+	ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	ret = sys_mseal(ptr + 3 * page_size, page_size,
+			MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_unmapped_end(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// unmap last 2 pages.
+	ret = sys_munmap(ptr + 2 * page_size, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail since the last 2 pages are unmapped.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	// mseal will fail as well.
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	/* The first 2 pages are not sealed, and seals can be added */
+	ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_multiple_vmas(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split the vma into 3.
+	ret = sys_mprotect(ptr + page_size, 2 * page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// mprotect will get applied to all 4 pages - 3 VMAs.
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	assert(!ret);
+
+	// use mprotect to split the vma into 3.
+	ret = sys_mprotect(ptr + page_size, 2 * page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// mseal gets applied to all 4 pages - 3 VMAs.
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	// verify adding a new seal type will fail after MM_SEAL_MSEAL is set.
+	ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr + page_size, 2 * page_size,
+			MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr + 3 * page_size, page_size,
+			MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(ret < 0);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_split_start(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split at middle */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* seal the first page, this will split the VMA */
+	ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	/* can't add seal to the first page */
+	ret = sys_mseal(ptr, page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(ret < 0);
+
+	/* add seals to the remaining 3 pages */
+	ret = sys_mseal(ptr + page_size, 3 * page_size,
+			MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_split_end(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_fixed_address(size, &ptr);
+
+	/* use mprotect to split at middle */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* seal the last page */
+	ret = sys_mseal(ptr + 3 * page_size, page_size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	/* adding seal to the last page is rejected. */
+	ret = sys_mseal(ptr + 3 * page_size, page_size,
+			MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(ret < 0);
+
+	/* Adding seals to the first 3 pages */
+	ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_MSEAL | MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_invalid_input(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_fixed_address(size, &ptr);
+
+	/* invalid seal type bits */
+	ret = sys_mseal(ptr, size, 0x20);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr, size, 0x31);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr, size, 0x3F);
+	assert(ret < 0);
+
+	/* unaligned address */
+	ret = sys_mseal(ptr + 1, 2 * page_size, MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	/* length too big */
+	ret = sys_mseal(ptr, 5 * page_size, MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	/* start is not in a valid VMA */
+	ret = sys_mseal(ptr - page_size, 5 * page_size, MM_SEAL_MSEAL);
+	assert(ret < 0);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_zero_length(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mprotect(ptr, 0, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* sealing a 0-length range is OK, same as mprotect */
+	ret = sys_mseal(ptr, 0, MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	// verify the 4 pages are not sealed by previous call.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_twice(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	// applying the same seal again is OK; mseal is idempotent.
+	ret = sys_mseal(ptr, size, MM_SEAL_MPROTECT);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MREMAP |
+			MM_SEAL_MSEAL);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_MPROTECT | MM_SEAL_MMAP | MM_SEAL_MREMAP |
+			MM_SEAL_MSEAL);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_MSEAL);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, size);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_start_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size);
+
+	// the first page is sealed.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// pages after the first page are not sealed.
+	ret = sys_mprotect(ptr + page_size, page_size * 3,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_end_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr + page_size, 3 * page_size);
+
+	/* first page is not sealed */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* last 3 pages are sealed */
+	ret = sys_mprotect(ptr + page_size, page_size * 3,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_unalign_len(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size * 2 - 1);
+
+	// 2 pages are sealed.
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 2, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_unalign_len_variant_2(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size * 2 + 1);
+
+	// 3 pages are sealed.
+	ret = sys_mprotect(ptr, page_size * 3, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 3, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_two_vma(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size * 4);
+
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 2, page_size * 2,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_two_vma_with_split(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split into two VMAs.
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// mseal can apply across 2 VMAs, and also splits them.
+	if (seal)
+		seal_mprotect_single_address(ptr + page_size, page_size * 2);
+
+	// the first page is not sealed.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// the second page is sealed.
+	ret = sys_mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// the third page is sealed.
+	ret = sys_mprotect(ptr + 2 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// the fourth page is not sealed.
+	ret = sys_mprotect(ptr + 3 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_partial_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// seal one page.
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size);
+
+	// mprotect on the first 2 pages will fail, since the first page is sealed.
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_two_vma_with_gap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// use mprotect to split.
+	ret = sys_mprotect(ptr + 3 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// use munmap to free two pages in the middle
+	ret = sys_munmap(ptr + page_size, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail, because there is a gap in the address range.
+	// note: internally mprotect still updates the first page.
+	ret = sys_mprotect(ptr, 4 * page_size, PROT_READ);
+	assert(ret < 0);
+
+	// mseal will fail as well.
+	ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_MPROTECT);
+	assert(ret < 0);
+
+	// unlike mprotect, mseal is all-or-none: the first page is not sealed.
+	ret = sys_mprotect(ptr, page_size, PROT_READ);
+	assert(ret == 0);
+
+	// the last page is not sealed.
+	ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ);
+	assert(ret == 0);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_split(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// seal all 4 pages.
+	if (seal) {
+		ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_MPROTECT);
+		assert(!ret);
+	}
+
+	// madvise is OK.
+	ret = sys_madvise(ptr, page_size * 2, MADV_WILLNEED);
+	assert(!ret);
+
+	// mprotect is sealed.
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+
+	ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_mprotect_merge(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split one page.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// seal first two pages.
+	if (seal) {
+		ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_MPROTECT);
+		assert(!ret);
+	}
+
+	ret = sys_madvise(ptr, page_size, MADV_WILLNEED);
+	assert(!ret);
+
+	// 2 pages are sealed.
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// last 2 pages are not sealed.
+	ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ);
+	assert(ret == 0);
+
+	clean_single_address(ptr, size);
+}
+
+static void test_seal_munmap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MUNMAP);
+		assert(!ret);
+	}
+
+	// 4 pages are sealed.
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+/*
+ * allocate 4 pages,
+ * use mprotect to split it as two VMAs
+ * seal the whole range
+ * munmap will fail on both
+ */
+static void test_seal_munmap_two_vma(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MUNMAP);
+		assert(!ret);
+	}
+
+	ret = sys_munmap(ptr, page_size * 2);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+/*
+ * allocate a VMA with 4 pages.
+ * munmap the middle 2 pages.
+ * sealing the whole 4 pages will fail because of the gap,
+ * note: none of the pages end up sealed.
+ * munmap of the first page will be OK.
+ * munmap of the last page will be OK.
+ */
+static void test_seal_munmap_vma_with_gap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	assert(!ret);
+
+	if (seal) {
+		// can't have a gap in the middle.
+		ret = sys_mseal(ptr, size, MM_SEAL_MUNMAP);
+		assert(ret < 0);
+	}
+
+	ret = sys_munmap(ptr, page_size);
+	assert(!ret);
+
+	ret = sys_munmap(ptr + page_size * 2, page_size);
+	assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	assert(!ret);
+}
+
+static void test_munmap_start_freed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// unmap the first page.
+	ret = sys_munmap(ptr, page_size);
+	assert(!ret);
+
+	// seal the last 3 pages.
+	if (seal) {
+		ret = sys_mseal(ptr + page_size, 3 * page_size, MM_SEAL_MUNMAP);
+		assert(!ret);
+	}
+
+	// unmap from the first page.
+	ret = sys_munmap(ptr, size);
+	if (seal) {
+		assert(ret < 0);
+
+		// use mprotect to verify the pages are not unmapped.
+		ret = sys_mprotect(ptr + page_size, 3 * page_size, PROT_READ);
+		assert(!ret);
+	} else
+		// note: this will be OK, even though the first page is
+		// already unmapped.
+		assert(!ret);
+}
+
+static void test_munmap_end_freed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	// unmap last page.
+	ret = sys_munmap(ptr + page_size * 3, page_size);
+	assert(!ret);
+
+	// seal the first 3 pages.
+	if (seal) {
+		ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_MUNMAP);
+		assert(!ret);
+	}
+
+	// unmap all pages.
+	ret = sys_munmap(ptr, size);
+	if (seal) {
+		assert(ret < 0);
+
+		// use mprotect to verify the pages are not unmapped.
+		ret = sys_mprotect(ptr, 3 * page_size, PROT_READ);
+		assert(!ret);
+	} else
+		assert(!ret);
+}
+
+static void test_munmap_middle_freed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	// unmap 2 pages in the middle.
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	assert(!ret);
+
+	// seal the first page.
+	if (seal) {
+		ret = sys_mseal(ptr, page_size, MM_SEAL_MUNMAP);
+		assert(!ret);
+	}
+
+	// munmap all 4 pages.
+	ret = sys_munmap(ptr, size);
+	if (seal) {
+		assert(ret < 0);
+
+		// use mprotect to verify the page is not unmapped.
+		ret = sys_mprotect(ptr, page_size, PROT_READ);
+		assert(!ret);
+	} else
+		assert(!ret);
+}
+
+void test_seal_mremap_shrink(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// shrink from 4 pages to 2 pages.
+	ret2 = mremap(ptr, size, 2 * page_size, 0, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+		clean_single_address(ret2, 2 * page_size);
+	}
+	clean_single_address(ptr, size);
+}
+
+void test_seal_mremap_expand(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	// unmap the last 2 pages.
+	ret = sys_munmap(ptr + 2 * page_size, 2 * page_size);
+	assert(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// expand from 2 pages to 4 pages.
+	ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 == ptr);
+		clean_single_address(ret2, 4 * page_size);
+	}
+	clean_single_address(ptr, size);
+}
+
+void test_seal_mremap_move(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// move from ptr to fixed address.
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, addr1);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+		clean_single_address(ret2, size);
+	}
+	clean_single_address(ptr, size);
+}
+
+void test_seal_mmap_overwrite_prot(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MMAP);
+		assert(!ret);
+	}
+
+	// use mmap to change protection.
+	ret2 = mmap(ptr, size, PROT_NONE,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == ptr);
+
+	clean_single_address(ptr, size);
+}
+
+void test_seal_mremap_shrink_fixed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_fixed_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// mremap to move and shrink to fixed address
+	ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			newAddr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == newAddr);
+
+	clean_single_address(ptr, size);
+	clean_single_address(newAddr, size);
+}
+
+void test_seal_mremap_expand_fixed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(page_size, &ptr);
+	setup_single_fixed_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(newAddr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// mremap to move and expand to fixed address
+	ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			newAddr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == newAddr);
+
+	clean_single_address(ptr, page_size);
+	clean_single_address(newAddr, size);
+}
+
+void test_seal_mremap_move_fixed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_fixed_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(newAddr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// mremap to move to fixed address
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == newAddr);
+
+	clean_single_address(ptr, size);
+	clean_single_address(newAddr, size);
+}
+
+void test_seal_mremap_move_fixed_zero(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	/*
+	 * MREMAP_FIXED can move the mapping to zero address
+	 */
+	ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 == 0);
+		clean_single_address(ret2, 2 * page_size);
+	}
+	clean_single_address(ptr, size);
+}
+
+void test_seal_mremap_move_dontunmap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	// mremap to move, and don't unmap src addr.
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+		clean_single_address(ret2, size);
+	}
+
+	clean_single_address(ptr, size);
+}
+
+void test_seal_mremap_move_dontunmap_anyaddr(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_MREMAP);
+		assert(!ret);
+	}
+
+	/*
+	 * The 0xdeaddead hint should have no effect on the dest addr
+	 * when MREMAP_DONTUNMAP is set.
+	 */
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+			0xdeaddead);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+		assert((long)ret2 != 0xdeaddead);
+		clean_single_address(ret2, size);
+	}
+
+	clean_single_address(ptr, size);
+}
+
+int main(int argc, char **argv)
+{
+	test_seal_invalid_input();
+	test_seal_addseals();
+	test_seal_addseals_combined();
+	test_seal_addseals_reject();
+	test_seal_unmapped_start();
+	test_seal_unmapped_middle();
+	test_seal_unmapped_end();
+	test_seal_multiple_vmas();
+	test_seal_split_start();
+	test_seal_split_end();
+
+	test_seal_zero_length();
+	test_seal_twice();
+
+	test_seal_mprotect(false);
+	test_seal_mprotect(true);
+
+	test_seal_start_mprotect(false);
+	test_seal_start_mprotect(true);
+
+	test_seal_end_mprotect(false);
+	test_seal_end_mprotect(true);
+
+	test_seal_mprotect_unalign_len(false);
+	test_seal_mprotect_unalign_len(true);
+
+	test_seal_mprotect_unalign_len_variant_2(false);
+	test_seal_mprotect_unalign_len_variant_2(true);
+
+	test_seal_mprotect_two_vma(false);
+	test_seal_mprotect_two_vma(true);
+
+	test_seal_mprotect_two_vma_with_split(false);
+	test_seal_mprotect_two_vma_with_split(true);
+
+	test_seal_mprotect_partial_mprotect(false);
+	test_seal_mprotect_partial_mprotect(true);
+
+	test_seal_mprotect_two_vma_with_gap(false);
+	test_seal_mprotect_two_vma_with_gap(true);
+
+	test_seal_mprotect_merge(false);
+	test_seal_mprotect_merge(true);
+
+	test_seal_mprotect_split(false);
+	test_seal_mprotect_split(true);
+
+	test_seal_munmap(false);
+	test_seal_munmap(true);
+	test_seal_munmap_two_vma(false);
+	test_seal_munmap_two_vma(true);
+	test_seal_munmap_vma_with_gap(false);
+	test_seal_munmap_vma_with_gap(true);
+
+	test_munmap_start_freed(false);
+	test_munmap_start_freed(true);
+	test_munmap_middle_freed(false);
+	test_munmap_middle_freed(true);
+	test_munmap_end_freed(false);
+	test_munmap_end_freed(true);
+
+	test_seal_mremap_shrink(false);
+	test_seal_mremap_shrink(true);
+	test_seal_mremap_expand(false);
+	test_seal_mremap_expand(true);
+	test_seal_mremap_move(false);
+	test_seal_mremap_move(true);
+
+	test_seal_mremap_shrink_fixed(false);
+	test_seal_mremap_shrink_fixed(true);
+	test_seal_mremap_expand_fixed(false);
+	test_seal_mremap_expand_fixed(true);
+	test_seal_mremap_move_fixed(false);
+	test_seal_mremap_move_fixed(true);
+	test_seal_mremap_move_dontunmap(false);
+	test_seal_mremap_move_dontunmap(true);
+	test_seal_mremap_move_fixed_zero(false);
+	test_seal_mremap_move_fixed_zero(true);
+	test_seal_mremap_move_dontunmap_anyaddr(false);
+	test_seal_mremap_move_dontunmap_anyaddr(true);
+
+	test_seal_mmap_overwrite_prot(false);
+	test_seal_mmap_overwrite_prot(true);
+
+	printf("OK\n");
+	return 0;
+}
-- 
2.42.0.609.gbb76f46606-goog



* Re: [RFC PATCH v1 1/8] Add mseal syscall
  2023-10-16 14:38 ` [RFC PATCH v1 1/8] Add mseal syscall jeffxu
@ 2023-10-16 15:05   ` Greg KH
  2023-10-17  6:50     ` Jeff Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Greg KH @ 2023-10-16 15:05 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, sroettger, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, willy, mawupeng1, linmiaohe,
	namit, peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng,
	yu.ma, zhangpeng362, dave.hansen, luto, linux-hardening

On Mon, Oct 16, 2023 at 02:38:20PM +0000, jeffxu@chromium.org wrote:
> +#ifdef CONFIG_MSEAL
> +	/*
> +	 * bit masks for seal.
> +	 * need this since vm_flags is full.
> +	 */
> +	unsigned long vm_seals;		/* seal flags, see mm.h. */

"unsigned long" and yet:

> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index c0cb22cd607d..f574c7dbee76 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
> +			  unsigned int flags);

"unsigned int"?

Why the mis-match?

> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -55,4 +55,10 @@ struct cachestat {
>  	__u64 nr_recently_evicted;
>  };
>  
> +#define MM_SEAL_MSEAL		0x1
> +#define MM_SEAL_MPROTECT	0x2
> +#define MM_SEAL_MUNMAP		0x4
> +#define MM_SEAL_MMAP		0x8
> +#define MM_SEAL_MREMAP		0x10

I think we can use the BIT() macro in uapi .h files now, it is _BITUL().
Might want to use it here too to make it obvious what is happening.
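
Something like this, perhaps (just a sketch, assuming _BITUL() from
<linux/const.h> is acceptable in this uapi header):

#define MM_SEAL_MSEAL		_BITUL(0)	/* deny adding new seal types */
#define MM_SEAL_MPROTECT	_BITUL(1)	/* deny mprotect() */
#define MM_SEAL_MUNMAP		_BITUL(2)	/* deny munmap() */
#define MM_SEAL_MMAP		_BITUL(3)	/* deny mmap() */
#define MM_SEAL_MREMAP		_BITUL(4)	/* deny mremap() */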

thanks,

greg k-h


* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 14:38 [RFC PATCH v1 0/8] Introduce mseal() syscall jeffxu
                   ` (7 preceding siblings ...)
  2023-10-16 14:38 ` [RFC PATCH v1 8/8] selftest mm/mseal mprotect/munmap/mremap/mmap jeffxu
@ 2023-10-16 15:18 ` Matthew Wilcox
  2023-10-17  8:34   ` Jeff Xu
  2023-10-17 15:29   ` Pedro Falcato
  2023-10-16 17:23 ` Linus Torvalds
  2023-10-16 17:34 ` Jann Horn
  10 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2023-10-16 15:18 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, sroettger, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, mawupeng1, linmiaohe, namit,
	peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng, yu.ma,
	zhangpeng362, dave.hansen, luto, linux-hardening

On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> Modern CPUs support memory permissions such as RW and NX bits. Linux has
> supported NX since the release of kernel version 2.6.8 in August 2004 [1].

This seems like a confusing way to introduce the subject.  Here, you're
talking about page permissions, whereas (as far as I can tell), mseal() is
about making _virtual_ addresses immutable, for some value of immutable.

> Memory sealing additionally protects the mapping itself against
> modifications. This is useful to mitigate memory corruption issues where
> a corrupted pointer is passed to a memory management syscall. For example,
> such an attacker primitive can break control-flow integrity guarantees
> since read-only memory that is supposed to be trusted can become writable
> or .text pages can get remapped. Memory sealing can automatically be
> applied by the runtime loader to seal .text and .rodata pages and
> applications can additionally seal security critical data at runtime.
> A similar feature already exists in the XNU kernel with the
> VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.

This [2] seems very generic and wide-ranging, not helpful.  [5] was more
useful to understand what you're trying to do.

> The new mseal() is an architecture independent syscall, and with
> following signature:
> 
> mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> 
> addr/len: memory range.  Must be continuous/allocated memory, or else
> mseal() will fail and no VMA is updated. For details on acceptable
> arguments, please refer to comments in mseal.c. Those are also fully
> covered by the selftest.

Mmm.  So when you say "continuous/allocated" what you really mean is
"Must have contiguous VMAs" rather than "All pages in this range must
be populated", yes?

> types: bit mask to specify which syscall to seal, currently they are:
> MM_SEAL_MSEAL 0x1
> MM_SEAL_MPROTECT 0x2
> MM_SEAL_MUNMAP 0x4
> MM_SEAL_MMAP 0x8
> MM_SEAL_MREMAP 0x10

I don't understand why we want this level of granularity.  The OpenBSD
and XNU examples just say "This must be immutable*".  For values of
immutable that allow downgrading access (eg RW to RO or RX to RO),
but not upgrading access (RW->RX, RO->*, RX->RW).

> Each bit represents sealing for one specific syscall type, e.g.
> MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> is that the API is extendable, i.e. when needed, the sealing can be
> extended to madvise, mlock, etc. Backward compatibility is also easy.

Honestly, it feels too flexible.  Why not just two flags to mprotect()
-- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
maybe for some things we want to be able to downgrade and for other
things, we don't.
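
Roughly this shape (PROT_IMMUTABLE / PROT_DOWNGRADABLE are hypothetical
flags here, just to illustrate the API):

	/* .text and .rodata: lock the protection bits entirely */
	mprotect(text, len, PROT_READ | PROT_EXEC | PROT_IMMUTABLE);

	/* data that may later be downgraded to read-only,
	 * but never upgraded back to writable */
	mprotect(data, len, PROT_READ | PROT_WRITE | PROT_DOWNGRADABLE);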

I'd like to see some discussion of how this interacts with mprotect().
As far as I can tell, the intent is to lock the protections/existence
of the mapping, and not to force memory to stay in core.  So it's fine
for the kernel to swap out the page and set up a PTE as a swap entry.
It's also fine for the kernel to mark PTEs as RO to catch page faults;
we're concerned with the LOGICAL permissions, and not the page tables.


* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 14:38 [RFC PATCH v1 0/8] Introduce mseal() syscall jeffxu
                   ` (8 preceding siblings ...)
  2023-10-16 15:18 ` [RFC PATCH v1 0/8] Introduce mseal() syscall Matthew Wilcox
@ 2023-10-16 17:23 ` Linus Torvalds
  2023-10-17  9:07   ` Jeff Xu
  2023-10-16 17:34 ` Jann Horn
  10 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2023-10-16 17:23 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, sroettger, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, lstoakes, willy, mawupeng1, linmiaohe, namit,
	peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng, yu.ma,
	zhangpeng362, dave.hansen, luto, linux-hardening

On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote:
>
> This patchset proposes a new mseal() syscall for the Linux kernel.

So I have no objections to adding some kind of "lock down memory
mappings" model, but this isn't it.

First off, the simple stuff: the commit messages are worthless. Having

   check seal for mmap(2)

as the commit message is not even remotely acceptable, to pick one
random example from the series (7/8).

But that doesn't matter much, since I think the more fundamental
problems are much worse:

 - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is
just complete noise and totally illogical. The whole concept needs to
be redone.

Example of complete nonsense (again, picking 7/8, but that's again
just a random example):

> @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
>                 flags |= MAP_LOCKED;
>
>         file = get_file(vma->vm_file);
> -       ret = do_mmap(vma->vm_file, start, size,
> -                       prot, flags, pgoff, &populate, NULL);
> +       ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
> +                       &populate, NULL, ON_BEHALF_OF_KERNEL);
>         fput(file);
>  out:
>         mmap_write_unlock(mm);

Christ. That's *literally* the remap_file_pages() system call
definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense
in this context.

It's not the only situation. mremap() has the same thing. vm_munmap()
has the same thing. vm_brk_flags() has the same thing. None of them
make any sense.

But even if those obvious kinds of complete mis-designs were to be
individually fixed, the whole naming and concept is bogus. The
"ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check
sealing". Naming should describe what the thing *means*, not some
random policy thing that may or may not be at all relevant.

 - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.

Not only is it unacceptable to pointlessly create two different name
spaces for no obvious reason, code like this (from 1/8) should not
exist:

> +       if (types & MM_SEAL_MSEAL)
> +               newtypes |= VM_SEAL_MSEAL;
> +
> +       if (types & MM_SEAL_MPROTECT)
> +               newtypes |= VM_SEAL_MPROTECT;
> +
> +       if (types & MM_SEAL_MUNMAP)
> +               newtypes |= VM_SEAL_MUNMAP;
> +
> +       if (types & MM_SEAL_MMAP)
> +               newtypes |= VM_SEAL_MMAP;
> +
> +       if (types & MM_SEAL_MREMAP)
> +               newtypes |= VM_SEAL_MREMAP;

Sure, we have that in some cases when there was a *reason* for
namespacing imposed on us from an API standpoint (ie the "open()"
flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags
being turned into MNT_xyz flags), but those tend to be signs of
problems in the system call API where some mixup made it impossible to
use the flags directly (most commonly because there is some extended
internal representation of said things).

For example, the MS_xyz namespace is a combination of "these are flags
for the new mount" (like MS_RDONLY) and "this is how you should mount
it" (like MS_REMOUNT), and so the user interface has that odd mixing
of different things, and the MNT_xyz namespace is a distillation of
the former.

But we certainly should not strive to introduce *new* interfaces that
start out with this kind of mixup and pointless "translate from one
bit to another" code.

 - And finally (for now), I hate that MM_ACTION_xyz thing too!

Why do we have MM_ACTION_MREMAP, and pointless code like this (from 3/8):

> +       switch (action) {
> +       case MM_ACTION_MPROTECT:
> +               if (vma->vm_seals & VM_SEAL_MPROTECT)
> +                       return false;
> +               break;

when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and
sealing bitmask and just say

        if (vma->vm_seal & vm_action)
                return -EPERM;

IOW, you have pointlessly introduced not *two* different namespaces,
but *three*. All doing the exact same thing, and all just causing
pointless and ugly code to "translate" the actions of one into the
model of another.

So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz"
thing. Use *one* single "these are the sealing bits".
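
IOW, with a single namespace the whole check collapses to something
like (a sketch, not a tested patch):

	static bool can_modify_vma(struct vm_area_struct *vma,
				   unsigned long seal_bit)
	{
		/* seal_bit is the same MM_SEAL_xyz bit userspace passed in */
		return !(vma->vm_seals & seal_bit);
	}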

And get rid of "enum caller_origin" entirely. I don't know what the
right model for that thing is, but that isn't it.

*Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space
cannot set, but that the kernel can use internally - and if that is
the right model, then dammit, the *uses* should be very very obvious
why the override is fine, because that remap_file_pages() use sure as
hell was *not* obvious.

In fact, it's not at all obvious why anything should override the
sealing bits - EVER. So I'm not convinced that the right model is
"replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we
*ever* want to override sealing? It sounds like complete garbage. Most
of the users seem to be things like "execve()", which is nonsensical,
since the VM wouldn't have been sealed at that point _anyway_, so
having a "don't bother checking sealing bits" flag seems entirely
useless.

Anyway, this is all a resounding NAK on this series in this form. My
complaints are not some kind of small "fix this up". These are
fundamental issues.

            Linus


* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 14:38 [RFC PATCH v1 0/8] Introduce mseal() syscall jeffxu
                   ` (9 preceding siblings ...)
  2023-10-16 17:23 ` Linus Torvalds
@ 2023-10-16 17:34 ` Jann Horn
  2023-10-17  8:42   ` Jeff Xu
  10 siblings, 1 reply; 43+ messages in thread
From: Jann Horn @ 2023-10-16 17:34 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, sroettger, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, willy, mawupeng1, linmiaohe,
	namit, peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng,
	yu.ma, zhangpeng362, dave.hansen, luto, linux-hardening

On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote:
>
> From: Jeff Xu <jeffxu@google.com>
>
> This patchset proposes a new mseal() syscall for the Linux kernel.
>
> Modern CPUs support memory permissions such as RW and NX bits. Linux has
> supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> The memory permission feature improves security stance on memory
> corruption bugs, i.e. the attacker can’t just write to arbitrary memory
> and point the code to it, the memory has to be marked with X bit, or
> else an exception will happen.
>
> Memory sealing additionally protects the mapping itself against
> modifications. This is useful to mitigate memory corruption issues where
> a corrupted pointer is passed to a memory management syscall. For example,
> such an attacker primitive can break control-flow integrity guarantees
> since read-only memory that is supposed to be trusted can become writable
> or .text pages can get remapped. Memory sealing can automatically be
> applied by the runtime loader to seal .text and .rodata pages and
> applications can additionally seal security critical data at runtime.
> A similar feature already exists in the XNU kernel with the
> VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.
>
> The new mseal() is an architecture independent syscall, and with
> following signature:
>
> mseal(void addr, size_t len, unsigned int types, unsigned int flags)

Is the plan that the VMAs you need to protect would be created and
mseal()'ed while you expect that attacker code can not (yet) be
running concurrently?

Do you expect to be using sealed memory for shadow stacks (in x86 CET
/ arm64 GCS) to prevent an attacker from mixing those up by moving
pages inside a shadow stack or between different shadow stacks or
such? (If that's even possible, I think it is but I haven't tried.)

> addr/len: memory range.  Must be continuous/allocated memory, or else
> mseal() will fail and no VMA is updated. For details on acceptable
> arguments, please refer to comments in mseal.c. Those are also fully
> covered by the selftest.
> types: bit mask to specify which syscall to seal, currently they are:
> MM_SEAL_MSEAL 0x1
> MM_SEAL_MPROTECT 0x2
> MM_SEAL_MUNMAP 0x4
> MM_SEAL_MMAP 0x8
> MM_SEAL_MREMAP 0x10

You'd probably also want to block destructive madvise() operations
that can effectively alter region contents by discarding pages and
such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED;
probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd
want to just block all madvise() for sealed VMAs? Or rename
process_madvise_behavior_valid() to something like
"madvise_is_nondestructive()" and use that.

The following comments probably mostly don't matter in practice if
this feature is used in a context that is heavily seccomp-sandboxed
(like Desktop Linux Chrome), but should maybe be addressed to make
this feature more usable for other users. (Including Android Chrome,
which has a weaker sandbox...)

FWIW, it is also possible to write to read-only memory through the
/proc/self/mem interface (or through ptrace commands like
PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel
configuration, seccomp policy, and what the LSMs do. (I think Android
Chrome would allow /proc/self/mem writes, but would block
PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)

I had a related ancient patch series in 2016 with an attempt to allow
SELinux to prevent bypassing W^X protections with this, but I never
followed through with getting that landed...
(https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/)

I guess the question there is what the right semantics for this kind
of protected memory are when a debugger is active. The simple solution
that might break some debugging would be "just deny all FOLL_FORCE
write access for this memory" (which would prevent debuggers from
being able to place breakpoints, which would maybe not be great). But
maybe it makes more sense to consider this to be an independent
concern and solve it with a new SELinux feature or something like that
instead, and then document that mseal() requires some complement to
prevent forced writes to read-only private memory? (For which the
simplest solution would be "don't grant filesystem access or ptrace()
access to the sandboxed code".)


What is the intended interaction with userfaultfd, which I believe by
design permits arbitrary data into unpopulated areas of anonymous
VMAs? If the intention is that the process should otherwise be
sandboxed to not have access to userfaultfd, that should maybe be
documented. (Alternatively I guess you could theoretically remove the
VM_MAYWRITE bit from marked VMAs, but that might be more strict than
we want, since it'd also block all FOLL_FORCE writes.)


There are also some interfaces like AIO or the X86 Shadow Stacks
interface that indirectly unmap memory through the kernel and look
like they could perhaps be tricked into racily unmapping a
just-created sealed VMA. You'd probably have to make sure that they
can't do that and essentially treat their unmap operations as if they
came from userspace, I guess? What Linus just wrote.


I think either way this feature needs some documentation on what kind
of context it's supposed to run in.


> Each bit represents sealing for one specific syscall type, e.g.
> MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> is that the API is extendable, i.e. when needed, the sealing can be
> extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> The kernel will remember which seal types are applied, and the application
> doesn’t need to repeat all existing seal types in the next mseal().  Once
> a seal type is applied, it can’t be unsealed. Call mseal() on an existing
> seal type is a no-action, not a failure.
>
> MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
>
> Internally, vm_area_struct adds a new field vm_seals, to store the bit
> masks.
>
> For the affected syscalls, such as mprotect, a check(can_modify_mm) for
> sealing is added, this usually happens at the early point of the syscall,
> before any update is made to VMAs. The effect of that is: if any of the
> VMAs in the given address range fails the sealing check, none of the VMA
> will be updated. It might be worth noting that this is different from the
> rest of mprotect(), where some updates can happen even when mprotect
> returns fail. Consider can_modify_mm only checks vm_seals in
> vm_area_struct, and it is not going deeper in the page table or updating
> any HW, success or none behavior might fit better here. I would like to
> listen to the community's feedback on this.
>
> The idea that inspired this patch comes from Stephen Röttger’s work in
> V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.
>
> In addition, Stephen is working on glibc change to add sealing support
> into the dynamic linker to seal all non-writable segments at startup. When
> that work is completed, all applications can automatically benefit from
> these new protections.
>
> [1] https://kernelnewbies.org/Linux_2_6_8
> [2] https://v8.dev/blog/control-flow-integrity
> [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> [4] https://man.openbsd.org/mimmutable.2
> [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
>
> Jeff Xu (8):
>   Add mseal syscall
>   Wire up mseal syscall
>   mseal: add can_modify_mm and can_modify_vma
>   mseal: seal mprotect
>   mseal munmap
>   mseal mremap
>   mseal mmap
>   selftest mm/mseal mprotect/munmap/mremap/mmap
>
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
>  arch/arm/tools/syscall.tbl                  |    1 +
>  arch/arm64/include/asm/unistd.h             |    2 +-
>  arch/arm64/include/asm/unistd32.h           |    2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
>  fs/aio.c                                    |    5 +-
>  include/linux/mm.h                          |   55 +-
>  include/linux/mm_types.h                    |    7 +
>  include/linux/syscalls.h                    |    2 +
>  include/uapi/asm-generic/unistd.h           |    5 +-
>  include/uapi/linux/mman.h                   |    6 +
>  ipc/shm.c                                   |    3 +-
>  kernel/sys_ni.c                             |    1 +
>  mm/Kconfig                                  |    8 +
>  mm/Makefile                                 |    1 +
>  mm/internal.h                               |    4 +-
>  mm/mmap.c                                   |   49 +-
>  mm/mprotect.c                               |    6 +
>  mm/mremap.c                                 |   19 +-
>  mm/mseal.c                                  |  328 +++++
>  mm/nommu.c                                  |    6 +-
>  mm/util.c                                   |    8 +-
>  tools/testing/selftests/mm/Makefile         |    1 +
>  tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
>  37 files changed, 1934 insertions(+), 28 deletions(-)
>  create mode 100644 mm/mseal.c
>  create mode 100644 tools/testing/selftests/mm/mseal_test.c
>
> --
> 2.42.0.609.gbb76f46606-goog
>


* Re: [RFC PATCH v1 1/8] Add mseal syscall
  2023-10-16 15:05   ` Greg KH
@ 2023-10-17  6:50     ` Jeff Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff Xu @ 2023-10-17  6:50 UTC (permalink / raw)
  To: Greg KH
  Cc: jeffxu, akpm, keescook, sroettger, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, willy, mawupeng1, linmiaohe,
	namit, peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng,
	yu.ma, zhangpeng362, dave.hansen, luto, linux-hardening

On Mon, Oct 16, 2023 at 8:07 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:20PM +0000, jeffxu@chromium.org wrote:
> > +#ifdef CONFIG_MSEAL
> > +     /*
> > +      * bit masks for seal.
> > +      * need this since vm_flags is full.
> > +      */
> > +     unsigned long vm_seals;         /* seal flags, see mm.h. */
>
> "unsigned long" and yet:
>
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index c0cb22cd607d..f574c7dbee76 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >                       unsigned long prot, unsigned long pgoff,
> >                       unsigned long flags);
> > +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
> > +                       unsigned int flags);
>
> "unsigned int"?
>
> Why the mis-match?
>
Thanks. Fixed in V2.

> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -55,4 +55,10 @@ struct cachestat {
> >       __u64 nr_recently_evicted;
> >  };
> >
> > +#define MM_SEAL_MSEAL                0x1
> > +#define MM_SEAL_MPROTECT     0x2
> > +#define MM_SEAL_MUNMAP               0x4
> > +#define MM_SEAL_MMAP         0x8
> > +#define MM_SEAL_MREMAP               0x10
>
> I think we can use the BIT() macro in uapi .h files now, it is _BITUL().
> Might want to use it here too to make it obvious what is happening.
>
Sure. Will update in V2.

> thanks,
>
> greg k-h


* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 15:18 ` [RFC PATCH v1 0/8] Introduce mseal() syscall Matthew Wilcox
@ 2023-10-17  8:34   ` Jeff Xu
  2023-10-17 12:56     ` Matthew Wilcox
  2023-10-17 15:29   ` Pedro Falcato
  1 sibling, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-17  8:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: jeffxu, akpm, keescook, sroettger, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, mawupeng1, linmiaohe, namit,
	peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng, yu.ma,
	zhangpeng362, dave.hansen, luto, linux-hardening

Hi Matthew.

Thanks for your comments and time to review the patchset.

On Mon, Oct 16, 2023 at 8:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
>
> This seems like a confusing way to introduce the subject.  Here, you're
> talking about page permissions, whereas (as far as I can tell), mseal() is
> about making _virtual_ addresses immutable, for some value of immutable.
>
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> useful to understand what you're trying to do.
>
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> >
> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
>
> Mmm.  So when you say "continuous/allocated" what you really mean is
> "Must have contiguous VMAs" rather than "All pages in this range must
> be populated", yes?
>
There can't be a gap (unallocated memory) in the given range.
These cases are covered by the selftests:
test_seal_unmapped_start()
test_seal_unmapped_middle()
test_seal_unmapped_end()
The comments in check_mm_seal() also mention this.
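
For illustration, the gap case in a minimal userspace sketch (the
syscall number is hypothetical, MM_SEAL_MPROTECT as in this v1):

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <assert.h>

    #ifndef __NR_mseal
    #define __NR_mseal 462            /* hypothetical number for this RFC */
    #endif
    #define MM_SEAL_MPROTECT 0x2

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);

            /* map three pages, then punch a hole in the middle */
            char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED);
            munmap(p + page, page);

            /* the range has a gap: mseal() fails and no VMA is updated */
            assert(syscall(__NR_mseal, p, 3 * page, MM_SEAL_MPROTECT, 0) == -1);

            /* a fully mapped sub-range can still be sealed */
            assert(syscall(__NR_mseal, p, page, MM_SEAL_MPROTECT, 0) == 0);
            return 0;
    }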

> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> I don't understand why we want this level of granularity.  The OpenBSD
> and XNU examples just say "This must be immutable*".  For values of
> immutable that allow downgrading access (eg RW to RO or RX to RO),
> but not upgrading access (RW->RX, RO->*, RX->RW).
>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> Honestly, it feels too flexible.  Why not just two flags to mprotect()
> -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> maybe for some things we want to be able to downgrade and for other
> things, we don't.
>
Having a seal type per syscall type helps to add the feature incrementally.
Applications also know exactly what is sealed.

I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we
can define precisely what they seal. As Jann pointed out, there are
other scenarios that potentially affect IMMUTABLE. Implementing all of
those will take time, and if we miss a case, we could introduce backward
compatibility issues for applications. A bitmask solves this nicely, i.e.
an application has to apply a newly added sealing type explicitly.
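
As a sketch of the intended incremental use (names as in this v1):

    /* seal against permission changes and unmapping first */
    mseal(addr, len, MM_SEAL_MPROTECT | MM_SEAL_MUNMAP, 0);

    /* later, add more seal types; earlier seals are remembered by the
     * kernel and cannot be cleared, so they need not be repeated */
    mseal(addr, len, MM_SEAL_MMAP | MM_SEAL_MREMAP, 0);

    /* finally, deny any further mseal() on this range */
    mseal(addr, len, MM_SEAL_MSEAL, 0);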

> I'd like to see some discussion of how this interacts with mprotect().
> As far as I can tell, the intent is to lock the protections/existance
> of the mapping, and not to force memory to stay in core.  So it's fine
> for the kernel to swap out the page and set up a PTE as a swap entry.
> It's also fine for the kernel to mark PTEs as RO to catch page faults;
> we're concerned with the LOGICAL permissions, and not the page tables.

Yes. That is correct.

-Jeff Xu

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 17:34 ` Jann Horn
@ 2023-10-17  8:42   ` Jeff Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff Xu @ 2023-10-17  8:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: jeffxu, akpm, keescook, sroettger, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, willy, mawupeng1, linmiaohe,
	namit, peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng,
	yu.ma, zhangpeng362, dave.hansen, luto, linux-hardening

Hi Jann,

Thank you for reviewing the patch and for your comments.

On Mon, Oct 16, 2023 at 10:35 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote:
> >
> > From: Jeff Xu <jeffxu@google.com>
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> > The memory permission feature improves security stance on memory
> > corruption bugs, i.e. the attacker can’t just write to arbitrary memory
> > and point the code to it, the memory has to be marked with X bit, or
> > else an exception will happen.
> >
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
> >
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
>
> Is the plan that the VMAs you need to protect would be created and
> mseal()'ed while you expect that attacker code can not (yet) be
> running concurrently?
>
Yes.

> Do you expect to be using sealed memory for shadow stacks (in x86 CET
> / arm64 GCS) to prevent an attacker from mixing those up by moving
> pages inside a shadow stack or between different shadow stacks or
> such? (If that's even possible, I think it is but I haven't tried.)
>
I will defer this question to Stephen.
(I also think that is possible, if the application knows where the
shadow stack is at runtime.)

> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> You'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED;
> probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd
> want to just block all madvise() for sealed VMAs? Or rename
> process_madvise_behavior_valid() to something like
> "madvise_is_nondestructive()" and use that.
>
Thanks for the suggestion. I can add madvise() later, maybe in another series.
For now, this patchset covers the 4 syscalls that Chrome initially
wished for.

> The following comments probably mostly don't matter in practice if
> this feature is used in a context that is heavily seccomp-sandboxed
> (like Desktop Linux Chrome), but should maybe be addressed to make
> this feature more usable for other users. (Including Android Chrome,
> which has a weaker sandbox...)
>
> FWIW, it is also possible to write to read-only memory through the
> /proc/self/mem interface (or through ptrace commands like
> PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel
> configuration, seccomp policy, and what the LSMs do. (I think Android
> Chrome would allow /proc/self/mem writes, but would block
> PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)
>
> I had a related ancient patch series in 2016 with an attempt to allow
> SELinux to prevent bypassing W^X protections with this, but I never
> followed through with getting that landed...
> (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/)
>
> I guess the question there is what the right semantics for this kind
> of protected memory are when a debugger is active. The simple solution
> that might break some debugging would be "just deny all FOLL_FORCE
> write access for this memory" (which would prevent debuggers from
> being able to place breakpoints, which would maybe not be great). But
> maybe it makes more sense to consider this to be an independent
> concern and solve it with a new SELinux feature or something like that
> instead, and then document that mseal() requires some complement to
> prevent forced writes to read-only private memory? (For which the
> simplest solution would be "don't grant filesystem access or ptrace()
> access to the sandboxed code".)
>
> What is the intended interaction with userfaultfd, which I believe by
> design permits arbitrary data into unpopulated areas of anonymous
> VMAs? If the intention is that the process should otherwise be
> sandboxed to not have access to userfaultfd, that should maybe be
> documented. (Alternatively I guess you could theoretically remove the
> VM_MAYWRITE bit from marked VMAs, but that might be more strict than
> we want, since it'd also block all FOLL_FORCE writes.)
>
>
> There are also some interfaces like AIO or the X86 Shadow Stacks
> interface that indirectly unmap memory through the kernel and look
> like they could perhaps be tricked into racily unmapping a
> just-created sealed VMA. You'd probably have to make sure that they
> can't do that and essentially treat their unmap operations as if they
> came from userspace, I guess? What Linus just wrote.
>
>
> I think either way this feature needs some documentation on what kind
> of context it's supposed to run in.
>
Thanks for listing all the cases. As you pointed out, Chrome is
heavily sandboxed, in both syscalls and file access. I will work with
Stephen to check whether some of these apply to Chrome.

It is probably worth mentioning that with the current bitmask approach
of mseal(), it is possible to implement the cases above incrementally.

Thanks
-Jeff


>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > The kernel will remember which seal types are applied, and the application
> > doesn’t need to repeat all existing seal types in the next mseal().  Once
> > a seal type is applied, it can’t be unsealed. Call mseal() on an existing
> > seal type is a no-action, not a failure.
> >
> > MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
> >
> > Internally, vm_area_struct adds a new field vm_seals, to store the bit
> > masks.
> >
> > For the affected syscalls, such as mprotect, a check(can_modify_mm) for
> > sealing is added, this usually happens at the early point of the syscall,
> > before any update is made to VMAs. The effect of that is: if any of the
> > VMAs in the given address range fails the sealing check, none of the VMA
> > will be updated. It might be worth noting that this is different from the
> > rest of mprotect(), where some updates can happen even when mprotect
> > returns fail. Consider can_modify_mm only checks vm_seals in
> > vm_area_struct, and it is not going deeper in the page table or updating
> > any HW, success or none behavior might fit better here. I would like to
> > listen to the community's feedback on this.
> >
> > The idea that inspired this patch comes from Stephen Röttger’s work in
> > V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.
> >
> > In addition, Stephen is working on glibc change to add sealing support
> > into the dynamic linker to seal all non-writable segments at startup. When
> > that work is completed, all applications can automatically benefit from
> > these new protections.
> >
> > [1] https://kernelnewbies.org/Linux_2_6_8
> > [2] https://v8.dev/blog/control-flow-integrity
> > [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> > [4] https://man.openbsd.org/mimmutable.2
> > [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
> >
> > Jeff Xu (8):
> >   Add mseal syscall
> >   Wire up mseal syscall
> >   mseal: add can_modify_mm and can_modify_vma
> >   mseal: seal mprotect
> >   mseal munmap
> >   mseal mremap
> >   mseal mmap
> >   selftest mm/mseal mprotect/munmap/mremap/mmap
> >
> >  arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
> >  arch/arm/tools/syscall.tbl                  |    1 +
> >  arch/arm64/include/asm/unistd.h             |    2 +-
> >  arch/arm64/include/asm/unistd32.h           |    2 +
> >  arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
> >  arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
> >  arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
> >  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
> >  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
> >  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
> >  arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
> >  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
> >  arch/s390/kernel/syscalls/syscall.tbl       |    1 +
> >  arch/sh/kernel/syscalls/syscall.tbl         |    1 +
> >  arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
> >  arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
> >  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
> >  fs/aio.c                                    |    5 +-
> >  include/linux/mm.h                          |   55 +-
> >  include/linux/mm_types.h                    |    7 +
> >  include/linux/syscalls.h                    |    2 +
> >  include/uapi/asm-generic/unistd.h           |    5 +-
> >  include/uapi/linux/mman.h                   |    6 +
> >  ipc/shm.c                                   |    3 +-
> >  kernel/sys_ni.c                             |    1 +
> >  mm/Kconfig                                  |    8 +
> >  mm/Makefile                                 |    1 +
> >  mm/internal.h                               |    4 +-
> >  mm/mmap.c                                   |   49 +-
> >  mm/mprotect.c                               |    6 +
> >  mm/mremap.c                                 |   19 +-
> >  mm/mseal.c                                  |  328 +++++
> >  mm/nommu.c                                  |    6 +-
> >  mm/util.c                                   |    8 +-
> >  tools/testing/selftests/mm/Makefile         |    1 +
> >  tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
> >  37 files changed, 1934 insertions(+), 28 deletions(-)
> >  create mode 100644 mm/mseal.c
> >  create mode 100644 tools/testing/selftests/mm/mseal_test.c
> >
> > --
> > 2.42.0.609.gbb76f46606-goog
> >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 17:23 ` Linus Torvalds
@ 2023-10-17  9:07   ` Jeff Xu
  2023-10-17 17:22     ` Linus Torvalds
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-17  9:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jeffxu, akpm, keescook, sroettger, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, lstoakes, willy, mawupeng1, linmiaohe, namit,
	peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng, yu.ma,
	zhangpeng362, dave.hansen, luto, linux-hardening

Hello Linus,

Thank you for taking the time to review this patch set.

On Mon, Oct 16, 2023 at 10:23 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote:
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
>
> So I have no objections to adding some kind of "lock down memory
> mappings" model, but this isn't it.
>
> First off, the simple stuff: the commit messages are worthless. Having
>
>    check seal for mmap(2)
>
> as the commit message is not even remotely acceptable, to pick one
> random example from the series (7/8).
>
> But that doesn't matter much, since I think the more fundamental
> problems are much worse:
>
>  - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is
> just complete noise and totally illogical. The whole concept needs to
> be redone.
>
> Example of complete nonsense (again, picking 7/8, but that's again
> just a random example):
>
> > @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
> >                 flags |= MAP_LOCKED;
> >
> >         file = get_file(vma->vm_file);
> > -       ret = do_mmap(vma->vm_file, start, size,
> > -                       prot, flags, pgoff, &populate, NULL);
> > +       ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
> > +                       &populate, NULL, ON_BEHALF_OF_KERNEL);
> >         fput(file);
> >  out:
> >         mmap_write_unlock(mm);
>
> Christ. That's *literally* the remap_file_pages() system call
> definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense
> in this context.
>
> It's not the only situation. "mremap() as the same thing. vm_munmap()
> has the same thing. vm_brk_flags() has the same thing. None of them
> make any sense.
>
> But even if those obvious kinds of complete mis-designs were to be
> individually fixed, the whole naming and concept is bogus. The
> "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check
> sealing". Naming should describe what the thing *means*, not some
> random policy thing that may or may not be at all relevant.
>

I apologize that I didn't think of a better name for ON_BEHALF_OF_XX,
and I should have written a clearer commit message.

I prepared a V2 patchset with a more detailed commit message, hopefully
that will help to explain the design.

Indeed, the ON_BEHALF_OF_XX is confusing, especially with
remap_file_pages(). remap_file_pages(2) is not supported
in this patch set; it covers mprotect(2), mmap(2), munmap(2), and
mremap(2) as the first feature set. I could extend the sealing
to more syscalls, if it is determined necessary from the outcome of
this discussion. The initial set of 4 syscalls was chosen based
on Chrome's initial wish list.

Regarding the ON_BEHALF_OF flag, my intention is to have a flag,
set at the syscall entry points of mprotect(2), munmap(2), mremap(2),
and mmap(2), that is passed along the call stack until it reaches
can_modify_mm(), which does the actual check for the sealing type.

It is probably worth noting that I chose to check one and only
one sealing type per syscall, i.e. munmap(2) checks
MM_SEAL_MUNMAP only. With this approach, sealing can be
implemented incrementally. For example, when implementing
sealing for munmap(2), I don't need to care that mremap(2)
can also call internal functions to unmap the VMAs;
mremap(2) will be sealed by MM_SEAL_MREMAP, and
be dealt with separately.

This approach also allows developers to expand the sealing to madvise(),
mlock(), or whatever syscalls or cases modify a VMA's metadata.
As Jann points out, there is a list of cases that we might care about,
and having all of those implemented will take time. Using bitmasks will
help to add those incrementally. An application will be backward
compatible when a new sealing type is added, i.e. it has to set the
new sealing type explicitly.
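
Roughly, the pattern described above looks like this (a sketch of the
v1 approach; the signatures are approximate, not the literal patch):

    /*
     * sketch: each syscall entry checks exactly one seal type before any
     * VMA is touched; can_modify_mm() walks every VMA in the range and
     * fails the whole call if any VMA carries the corresponding seal bit.
     */
    SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
    {
            if (!can_modify_mm(current->mm, addr, addr + len,
                               MM_SEAL_MUNMAP))
                    return -EPERM;
            return __vm_munmap(addr, len, true);  /* unchanged unmap path */
    }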

>  - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.
>
> Not only is it unacceptable to pointlessly create two different name
> spaces for no obvious reason, code like this (from 1/8) should not
> exist:
>
> > +       if (types & MM_SEAL_MSEAL)
> > +               newtypes |= VM_SEAL_MSEAL;
> > +
> > +       if (types & MM_SEAL_MPROTECT)
> > +               newtypes |= VM_SEAL_MPROTECT;
> > +
> > +       if (types & MM_SEAL_MUNMAP)
> > +               newtypes |= VM_SEAL_MUNMAP;
> > +
> > +       if (types & MM_SEAL_MMAP)
> > +               newtypes |= VM_SEAL_MMAP;
> > +
> > +       if (types & MM_SEAL_MREMAP)
> > +               newtypes |= VM_SEAL_MREMAP;
>
> Sure, we have that in some cases when there was a *reason* for
> namespacing imposed on us from an API standpoint (ie the "open()"
> flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags
> being turned into MNT_xyz flags), but those tend to be signs of
> problems in the system call API where some mixup made it impossible to
> use the flags directly (most commonly because there is some extended
> internal representation of said things).
>
> For example, the MS_xyz namespace is a combination of "these are flags
> for the new mount" (like MS_RDONLY) and "this is how you should mount
> it" (like MS_REMOUNT), and so the user interface has that odd mixing
> of different things, and the MNT_xyz namespace is a distillation of
> the former.
>
> But we certainly should not strive to introduce *new* interfaces that
> start out with this kind of mixup and pointless "translate from one
> bit to another" code.
>
The two namespaces can go away; that means the bitmask will be stored
as-is in vm_seals in the VMA struct. (1/8) Copied below.

+++ b/include/linux/mm_types.h
@@ -660,6 +660,13 @@ struct vm_area_struct {
+       unsigned long vm_seals;         /* seal flags, see mm.h. */

My original considerations were:
1. vm_seals is a new field, and mseal() currently uses 5 bits.
vm_seals could be repurposed to store other VMA flags in the future,
so having two namespaces (the API one and an internal one) can be useful.
2. vm_flags is full, and it seems to me there is pending work on expanding
vm_flags [1]; if that happens, we will need translation logic.
[1] https://lore.kernel.org/linux-mm/4F6CA298.4000301@jp.fujitsu.com/

I might have over-engineered this, so I removed VM_SEAL_XX in V2.

>  - And finally (for now), I hate that MM_ACTION_xyz thing too!
>
> Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8):
>
> > +       switch (action) {
> > +       case MM_ACTION_MPROTECT:
> > +               if (vma->vm_seals & VM_SEAL_MPROTECT)
> > +                       return false;
> > +               break;
>
> when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and
> sealing bitmask and just say
>
>         if (vma->vm_seal & vm_action)
>                 return -EPERM;
>

Makes sense. My original thought was that can_modify_vma() would check
one and only one seal type, and having an enum type would enforce that.
This restriction feels unnecessary, so I removed the action type in V2.
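
With a single namespace, the per-VMA check collapses to something like
this (sketch):

    /* sketch: one shared MM_SEAL_* namespace, no translation layer */
    static bool can_modify_vma(struct vm_area_struct *vma, unsigned long seal)
    {
            /* seal is the MM_SEAL_* bit for the operation being attempted */
            return !(vma->vm_seals & seal);
    }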

> IOW, you have pointlessly introduced not *two* different namespaces,
> but *three*. All doing the exact same thing, and all just causing
> pointless and ugly code to "translate" the actions of one into the
> model of another.
>
> So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz"
> thing. Use *one* single "these are the sealing bits".
>
> And get rid of "enum caller_origin" entirely. I don't know what the
> right model for that thing is, but that isn't it.
>
> *Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space
> cannot set, but that the kernel can use internally - and if that is
> the right model, then dammit, the *uses* should be very very obvious
> why the override is fine, because that remap_file_pages() use sure as
> hell was *not* obvious.
>
> In fact, it's not at all obvious why anything should override the
> sealing bits - EVER. So I'm not convinced that the right model is
> "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we
> *ever* want to override sealing? It sounds like complete garbage. Most
> of the users seem to be things like "execve()", which is nonsensical,
> since the VM wouldn't have been sealed at that point _anyway_, so
> having a "don't bother checking sealing bits" flag seems entirely
> useless.
>
Would the new commit message and comments in V2 help to
explain the design better? (I will send it shortly.)

Another code change I can make to help readability (not in V2)
is to set and pass a checkSeals flag from the syscall entry point all
the way to can_modify_vma(). Currently, I didn't do that where I
verified that the internal function is only used by the syscall entry
point, e.g. in the do_mprotect_pkey(), mremap_to(), and
ksys_mmap_pgoff() cases.
Doing that does increase the size of the patch set, though.

Thanks.
-Jeff

> Anyway, this is all a resounding NAK on this series in this form. My
> complaints are not some kind of small "fix this up". These are
> fundamental issues.
>
>             Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17  8:34   ` Jeff Xu
@ 2023-10-17 12:56     ` Matthew Wilcox
  0 siblings, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2023-10-17 12:56 UTC (permalink / raw)
  To: Jeff Xu
  Cc: jeffxu, akpm, keescook, sroettger, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, torvalds, lstoakes, mawupeng1, linmiaohe, namit,
	peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng, yu.ma,
	zhangpeng362, dave.hansen, luto, linux-hardening

On Tue, Oct 17, 2023 at 01:34:35AM -0700, Jeff Xu wrote:
> > > types: bit mask to specify which syscall to seal, currently they are:
> > > MM_SEAL_MSEAL 0x1
> > > MM_SEAL_MPROTECT 0x2
> > > MM_SEAL_MUNMAP 0x4
> > > MM_SEAL_MMAP 0x8
> > > MM_SEAL_MREMAP 0x10
> >
> > I don't understand why we want this level of granularity.  The OpenBSD
> > and XNU examples just say "This must be immutable*".  For values of
> > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > but not upgrading access (RW->RX, RO->*, RX->RW).
> >
> > > Each bit represents sealing for one specific syscall type, e.g.
> > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > is that the API is extendable, i.e. when needed, the sealing can be
> > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > maybe for some things we want to be able to downgrade and for other
> > things, we don't.
> >
> Having a seal type per syscall type helps to add the feature incrementally.
> Applications also know exactly what is sealed.

You're trying to portray a disadvantage as an advantage,  This is the
seccomp disease, only worse because you're trying to deny individual
syscalls instead of building up a list of permitted syscalls.  If we
introduce a new syscall tomorrow which can affect VMAs, we'll then
make it the application's fault for not disabling that new syscall.
That's terrible design!

> I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we
> can define what it seals precisely. As Jann pointed out, there have
> other scenarios that potentially affect IMMUTABLE. Implementing all thoses
> will take time. And if we missed a case, we could introduce backward
> compatibility issues to the application. Bitmask will solve this nicely, i.e.
> application will need to apply the newly added sealing type explicitly.

It is your job to do this.  You seem to have taken the attitude that you
will give the Chrome team exactly what they asked for instead of trying
to understand their requirements and give them what they need.

And don't send a v2 before discussion of v1 has come to an end.  It
uselessly fragments the discussion.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-16 15:18 ` [RFC PATCH v1 0/8] Introduce mseal() syscall Matthew Wilcox
  2023-10-17  8:34   ` Jeff Xu
@ 2023-10-17 15:29   ` Pedro Falcato
  2023-10-17 21:33     ` Jeff Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Pedro Falcato @ 2023-10-17 15:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: jeffxu, akpm, keescook, sroettger, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
>
> This seems like a confusing way to introduce the subject.  Here, you're
> talking about page permissions, whereas (as far as I can tell), mseal() is
> about making _virtual_ addresses immutable, for some value of immutable.
>
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> useful to understand what you're trying to do.
>
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> >
> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
>
> Mmm.  So when you say "continuous/allocated" what you really mean is
> "Must have contiguous VMAs" rather than "All pages in this range must
> be populated", yes?
>
> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> I don't understand why we want this level of granularity.  The OpenBSD
> and XNU examples just say "This must be immutable*".  For values of
> immutable that allow downgrading access (eg RW to RO or RX to RO),
> but not upgrading access (RW->RX, RO->*, RX->RW).
>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> Honestly, it feels too flexible.  Why not just two flags to mprotect()
> -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> maybe for some things we want to be able to downgrade and for other
> things, we don't.

I think it's worth pointing out that this suggestion (with PROT_*)
could easily integrate with mmap() and as such allow for one-shot
mmap() + mseal().
If we consider the common case as 'addr = mmap(...); mseal(addr);', it
definitely sounds like a performance win as we halve the number of
syscalls for a sealed mapping. And if we trivially look at e.g. OpenBSD
ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
like common patterns.
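
For illustration, the one-shot pattern could look like this (a sketch;
PROT_IMMUTABLE is a hypothetical flag, not an existing one):

    /* today: two syscalls to obtain a sealed mapping */
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mimmutable(p, len);

    /* with a hypothetical PROT_IMMUTABLE bit: one syscall, sealed from
     * the start */
    void *q = mmap(NULL, len, PROT_READ | PROT_IMMUTABLE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);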

-- 
Pedro

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17  9:07   ` Jeff Xu
@ 2023-10-17 17:22     ` Linus Torvalds
  2023-10-17 18:20       ` Theo de Raadt
  0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2023-10-17 17:22 UTC (permalink / raw)
  To: Jeff Xu
  Cc: jeffxu, akpm, keescook, sroettger, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, jannh, surenb, alex.sierra, apopple,
	aneesh.kumar, axelrasmussen, ben, catalin.marinas, david, dwmw,
	ying.huang, hughd, joey.gouly, corbet, wangkefeng.wang,
	Liam.Howlett, lstoakes, willy, mawupeng1, linmiaohe, namit,
	peterx, peterz, ryan.roberts, shr, vbabka, xiujianfeng, yu.ma,
	zhangpeng362, dave.hansen, luto, linux-hardening

On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
>
> It is probably worth noting that I choose to check one and only
> one sealing type per syscall. i.e. munmap(2) checks
> MM_SEAL_MUNMAP only.

Yeah, this is wrong.

It's wrong exactly because other system calls will unmap things too.

Using mmap() to over-map something will unmap the old one.

Same goes for mremap() to move over an existing mapping.

So the whole "do things by the name of the system call" is not workable.

All that matters is what the system calls *do*, not what their name is.

And mmap() will fundamentally munmap() as part of the action.

This is why I absolutely hated the old "ON_BEHALF_OF_xyz" flag, and
why I still absolutely hate the "randomly pass different sealing flags
fto do_munmap()".

You should *not* be passing any flags at all to do_munmap(). Because
*regardless* of who calls it, and regardless of which system call
started the action, do_munmap() unmaps a virtual memory area.

See what I'm saying?

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 17:22     ` Linus Torvalds
@ 2023-10-17 18:20       ` Theo de Raadt
  2023-10-17 18:38         ` Linus Torvalds
  2023-10-17 23:01         ` Jeff Xu
  0 siblings, 2 replies; 43+ messages in thread
From: Theo de Raadt @ 2023-10-17 18:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, jeffxu, akpm, keescook, sroettger, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
> >
> > It is probably worth noting that I choose to check one and only
> > one sealing type per syscall. i.e. munmap(2) checks
> > MM_SEAL_MUNMAP only.
> 
> Yeah, this is wrong.
> 
> It's wrong exactly because other system calls will unmap things too.
> 
> Using mmap() to over-map something will unmap the old one.
> 
> Same goes for mremap() to move over an existing mapping.
> 
> So the whole "do things by the name of the system call" is not workable.
> 
> All that matters is what the system calls *do*, not what their name is.

I agree completely...

mseal() is a clone of mimmutable(2), but with an extremely
over-complicated API based upon dubious arguments.

I designed mimmutable(2) [1] in OpenBSD; it took about a year to get all
the components working correctly.  There were many intermediate APIs
during development, but in the end the API is simply:

     int mimmutable(void *addr, size_t len);

The kernel code for mimmutable() traverses the specified VA range.  In
that range, it will find unmapped sub-regions (which are ignored)
and mapped sub-regions. For these mapped regions, it does not care what
the permissions are; it just marks each sub-region as immutable.

Later on, when any VM operation request upon a VA range attempts to
      (1) change the permissions
      (2) to re-map on top
      (3) or dispose of the mapping,
that operation is refused with errno EPERM.  We don't care where the
request comes from (ie. what system call).  It is a behaviour of the
VM system, when asked to act upon a VA sub-range mapping.

Very simple semantics.

The only case where the immutable marker is ignored is during address space
teardown as a result of process termination.
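
A typical sequence against this API looks roughly like the following
(a sketch; apply_relocations() is a hypothetical stand-in for ld.so's
relocation work):

    /* relocate while writable, then lock the segment down for good */
    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    apply_relocations(seg, len);
    mprotect(seg, len, PROT_READ);
    mimmutable(seg, len);   /* any later change now fails with EPERM */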


In his submission of this API, Jeff Xu makes three claims I find dubious:

> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.

I specifically designed mimmutable(2) with chrome in mind, and the
chrome binary running on OpenBSD is full of immutable mappings.  All the
library regions automatically become immutable because ld.so can infer
it and do the mimmutable calls for the right subregions.

So this chrome work has already been done by OpenBSD, and it is dead
simple.  During early development I thought mimmutable(2) would be
called by applications or libraries, but I was dead wrong: 99.9% of
calls are from ld.so, and no applications need to call it, these are the
two exceptions:

In OpenBSD, mimmutable() is used in libc malloc() to lock down some data
structures at initialization time, so they cannot be attacked to create
an invariant for use in ROP return-to-libc style methods.

In Chrome, there is a v8_flags variable rounded out to a full page, and
placed in .data.  Chrome initializes this variable and then wants to
mprotect it PROT_READ, but .data has been made immutable by ld.so.  So
we force this page into a new ELF section called "openbsd.mutable",
which also behaves RW like .data.  Where chrome does the mprotect
PROT_READ, it now also performs mimmutable() on that page.
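
In code, the pattern is roughly this (a sketch; the array name and the
hard-coded page size are illustrative):

    /* keep the page out of the sealed .data by placing it in the
     * special mutable section */
    static char v8_flags[4096]
            __attribute__((aligned(4096), section(".openbsd.mutable")));

    /* after initialization, chrome locks the page down itself */
    mprotect(v8_flags, sizeof(v8_flags), PROT_READ);
    mimmutable(v8_flags, sizeof(v8_flags));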

> Having a seal type per syscall type helps to add the feature incrementally.

Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
entire base operating system and 10,000+ applications automatically receive
the benefits.  In one year's effort.  The only application which cared about
it was chrome, described in the previous paragraph.

I think Jeff's idea here is super dangerous.  What will actually happen
is people will add a few mseal() sub-operations and think the job is done.
It isn't done.  They need all the mseal() requests, or the mappings are
not safe.

It is very counterproductive to provide developers a complex API that has
insecure suboperations.

> Applications also know exactly what is sealed.

Actually applications won't know, because there is no tooling to inspect
this -- but I will argue further that applications don't need to know.
Immutable marking is a system decision, not a program decision.


I'll close by asking for a new look at the mimmutable(2) API we settled
on for OpenBSD.  I think there is nothing wrong with it.  I'm willing to
help guide the glibc / ld.so / musl teams through the problems they may
find along the way; I know where the skeletons are buried.  Two in
particular: -znow RELRO already today, and xonly-text in the future.


[1] https://man.openbsd.org/mimmutable.2


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 18:20       ` Theo de Raadt
@ 2023-10-17 18:38         ` Linus Torvalds
  2023-10-17 18:55           ` Theo de Raadt
  2023-10-19  8:00           ` Stephen Röttger
  2023-10-17 23:01         ` Jeff Xu
  1 sibling, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2023-10-17 18:38 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Jeff Xu, jeffxu, akpm, keescook, sroettger, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> The only case where the immutable marker is ignored is during address space
> teardown as a result of process termination.

.. and presumably also execve()?

I do like us starting with just "mimmutable()", since it already
exists. Particularly if chrome already knows how to use it.

Maybe add a flag field (require it to be zero initially) just to allow
any future expansion. Maybe the chrome team has *wanted* to have some
finer granularity thing and currently doesn't use mimmutable() in some
case?

                  Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 18:38         ` Linus Torvalds
@ 2023-10-17 18:55           ` Theo de Raadt
  2023-10-19  8:00           ` Stephen Röttger
  1 sibling, 0 replies; 43+ messages in thread
From: Theo de Raadt @ 2023-10-17 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, jeffxu, akpm, keescook, sroettger, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > The only case where the immutable marker is ignored is during address space
> > teardown as a result of process termination.
> 
> .. and presumably also execve()?

Ah yes of course that also.

> I do like us starting with just "mimmutable()", since it already
> exists. Particularly if chrome already knows how to use it.

Well, our chrome fork knows how to use it.  Robert Nagy in our group maintains
1280 patches to make chrome work on OpenBSD.  Google ignores them and will not
upstream them.  Some of these changes are security related, and they ignore
them.  Other changes are to cope with security work we've done on our own,
for example: JIT changes from Stephen@google for mandatory IBT which google
hasn't upstreamed yet, impacts due to PROT_EXEC-only mappings, etc.

But the only chrome diff required for mimmutable is for that v8_flags thing
I described.   And the same issue would need handling for mseal().  Introducing
the new "mutable" ELF section is probably going to be a bigger fuss than the
system call after mprotect(PROT_READ)....

> Maybe add a flag field (require it to be zero initially) just to allow
> any future expansion. Maybe the chrome team has *wanted* to have some
> finer granularity thing and currently doesn't use mimmutable() in some
> case?

There's only one feature I can think of, and we already do it right now,
documented in our manual page:

CAVEATS
     At present, mprotect(2) may reduce permissions on immutable pages marked
     PROT_READ | PROT_WRITE to the less permissive PROT_READ.  This one-way
     operation is permitted for an introductory period to observe how software
     uses this mechanism.  It may change to require explicit mutable region
     annotation with __attribute__((section(".openbsd.mutable"))) and explicit
     calls to mimmutable().

We had something which needed this behaviour during the development
transition.  It is exclusively mprotect RW -> R; no other transitions
are allowed.

But now that the transition is done, we don't need it anymore.  I want
to delete it, because it is a bit of a trap.  It still fails closed from
an attack perspective, but...

What worries me is that a piece of code reached by mistake can do a
mprotect(lowering), not receive -1 EPERM, and carry on running.  I'd
prefer that the first time you touch a mapping in the wrong way, you
receive an indication of error.  This only applies to mprotect() acting
upon a region, so the argument is a bit weak, due to mprotect() return
value checking being about as rare as unicorns.

Also, it would be a pain for OpenBSD to transition to adding a 0 flag.
I would need to see a real cause, not theory.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 15:29   ` Pedro Falcato
@ 2023-10-17 21:33     ` Jeff Xu
  2023-10-17 22:35       ` Pedro Falcato
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-17 21:33 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Matthew Wilcox, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> >
> > This seems like a confusing way to introduce the subject.  Here, you're
> > talking about page permissions, whereas (as far as I can tell), mseal() is
> > about making _virtual_ addresses immutable, for some value of immutable.
> >
> > > Memory sealing additionally protects the mapping itself against
> > > modifications. This is useful to mitigate memory corruption issues where
> > > a corrupted pointer is passed to a memory management syscall. For example,
> > > such an attacker primitive can break control-flow integrity guarantees
> > > since read-only memory that is supposed to be trusted can become writable
> > > or .text pages can get remapped. Memory sealing can automatically be
> > > applied by the runtime loader to seal .text and .rodata pages and
> > > applications can additionally seal security critical data at runtime.
> > > A similar feature already exists in the XNU kernel with the
> > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > > patchset has been designed to be compatible with the Chrome use case.
> >
> > This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> > useful to understand what you're trying to do.
> >
> > > The new mseal() is an architecture independent syscall, and with
> > > following signature:
> > >
> > > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> > >
> > > addr/len: memory range.  Must be continuous/allocated memory, or else
> > > mseal() will fail and no VMA is updated. For details on acceptable
> > > arguments, please refer to comments in mseal.c. Those are also fully
> > > covered by the selftest.
> >
> > Mmm.  So when you say "continuous/allocated" what you really mean is
> > "Must have contiguous VMAs" rather than "All pages in this range must
> > be populated", yes?
> >
> > > types: bit mask to specify which syscall to seal, currently they are:
> > > MM_SEAL_MSEAL 0x1
> > > MM_SEAL_MPROTECT 0x2
> > > MM_SEAL_MUNMAP 0x4
> > > MM_SEAL_MMAP 0x8
> > > MM_SEAL_MREMAP 0x10
> >
> > I don't understand why we want this level of granularity.  The OpenBSD
> > and XNU examples just say "This must be immutable*".  For values of
> > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > but not upgrading access (RW->RX, RO->*, RX->RW).
> >
> > > Each bit represents sealing for one specific syscall type, e.g.
> > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > is that the API is extendable, i.e. when needed, the sealing can be
> > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > maybe for some things we want to be able to downgrade and for other
> > things, we don't.
>
> I think it's worth pointing out that this suggestion (with PROT_*)
> could easily integrate with mmap() and as such allow for one-shot
> mmap() + mseal().
> If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> definitely sounds like a performance win as we halve the number of
> syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> like common patterns.
>
Yes. mmap() can support sealing as well, so that memory is allocated as
immutable from the beginning.
This is orthogonal to mseal(), though.
In the case of ld.so, iiuc, memory can first be allocated as W, then
later changed to RO, for example during symbol resolution.
The important point is that the application can decide what type of
sealing it wants, and when to apply it.  There needs to be an API; it
can be mseal() or mprotect2() or mimmutable(), the naming is not
important to me.

mprotect() in Linux has the following signature:
int mprotect(void addr[.len], size_t len, int prot);
and the prot bitmasks are all taken here.
I have not checked the prot field in mmap(); there might be bits left,
and even if not, we could have an mmap2(), so that is not an issue.

Thanks
-Jeff

> --
> Pedro

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 21:33     ` Jeff Xu
@ 2023-10-17 22:35       ` Pedro Falcato
  2023-10-18 18:20         ` Jeff Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Pedro Falcato @ 2023-10-17 22:35 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Matthew Wilcox, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, Oct 17, 2023 at 10:34 PM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > > > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > > > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> > >
> > > This seems like a confusing way to introduce the subject.  Here, you're
> > > talking about page permissions, whereas (as far as I can tell), mseal() is
> > > about making _virtual_ addresses immutable, for some value of immutable.
> > >
> > > > Memory sealing additionally protects the mapping itself against
> > > > modifications. This is useful to mitigate memory corruption issues where
> > > > a corrupted pointer is passed to a memory management syscall. For example,
> > > > such an attacker primitive can break control-flow integrity guarantees
> > > > since read-only memory that is supposed to be trusted can become writable
> > > > or .text pages can get remapped. Memory sealing can automatically be
> > > > applied by the runtime loader to seal .text and .rodata pages and
> > > > applications can additionally seal security critical data at runtime.
> > > > A similar feature already exists in the XNU kernel with the
> > > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > > > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > > > patchset has been designed to be compatible with the Chrome use case.
> > >
> > > This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> > > useful to understand what you're trying to do.
> > >
> > > > The new mseal() is an architecture independent syscall, and with
> > > > following signature:
> > > >
> > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> > > >
> > > > addr/len: memory range.  Must be continuous/allocated memory, or else
> > > > mseal() will fail and no VMA is updated. For details on acceptable
> > > > arguments, please refer to comments in mseal.c. Those are also fully
> > > > covered by the selftest.
> > >
> > > Mmm.  So when you say "continuous/allocated" what you really mean is
> > > "Must have contiguous VMAs" rather than "All pages in this range must
> > > be populated", yes?
> > >
> > > > types: bit mask to specify which syscall to seal, currently they are:
> > > > MM_SEAL_MSEAL 0x1
> > > > MM_SEAL_MPROTECT 0x2
> > > > MM_SEAL_MUNMAP 0x4
> > > > MM_SEAL_MMAP 0x8
> > > > MM_SEAL_MREMAP 0x10
> > >
> > > I don't understand why we want this level of granularity.  The OpenBSD
> > > and XNU examples just say "This must be immutable*".  For values of
> > > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > > but not upgrading access (RW->RX, RO->*, RX->RW).
> > >
> > > > Each bit represents sealing for one specific syscall type, e.g.
> > > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > > is that the API is extendable, i.e. when needed, the sealing can be
> > > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> > >
> > > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > > maybe for some things we want to be able to downgrade and for other
> > > things, we don't.
> >
> > I think it's worth pointing out that this suggestion (with PROT_*)
> > could easily integrate with mmap() and as such allow for one-shot
> > mmap() + mseal().
> > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > definitely sounds like a performance win as we halve the number of
> > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > like common patterns.
> >
> Yes. mmap() can support sealing as well, and memory is allocated as
> immutable from begining.
> This is orthogonal to mseal() though.

I don't see how this can be orthogonal to mseal().
If we opt for adding PROT_ bits, we should more or less only
need to adapt calc_vm_prot_bits(), and the rest should work without
issues. VMA merging won't merge VMAs with different prots, and the
current interfaces (mmap and mprotect) would work just fine.
In this case, mseal() or mimmutable() would only be needed if you need
to set immutability over a range of VMAs with different permissions.
to set immutability over a range of VMAs with different permissions.

Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
The only annoying wrench in my plans here is that we have effectively
run out of vm_flags bits on 32-bit architectures, so this approach as
I described it is not compatible with 32-bit.
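
For reference, the change sketched above would land roughly here (a
sketch; PROT_IMMUTABLE and VM_IMMUTABLE are hypothetical names, the
rest mirrors the existing helper in include/linux/mman.h):

    static inline unsigned long
    calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
    {
            return _calc_vm_trans(prot, PROT_READ,      VM_READ ) |
                   _calc_vm_trans(prot, PROT_WRITE,     VM_WRITE) |
                   _calc_vm_trans(prot, PROT_EXEC,      VM_EXEC ) |
                   /* hypothetical: the new PROT_ bit maps to a new VM_ flag */
                   _calc_vm_trans(prot, PROT_IMMUTABLE, VM_IMMUTABLE) |
                   arch_calc_vm_prot_bits(prot, pkey);
    }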

> In case of ld.so, iiuc, memory can be first allocated as W, then later
> changed to RO, for example, during symbol resolution.
> The important point is that the application can decide what type of
> sealing it wants, and when to apply it.  There needs to be an api(),
> that can be mseal() or mprotect2() or mimmutable(), the naming is not
> important to me.
>
> mprotect() in linux has the following signature:
> int mprotect(void addr[.len], size_t len, int prot);
> the prot bitmasks are all taken here.
> I have not checked the prot field in mmap(); there might be bits left,
> and even if not, we could have mmap2(), so that is not an issue.

I don't see what you mean. We have plenty of prot bits left (32-bits,
and we seem to have around 8 different bits used).
And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)

The only issue seems to be that 32-bit ran out of vm_flags, but that
can probably be worked around if need be.

-- 
Pedro

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 18:20       ` Theo de Raadt
  2023-10-17 18:38         ` Linus Torvalds
@ 2023-10-17 23:01         ` Jeff Xu
  2023-10-17 23:56           ` Theo de Raadt
  1 sibling, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-17 23:01 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Linus Torvalds, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, Oct 17, 2023 at 11:20 AM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
> > >
> > > It is probably worth noting that I choose to check one and only
> > > one sealing type per syscall. i.e. munmap(2) checks
> > > MM_SEAL_MUNMAP only.
> >
> > Yeah, this is wrong.
> >
> > It's wrong exactly because other system calls will unmap things too.
> >
> > Using mmap() to over-map something will unmap the old one.
> >
> > Same goes for mremap() to move over an existing mapping.
> >
> > So the whole "do things by the name of the system call" is not workable.
> >
> > All that matters is what the system calls *do*, not what their name is.
>
> I agree completely...
>
> mseal() is a clone of mimmutable(2), but with an extremely
> over-complicated API based upon dubious arguments.
>
> I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all
> the components working correctly.  There were many intermediate APIs
> during development, but in the end the API is simply:
>
>      int mimmutable(void *addr, size_t len);
>
> The kernel code for mimmutable() traverses the specified VA range.  In
> that range, it will find unmapped sub-regions (which are ignored)
> and mapped sub-regions. For these mapped regions, it does not care what
> the permissions are, it just marks each sub-region as immutable.
>
> Later on, when any VM operation request upon a VA range attempts to
>       (1) change the permissions
>       (2) to re-map on top
>       (3) or dispose of the mapping,
> that operation is refused with errno EPERM.  We don't care where the
> request comes from (ie. what system call).  It is a behaviour of the
> VM system, when asked to act upon a VA sub-range mapping.
>
> Very simple semantics.
>
> The only case where the immutable marker is ignored is during address space
> teardown as a result of process termination.
>
May I ask, for BSD's implementation of mimmutable(), do you cover
things such as mlock() and madvise()? Or just the protection bits
(RWX) + mremap() + munmap()?

In other words:
Is BSD's definition of immutable equivalent to
MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?

I hesitate to introduce the concept of immutable into linux because I don't know
all the scenarios present in linux where VMAs' metadata can be
modified. As Jann's email pointed out,
there could be quite a few things we still need to deal with to
completely block the possibility,
e.g. malicious code attempting to write to RO memory or change RW
memory to RWX.

If, as part of immutable, I also block madvise() and mlock(), which also
update the VMA's metadata, then by definition I could.  What if the user
wants the features in madvise() and, at the same time, also wants their
.text protected?

Also, if linux introduces a new syscall that depends on new VMA metadata,
say msecret() (for discussion purposes), should immutable
automatically support that?

Without those questions answered, I couldn't choose the route of
immutable() yet.

-Jeff



>
> In his submission of this API, Jeff Xu makes three claims I find dubious;
>
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> I specifically designed mimmutable(2) with chrome in mind, and the
> chrome binary running on OpenBSD is full of immutable mappings.  All the
> library regions automatically become immutable because ld.so can infer
> it and do the mimmutable calls for the right subregions.
>
> So this chrome work has already been done by OpenBSD, and it is dead
> simple.  During early development I thought mimmutable(2) would be
> called by applications or libraries, but I was dead wrong: 99.9% of
> calls are from ld.so, and no applications need to call it, these are the
> two exceptions:
>
> In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data
> structures at initialization time, so they cannot be attacked to create
> an invariant for use in ROP return-to-libc style methods.
>
> In Chrome, there is a v8_flags variable rounded out to a full page, and
> placed in .data.  Chrome initializes this variable, and wants to mprotect it
> PROT_READ, but .data has been made immutable by ld.so.  So we force this
> page into a new ELF section called "openbsd.mutable" which also behaves RW
> like .data.  When chrome does the mprotect PROT_READ, it now also performs
> mimmutable() on that page.
>
> > Having a seal type per syscall type helps to add the feature incrementally.
>
> Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
> entire base operating system and 10,000+ applications automatically receive
> the benefits.  In one year's effort.  The only application which cared about
> it was chrome, described in the previous paragraph.
>
> I think Jeff's idea here is super dangerous.  What will actually happen
> is people will add a few mseal() sub-operations and think the job is done.
> It isn't done.  They need all the mseal() requests, or the mappings are
> not safe.
>
> It is very counterproductive to provide developers a complex API that has
> insecure suboperations.
>
> > Applications also know exactly what is sealed.
>
> Actually applications won't know, because there is no tooling to inspect this --
> but I will argue further that applications don't need to know.  Immutable
> marking is a system decision, not a program decision.
>
>
> I'll close by asking for a new look at the mimmutable(2) API we settled
> on for OpenBSD.  I think there is nothing wrong with it.  I'm willing to
> help guide glibc / ld.so / musl teams through the problems they may find
> along the way, I know where the skeletons are buried.  Two in
> particular: -znow RELRO already today, and xonly-text in the future.
>
>
> [1] https://man.openbsd.org/mimmutable.2
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 23:01         ` Jeff Xu
@ 2023-10-17 23:56           ` Theo de Raadt
  2023-10-18  3:18             ` Jeff Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Theo de Raadt @ 2023-10-17 23:56 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Jeff Xu <jeffxu@google.com> wrote:

> May I ask, for BSD's implementation of mimmutable(), do you cover
> things such as mlock() and madvise()? Or just the protection bits
> (RWX) + mremap() + munmap()?

It only prevents removal of the mapping, placement of a replacement
mapping, or changing the existing permissions.  If one page in the
existing sub-region is marked immutable, the whole operation fails with
EPERM.

Those are the only user-visible aspects that an attacker cares about to
utilize in this area.

mlock() and madvise() deal with the physical memory handling underneath
the VA.  They have nothing to do with how attack code might manipulate
the VA address space inside a program to convert a series of dead-end
approaches into a successful escalation strategy.

[It would be a very long conversation to explain where and how this has
been utilized to make an attack successful]

> In other words:
> Is BSD's definition of immutable equivalent to
> MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?

I can't compare it to your subsystem, because I completely fail to
understand the cause or benefit of all the complexity.

I think I've explained what mimmutable() is in extremely simple terms.

And I don't understand why you are trying to do anything beyond what
mimmutable() offers.  It seems like this is inventing additional
solutions without proof that any of them are necessary to solve the
specific problem that is known.

> I hesitate to introduce the concept of immutable into linux because I don't know
> all the scenarios present in linux where VMAs's metadata can be
> modified.

Good grief.  It seems obvious if you want to lock the change-behaviour
of an object (the object in this case being a VA sub-region, there is a
datastructure for that, in OpenBSD it is called an "entry"), then you
put a flag in that object's data-structure and you simply check the flag
every time a change-operation is attempted.  It is a flag which gets set,
and checked.  Nothing ever clears it (except address space teardown).

This flag must be put on the data structure that manages VA sub-ranges.

In our case when a prot/mapping operation reaches low-level code that
will want to change an "entry", we notice it is not allowed and simply
percolate EPERM up through the layers.
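
In rough C, the shape of it (a sketch only, not OpenBSD's actual code or
data structures):

    #include <errno.h>

    struct entry {                  /* one mapped VA sub-range */
            unsigned long start, end;
            int prot;
            int immutable;          /* set once, never cleared */
    };

    /* every change-operation checks this before touching the entry */
    static int entry_change_allowed(const struct entry *e)
    {
            return e->immutable ? EPERM : 0;   /* EPERM percolates up */
    }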

> There could be quite a few things we still need to deal with, to
> completely block the possibility,
> e.g. malicious code attempting to write to a RO memory

What?!  writes to RO memory are blocked by the permission bits.

> or change RW memory to RWX.

In our case that is blocked by W^X policy.

But if the region is marked mimmutable, then that's another reason you cannot
change RW to RWX.  It seems so off-topic, to talk about writes to RO memory.
I get a feeling you are a bit lost.

mimmutable() is not about permissions, but about locking permissions.
- You can't change the permissions of the address space region.
- You cannot map a replacement object at the location instead (especially
  with different permission).
- You cannot unmap at that location (which you would do if you wanted to
  map a new object, with a different permission).

All 3 of these scenarios are identical.  No regular code performs these 3
operations on regions of the address space which we mark immutable.

There is nothing more to mimmutable in the VM layer.  The hard work is
writing code in execve() and ld.so which will decide which objects can
be marked immutable automatically, so that programs don't do this to
themselves.

I'm aware of where this simple piece fits in.  It does not solve all
problems, it is a very narrow change to impact a problem which only
high-value targets will ever face (like chrome).

But I think you don't understand the purpose of this mechanism.

> If, as part of immutable, I also block madvise() and mlock(), which also
> update the VMA's metadata, then by definition I could.  What if the user
> wants the features in madvise() and, at the same time, also wants their
> .text protected?

I have no idea what you are talking about.  None of those things relate
to the access permission of the memory the user sees, and therefore none
of them are in the attack surface profile which is being prevented.

Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because
those relate to the physical storage and not the VA permission model.

> Also, if linux introduces a new syscall that depends on new VMA metadata,
> say msecret() (for discussion purposes), should immutable
> automatically support that?

How about the future makingexcuses() system call?

I don't think you understand the problem space well enough to come up with
your own solution for it.  I spent a year on this, and ship a complete system
using it.  You are asking such simplistic questions above that it shocks me.

Maybe read the LWN article;

    https://lwn.net/Articles/915640/

> Without those questions answered, I couldn't choose the route of
> immutable() yet.

"... so I can clearly not choose the wine in front of you."

If you don't understand what this thing is for, and cannot minimize the
complexity of this thing, then Linux doesn't need it at all.

I should warn everyone the hard work is not in the VM layer, but in
ld.so -- deciding which parts of the image to make immutable, and when.
It is also possible to make some segments immutable directly in execve()
-- but in both cases you better have a really good grasp on RELRO
executable layout or you will make too many pieces immutable...

I am pretty sure Linux will never get as far as we got. Even our main
stacks are marked immutable, but in Linux that would conflict with glibc
ld.so mprotecting the stack RWX if you dlopen() a shared library with
GNUSTACK, a very bad idea which needs a different fight...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 23:56           ` Theo de Raadt
@ 2023-10-18  3:18             ` Jeff Xu
  2023-10-18  3:37               ` Theo de Raadt
  2023-10-18 15:17               ` Matthew Wilcox
  0 siblings, 2 replies; 43+ messages in thread
From: Jeff Xu @ 2023-10-18  3:18 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Linus Torvalds, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, Oct 17, 2023 at 4:57 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@google.com> wrote:
>
> > May I ask, for BSD's implementation of mimmutable(), do you cover
> > things such as mlock() and madvise()? Or just the protection bits
> > (RWX) + mremap() + munmap()?
>
> It only prevents removal of the mapping, placement of a replacement
> mapping, or changing the existing permissions.  If one page in the
> existing sub-region is marked immutable, the whole operation fails with
> EPERM.
>
> Those are the only user-visible aspects that an attacker cares about to
> utilize in this area.
>
> mlock() and madvise() deal with the physical memory handling underneath
> the VA.  They have nothing to do with how attack code might manipulate
> the VA address space inside a program to convert a series of dead-end
> approaches into a successful escalation strategy.
>
> [It would be a very long conversation to explain where and how this has
> been utilized to make an attack successful]
>
> > In other words:
> > Is BSD's definition of immutable equivalent to
> > MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?
>
> I can't compare it to your subsystem, because I completely fail to
> understand the cause or benefit of all the complexity.
>
> I think I've explained what mimmutable() is in extremely simple terms.
>

Thanks for the explanation; based on that, this is exactly what the
current set of patches does.
In practice, libc could do the below:

    #define MM_IMMUTABLE \
            (MM_SEAL_MPROTECT | MM_SEAL_MUNMAP | MM_SEAL_MREMAP | MM_SEAL_MMAP)

    mseal(addr, len, MM_IMMUTABLE, 0);

and it will be equivalent to BSD's mimmutable().

> And I don't understand why you are trying to do anything beyond what
> mimmutable() offers.  It seems like this is inventing additional
> solutions without proof that any of them are necessary to solve the
> specific problem that is known.
>
> > I hesitate to introduce the concept of immutable into linux because I don't know
> > all the scenarios present in linux where VMAs' metadata can be
> > modified.
>
> Good grief.  It seems obvious if you want to lock the change-behaviour
> of an object (the object in this case being a VA sub-region, there is a
> datastructure for that, in OpenBSD it is called an "entry"), then you
> put a flag in that object's data-structure and you simply check the flag
> every time a change-operation is attempted.  It is a flag which gets set,
> and checked.  Nothing ever clears it (except address space teardown).
>
> This flag must be put on the data structure that manages VA sub-ranges.
>
> In our case when a prot/mapping operation reaches low-level code that
> will want to change an "entry", we notice it is not allowed and simply
> percolate EPERM up through the layers.
>
> > There could be quite a few things we still need to deal with, to
> > completely block the possibility,
> > e.g. malicious code attempting to write to a RO memory
>
> What?!  writes to RO memory are blocked by the permission bits.
>
> > or change RW memory to RWX.
>
> In our case that is blocked by W^X policy.
>
> But if the region is marked mimmutable, then that's another reason you cannot
> change RW to RWX.  It seems so off-topic, to talk about writes to RO memory.
> I get a feeling you are a bit lost.
>
> mimmutable() is not about permissions, but about locking permissions.
> - You can't change the permissions of the address space region.
> - You cannot map a replacement object at the location instead (especially
>   with different permission).
> - You cannot unmap at that location (which you would do if you wanted to
>   map a new object, with a different permission).
>
> All 3 of these scenarios are identical.  No regular code performs these 3
> operations on regions of the address space which we mark immutable.
>
> There is nothing more to mimmutable in the VM layer.  The hard work is
> writing code in execve() and ld.so which will decide which objects can
> be marked immutable automatically, so that programs don't do this to
> themselves.
>
> I'm aware of where this simple piece fits in.  It does not solve all
> problems, it is a very narrow change to impact a problem which only
> high-value targets will ever face (like chrome).
>
> But I think you don't understand the purpose of this mechanism.
>

In the linux case, I think, eventually, mseal() will have a bigger scope than
BSD's mimmutable().  The VMA's metadata (vm_area_struct) contains a lot
of control info; depending on an application's needs, mseal() can be
expanded to seal individual pieces of control info.

For example, in the madvise(2) case,
as Jann pointed out in [1], and I quote:
"you'd probably also want to block destructive madvise() operations
that can effectively alter region contents by discarding pages and
such, ..."

Another example: if an application wants to keep memory always
present in RAM, for whatever reason, it can seal mlock().

To handle those two new cases, mseal() could add two more bits:
MM_SEAL_MADVISE and MM_SEAL_MLOCK.

It is practical to keep the syscall extensible when the business logic is the same.
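
As a sketch of how the cumulative bitmask would be used (MM_SEAL_MADVISE
and MM_SEAL_MLOCK are only proposed values here, not in this patchset):

    mseal(addr, len, MM_SEAL_MPROTECT, 0);  /* seal mprotect first        */
    mseal(addr, len, MM_SEAL_MADVISE, 0);   /* later call adds a seal;    */
                                            /* MM_SEAL_MPROTECT stays set */
    /* kernel side, per VMA: vma->vm_seals |= types; nothing clears it */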

I think I explained the logic of using bitmasks in the mseal()
interface clearly with the example of madvise() and mlock().

-Jeff


[1] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/


> > If, as part of immutable, I also block madvice(), mlock(), which also updates
> > VMA's metadata, so by definition, I could.  What if the user wants the
> > features in
> > madvice() and at the same time, also wants their .text protected ?
>
> I have no idea what you are talking about.  None of those things relate
> to the access permission of the memory the user sees, and therefore none
> of them are in the attack surface profile which is being prevented.
>
> Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because
> those relate to the physical storage and not the VA permission model.
>
> > Also, if linux introduces a new syscall that depends on a new metadata of VMA,
> > say msecret(), (for discussion purpose), should immutable
> > automatically support that ?
>
> How about the future makingexcuses() system call?
>
> I don't think you understand the problem space well enough to come up with
> your own solution for it.  I spent a year on this, and ship a complete system
> using it.  You are asking such simplistic questions above that it shocks me.
>
> Maybe read the LWN article;
>
>     https://lwn.net/Articles/915640/
>
> > Without those questions answered, I couldn't choose the route of
> > immutable() yet.
>
> "... so I can clearly not choose the wine in front of you."
>
> If you don't understand what this thing is for, and cannot minimize the
> complexity of this thing, then Linux doesn't need it at all.
>
> I should warn everyone the hard work is not in the VM layer, but in
> ld.so -- deciding which parts of the image to make immutable, and when.
> It is also possible to make some segments immutable directly in execve()
> -- but in both cases you better have a really good grasp on RELRO
> executable layout or you will make too many pieces immutable...
>
> I am pretty sure Linux will never get as far as we got. Even our main
> stacks are marked immutable, but in Linux that would conflict with glibc
> ld.so mprotecting RWX the stack if you dlopen() a shared library with
> GNUSTACK, a very bad idea which needs a different fight...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-18  3:18             ` Jeff Xu
@ 2023-10-18  3:37               ` Theo de Raadt
  2023-10-18 15:17               ` Matthew Wilcox
  1 sibling, 0 replies; 43+ messages in thread
From: Theo de Raadt @ 2023-10-18  3:37 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Jeff Xu <jeffxu@google.com> wrote:

> In linux cases, I think, eventually, mseal() will have a bigger scope than
> BSD's mimmutable().

I don't believe that, considering no one needed this behaviour from the VM
system in the last 4 decades.

> VMA's metadata(vm_area_struct) contains a lot
> of control info, depending on application's needs, mseal() can be
> expanded to seal individual control info.

> For example, in the madvise(2) case,
> as Jann pointed out in [1], and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."

Then prohibit madvise(MADV_FREE) on all non-writeable mappings that are
immutable.  Just include this in the set of behaviours. Or make it the
default.

Don't make it an option that a program needs to set on pages!  No one
is going to call it.  Most programs don't know the addresses of the
*REGIONS* they would want to do this for.

Does your program know where libc's text segment starts and ends?
No, your program does not know these addresses, so the parts of the
'system' which do know this need to do it (which would be ld.so or
the libc init constructors).

If madvise(MADV_FREE) is so dangerous... say you have a program that
would call through abort(), but you know a zero there can make the
abort not abort but return, then is it bad to let the attacker do:

   madvise(&abort, pagesize, MADV_FREE)

If that is bad, then block it in a smart way!  Don't make a programmer
of an application figure out how to do this.  That results in a defense
methodology where a handful of programs self-protect, but everything
else is unprotected or unprotectable.  That is shortsighted.

> Another example: if an application wants to keep memory always
> present in RAM, for whatever reason, it can seal mlock().

Please explain the attack surface.

> I think I explained the logic of using bitmasks in the mseal()
> interface clearly with the example of madvise() and mlock().

It is clear as mud.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-18  3:18             ` Jeff Xu
  2023-10-18  3:37               ` Theo de Raadt
@ 2023-10-18 15:17               ` Matthew Wilcox
  2023-10-18 18:54                 ` Jeff Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2023-10-18 15:17 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Theo de Raadt, Linus Torvalds, jeffxu, akpm, keescook, sroettger,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm, jannh,
	surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, mawupeng1,
	linmiaohe, namit, peterx, peterz, ryan.roberts, shr, vbabka,
	xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, Oct 17, 2023 at 08:18:47PM -0700, Jeff Xu wrote:
> In practice, libc could do the below:
>
>     #define MM_IMMUTABLE \
>             (MM_SEAL_MPROTECT | MM_SEAL_MUNMAP | MM_SEAL_MREMAP | MM_SEAL_MMAP)
>
>     mseal(addr, len, MM_IMMUTABLE, 0);
>
> and it will be equivalent to BSD's mimmutable().

No, it wouldn't, because you've carefully listed the syscalls you're
blocking instead of understanding the _concept_ of what you need to
block.

> In linux cases, I think, eventually, mseal() will have a bigger scope than
> BSD's mimmutable().  VMA's metadata(vm_area_struct) contains a lot
> of control info, depending on application's needs, mseal() can be
> expanded to seal individual control info.
> 
> For example, in the madvise(2) case:
> As Jann point out in [1] and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."
> 
> Another example: if an application wants to keep memory always
> present in RAM, for whatever reason, it can seal mlock().
> 
> To handle those two new cases, mseal() could add two more bits:
> MM_SEAL_MADVISE and MM_SEAL_MLOCK.

Yes, thank you for demonstrating that you have no idea what you need to
block.

> It is practical to keep the syscall extensible when the business logic is the same.

I concur with Theo & Linus.  You don't know what you're doing.  I think
the underlying idea of mimmutable() is good, but how you've split it up
and how you've implemented it is terrible.

Let's start with the purpose.  The point of mimmutable/mseal/whatever is
to fix the mapping of an address range to its underlying object, be it
a particular file mapping or anonymous memory.  After the call succeeds,
it must not be possible to make any address in that virtual range point
into any other object.

The secondary purpose is to lock down permissions on that range.
Possibly to fix them where they are, possibly to allow RW->RO transitions.

With those purposes in mind, you should be able to deduce for any syscall
or any madvise(), ... whether it should be allowed.

Look, I appreciate this is only your second set of patches to Linux and
you've taken on a big job.  But that's all the more reason you should
listen to people who are trying to help you.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 22:35       ` Pedro Falcato
@ 2023-10-18 18:20         ` Jeff Xu
  2023-10-19 17:30           ` Jeff Xu
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-18 18:20 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Matthew Wilcox, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> > >
> > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > could easily integrate with mmap() and as such allow for one-shot
> > > mmap() + mseal().
> > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > definitely sounds like a performance win as we halve the number of
> > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > like common patterns.
> > >
> > Yes. mmap() can support sealing as well, and memory is allocated as
> > immutable from the beginning.
> > This is orthogonal to mseal() though.
>
> I don't see how this can be orthogonal to mseal().
> In the case we opt for adding PROT_ bits, we should more or less only
> need to adapt calc_vm_prot_bits(), and the rest should work without
> issues.
> vma merging won't merge vmas with different prots. The current
> interfaces (mmap and mprotect) would work just fine.
> In this case, mseal() or mimmutable() would only be needed if you need
> to set immutability over a range of VMAs with different permissions.
>
Agreed. By orthogonal, I meant we can have two APIs:
mmap() and mseal()/mprotect(),
i.e. we can't just rely on mmap() alone, without mseal()/mprotect()/mimmutable().
Sealing can be applied after initial memory creation.

> Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> The only annoying wrench in my plans here is that we have effectively
> run out of vm_flags bits in 32-bit architectures, so this approach as
> I described is not compatible with 32-bit.
>
> > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > changed to RO, for example, during symbol resolution.
> > The important point is that the application can decide what type of
> > sealing it wants, and when to apply it.  There needs to be an api(),
> > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > important to me.
> >
> > mprotect() in linux has the following signature:
> > int mprotect(void addr[.len], size_t len, int prot);
> > the prot bitmasks are all taken here.
> > I have not checked the prot field in mmap(); there might be bits left,
> > and even if not, we could have mmap2(), so that is not an issue.
>
> I don't see what you mean. We have plenty of prot bits left (32-bits,
> and we seem to have around 8 different bits used).
> And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
>
> The only issue seems to be that 32-bit ran out of vm_flags, but that
> can probably be worked around if need be.
>
Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
Apologies that I was wrong previously and caused confusion.

There is a slight difference in the syntax of mprotect and mseal.
Each time mprotect() is called, the kernel takes all of the RWX bits
and updates vm_flags.
In other words, the application sets/unsets each of RWX, and the kernel takes it.

In the mseal() case, the kernel will remember which seal types were
applied previously, and the application doesn’t need to repeat all
existing seal types in the next mseal().  Once a seal type is applied,
it can’t be unsealed.

So if we want to use mprotect() for sealing, developers need to think
of the sealing bits differently than the rest of the prot bits. It is a
different programming model, which might or might not be an obvious concept
to developers.

There is a difference in input checking and error handling as well.
For mseal(), if a given address range has a gap (unallocated memory),
or if one of the VMAs is sealed with the MM_SEAL_SEAL flag, none of the
VMAs is updated.
For mprotect(), some VMAs can be updated, until an error happens on a VMA.

IMO: I think it makes sense for mseal() and mprotect() to be different
in this. mseal() only checks the vma struct, so it is fast, and mprotect()
goes deeper into the HW.
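
A sketch of that all-or-nothing shape (both helper names below are made
up for illustration; the real checks live in mseal.c):

    /* hypothetical helpers, for illustration only */
    bool range_is_sealable(unsigned long start, unsigned long end);
    void apply_seal_bits(unsigned long start, unsigned long end,
                         unsigned int types);

    static int mseal_range(unsigned long start, unsigned long end,
                           unsigned int types)
    {
            /* pass 1: walk every VMA in [start, end); a gap, or a VMA
             * already sealed with MM_SEAL_SEAL, fails the whole call
             * and nothing is updated */
            if (!range_is_sealable(start, end))
                    return -EPERM;

            /* pass 2: only now OR the new seal bits into each VMA */
            apply_seal_bits(start, end, types);
            return 0;
    }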

Because of those two differences, personally I feel a new syscall
might be worth the effort.

That said, mmap() + mprotect() is workable to me if that is what the
community wants.

Thanks
-Jeff



> --
> Pedro

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-18 15:17               ` Matthew Wilcox
@ 2023-10-18 18:54                 ` Jeff Xu
  2023-10-18 20:36                   ` Theo de Raadt
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-18 18:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theo de Raadt, Linus Torvalds, jeffxu, akpm, keescook, sroettger,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm, jannh,
	surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, mawupeng1,
	linmiaohe, namit, peterx, peterz, ryan.roberts, shr, vbabka,
	xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Let's start with the purpose.  The point of mimmutable/mseal/whatever is
> to fix the mapping of an address range to its underlying object, be it
> a particular file mapping or anonymous memory.  After the call succeeds,
> it must not be possible to make any address in that virtual range point
> into any other object.
>
> The secondary purpose is to lock down permissions on that range.
> Possibly to fix them where they are, possibly to allow RW->RO transitions.
>
> With those purposes in mind, you should be able to deduce for any syscall
> or any madvise(), ... whether it should be allowed.
>
I got it.

IMO: The approaches mimmutable() and mseal() took are different, but
we all want to seal the memory from attackers and make linux
applications safer.

mimmutable() started with "no updates to the sealed address are
allowed once it is marked as immutable"; this includes updates from within
kernel space, including drivers, or any new syscalls. It is reasonable to me.

mseal() starts with 4 syscalls from userspace, which is just a way (among many
other ways) to demo what memory operations can be sealed, and which happens
to meet what Chrome wants.  This is an RFC, and I appreciate your input.

Best regards,
-Jeff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-18 18:54                 ` Jeff Xu
@ 2023-10-18 20:36                   ` Theo de Raadt
  2023-10-19  8:28                     ` Stephen Röttger
  0 siblings, 1 reply; 43+ messages in thread
From: Theo de Raadt @ 2023-10-18 20:36 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Matthew Wilcox, Linus Torvalds, jeffxu, akpm, keescook,
	sroettger, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, jannh, surenb, alex.sierra, apopple, aneesh.kumar,
	axelrasmussen, ben, catalin.marinas, david, dwmw, ying.huang,
	hughd, joey.gouly, corbet, wangkefeng.wang, Liam.Howlett,
	lstoakes, mawupeng1, linmiaohe, namit, peterx, peterz,
	ryan.roberts, shr, vbabka, xiujianfeng, yu.ma, zhangpeng362,
	dave.hansen, luto, linux-hardening

Jeff Xu <jeffxu@google.com> wrote:

> On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Let's start with the purpose.  The point of mimmutable/mseal/whatever is
> > to fix the mapping of an address range to its underlying object, be it
> > a particular file mapping or anonymous memory.  After the call succeeds,
> > it must not be possible to make any address in that virtual range point
> > into any other object.
> >
> > The secondary purpose is to lock down permissions on that range.
> > Possibly to fix them where they are, possibly to allow RW->RO transitions.
> >
> > With those purposes in mind, you should be able to deduce for any syscall
> > or any madvise(), ... whether it should be allowed.
> >
> I got it.
> 
> IMO: The approaches mimmutable() and mseal() took are different, but
> we all want to seal the memory from attackers and make the linux
> application safer.

I think you are building mseal for chrome, and chrome alone.

I do not think this will work out for the rest of the application space
because

1) it is too complicated
2) experience with mimmutable() says that applications don't do any of it
   themselves, it is all in execve(), libc initialization, and ld.so.
   You don't strike me as an execve, libc, or ld.so developer.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-17 18:38         ` Linus Torvalds
  2023-10-17 18:55           ` Theo de Raadt
@ 2023-10-19  8:00           ` Stephen Röttger
  2023-10-20 16:27             ` Theo de Raadt
  1 sibling, 1 reply; 43+ messages in thread
From: Stephen Röttger @ 2023-10-19  8:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theo de Raadt, Jeff Xu, jeffxu, akpm, keescook, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

> I do like us starting with just "mimmutable()", since it already
> exists. Particularly if chrome already knows how to use it.
>
> Maybe add a flag field (require it to be zero initially) just to allow
> any future expansion. Maybe the chrome team has *wanted* to have some
> finer granularity thing and currently doesn't use mimmutable() in some
> case?

Yes, we do have a use case in Chrome to split the sealing into unmap and
mprotect, which will allow us to seal additional pages that we can't seal
with pure mimmutable().
For example, we have pkey-tagged RWX memory that we want to seal. Since
the memory is already RWX and the pkey controls write access, we don't care
about permission changes, but sometimes we do need to mprotect data-only
pages.
But the munmap sealing will provide protection against implicit changes of the
pkey in this case which would happen if a page gets unmapped and another
mapped in its place.
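
A minimal sketch of that use case (the pkey calls are existing Linux
APIs; mseal() and MM_SEAL_MUNMAP are this RFC's proposal, so mseal() is
declared by hand; error handling omitted):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    int mseal(void *addr, size_t len, unsigned int types, unsigned int flags);

    static void *sealed_pkey_rwx(size_t len)
    {
            int pkey = pkey_alloc(0, 0);
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* writes are gated by the pkey, not by the page permissions */
            pkey_mprotect(p, len, PROT_READ | PROT_WRITE | PROT_EXEC, pkey);

            /* seal only unmapping: the range can no longer be unmapped
             * and re-mapped, which would implicitly change the pkey */
            mseal(p, len, MM_SEAL_MUNMAP, 0);
            return p;
    }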


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-18 20:36                   ` Theo de Raadt
@ 2023-10-19  8:28                     ` Stephen Röttger
  2023-10-20 15:55                       ` Theo de Raadt
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Röttger @ 2023-10-19  8:28 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Jeff Xu, Matthew Wilcox, Linus Torvalds, jeffxu, akpm, keescook,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm, jannh,
	surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, mawupeng1,
	linmiaohe, namit, peterx, peterz, ryan.roberts, shr, vbabka,
	xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

> > IMO: The approaches mimmutable() and mseal() took are different, but
> > we all want to seal the memory from attackers and make the linux
> > application safer.
>
> I think you are building mseal for chrome, and chrome alone.
>
> I do not think this will work out for the rest of the application space
> because
>
> 1) it is too complicated
> 2) experience with mimmutable() says that applications don't do any of it
>    themselves, it is all in execve(), libc initialization, and ld.so.
>    You don't strike me as an execve, libc, or ld.so developer.

We do want to build this in a way that it can be applied automatically by
ld.so, and we appreciate all your feedback on this. The intention of
splitting the sealing by syscall was to provide flexibility while still
allowing ld.so to seal all operations.
But it's clear from the feedback that both the fine-grained split and the
per-syscall approach are not the right way to go.
Does Linus' proposal to just split munmap / mprotect sealing address your
complexity concerns? ld.so would always use both flags, which should then
behave similarly to mimmutable().


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-18 18:20         ` Jeff Xu
@ 2023-10-19 17:30           ` Jeff Xu
  2023-10-19 22:47             ` Pedro Falcato
  0 siblings, 1 reply; 43+ messages in thread
From: Jeff Xu @ 2023-10-19 17:30 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Matthew Wilcox, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Hi Pedro

Some followup on mmap() + mprotect():

On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > > >
> > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > could easily integrate with mmap() and as such allow for one-shot
> > > > mmap() + mseal().
> > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > definitely sounds like a performance win as we halve the number of
> > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > like common patterns.
> > > >
> > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > immutable from the beginning.
> > > This is orthogonal to mseal() though.
> >
> > I don't see how this can be orthogonal to mseal().
> > In the case we opt for adding PROT_ bits, we should more or less only
> > need to adapt calc_vm_prot_bits(), and the rest should work without
> > issues.
> > vma merging won't merge vmas with different prots. The current
> > interfaces (mmap and mprotect) would work just fine.
> > In this case, mseal() or mimmutable() would only be needed if you need
> > to set immutability over a range of VMAs with different permissions.
> >
> Agreed. By orthogonal, I meant we can have two APIs:
> mmap() and mseal()/mprotect()
> i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> Sealing can be applied after initial memory creation.
>
> > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > The only annoying wrench in my plans here is that we have effectively
> > run out of vm_flags bits in 32-bit architectures, so this approach as
> > I described is not compatible with 32-bit.
> >
> > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > changed to RO, for example, during symbol resolution.
> > > The important point is that the application can decide what type of
> > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > important to me.
> > >
> > > mprotect() in linux has the following signature:
> > > int mprotect(void addr[.len], size_t len, int prot);
> > > the prot bitmasks are all taken here.
> > > I have not checked the prot field in mmap(); there might be bits left,
> > > and even if not, we could have mmap2(), so that is not an issue.
> >
> > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > and we seem to have around 8 different bits used).
> > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> >
> > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > can probably be worked around if need be.
> >
> Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> Apology that I was wrong previously and caused confusion.
>
> There is a slight difference in the syntax of mprotect and mseal.
> Each time when mprotect() is called, the kernel takes all of RWX bits
> and updates vm_flags,
> In other words, the application sets/unset each RWX, and kernel takes it.
>
> In the mseal() case, the kernel will remember which seal types were
> applied previously, and the application doesn’t need to repeat all
> existing seal types in the next mseal().  Once a seal type is applied,
> it can’t be unsealed.
>
> So if we want to use mprotect() for sealing, developers need to think
> of sealing bits differently than the rest of prot bits. It is a
> different programming model, might or might not be an obvious concept
> to developers.
>
This probably doesn't matter much to developers.
We can enforce the sealing bit to be the same as the rest of PROT bits.
If mprotect() tries to unset sealing, it will fail.
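
Roughly like this in the mprotect() path (VM_SEALED is a hypothetical
vm_flag here, for illustration):

    /* once VM_SEALED is set on a VMA, an mprotect() that tries to
     * clear it fails instead of silently unsealing */
    if ((vma->vm_flags & VM_SEALED) && !(newflags & VM_SEALED))
            return -EPERM;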

> There is a difference in input check and error handling as well.
> for mseal(), if a given address range has a gap (unallocated memory),
> or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> updated.
> For mprotect(), some VMAs can be updated, till an error happens to a VMA.
>
This difference doesn't matter much.

For mprotect()/mmap(), is the Linux implementation limited by POSIX?
This can be made backward compatible.
If there is no objection to adding linux-specific values in mmap() and
mprotect(), this works for me.

Thanks for your input.
-Jeff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-19 17:30           ` Jeff Xu
@ 2023-10-19 22:47             ` Pedro Falcato
  2023-10-19 23:06               ` Linus Torvalds
  2023-10-23 17:42               ` Jeff Xu
  0 siblings, 2 replies; 43+ messages in thread
From: Pedro Falcato @ 2023-10-19 22:47 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Matthew Wilcox, jeffxu, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote:
>
> Hi Pedro
>
> Some followup on mmap() + mprotect():
>
> On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
> >
> > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > > >
> > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > mmap() + mseal().
> > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > definitely sounds like a performance win as we halve the number of
> > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > like common patterns.
> > > > >
> > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > immutable from the beginning.
> > > > This is orthogonal to mseal() though.
> > >
> > > I don't see how this can be orthogonal to mseal().
> > > In the case we opt for adding PROT_ bits, we should more or less only
> > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > issues.
> > > vma merging won't merge vmas with different prots. The current
> > > interfaces (mmap and mprotect) would work just fine.
> > > In this case, mseal() or mimmutable() would only be needed if you need
> > > to set immutability over a range of VMAs with different permissions.
> > >
> > Agreed. By orthogonal, I meant we can have two APIs:
> > mmap() and mseal()/mprotect()
> > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > Sealing can be applied after initial memory creation.
> >
> > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > The only annoying wrench in my plans here is that we have effectively
> > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > I described is not compatible with 32-bit.
> > >
> > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > changed to RO, for example, during symbol resolution.
> > > > The important point is that the application can decide what type of
> > > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > important to me.
> > > >
> > > > mprotect() in linux has the following signature:
> > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > the prot bitmasks are all taken here.
> > > > I have not checked the prot field in mmap(); there might be bits left,
> > > > and even if not, we could have mmap2(), so that is not an issue.
> > >
> > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > and we seem to have around 8 different bits used).
> > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > >
> > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > can probably be worked around if need be.
> > >
> > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > Apology that I was wrong previously and caused confusion.
> >
> > There is a slight difference in the syntax of mprotect and mseal.
> > Each time when mprotect() is called, the kernel takes all of RWX bits
> > and updates vm_flags,
> > In other words, the application sets/unset each RWX, and kernel takes it.
> >
> > In the mseal() case, the kernel will remember which seal types were
> > applied previously, and the application doesn’t need to repeat all
> > existing seal types in the next mseal().  Once a seal type is applied,
> > it can’t be unsealed.
> >
> > So if we want to use mprotect() for sealing, developers need to think
> > of sealing bits differently than the rest of prot bits. It is a
> > different programming model, might or might not be an obvious concept
> > to developers.
> >
> This probably doesn't matter much to developers.
> We can enforce the sealing bit to be the same as the rest of PROT bits.
> If mprotect() tries to unset sealing, it will fail.

Yep. Erroneous or malicious mprotects would all be caught. However, if
we add a PROT_DOWNGRADEABLE (that could let you, let's say, mprotect()
to fewer permissions or even downright munmap()), you'd want some care
to preserve that bit when setting permissions.
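
I.e., something like this (both PROT_ flags hypothetical):

    mprotect(p, len, PROT_READ | PROT_WRITE | PROT_DOWNGRADEABLE);
    /* later, RW -> RO: the bit has to be carried along or it is lost */
    mprotect(p, len, PROT_READ | PROT_DOWNGRADEABLE);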

>
> > There is a difference in input check and error handling as well.
> > for mseal(), if a given address range has a gap (unallocated memory),
> > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> > updated.
> > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> >
> This difference doesn't matter much.
>
> For mprotect()/mmap(), is the Linux implementation limited by POSIX?

No. POSIX works merely as a baseline that UNIX systems aim towards.
You can (and very frequently do) extend POSIX interfaces (in fact,
it's how most of POSIX was written, through sheer
"design-by-committee" on a bunch of UNIX systems' extensions).

> This can be made backward compatible.
> If there is no objection to adding linux specific values in mmap() and
> mprotect(),
> This works for me.

Linux already has system-specific values for PROT_ (PROT_BTI,
PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
Whether this is the right interface is another question. I do like it
a lot, but there's of course value in being compatible with existing
solutions (like mimmutable()).

-- 
Pedro

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-19 22:47             ` Pedro Falcato
@ 2023-10-19 23:06               ` Linus Torvalds
  2023-10-23 17:44                 ` Jeff Xu
  2023-10-23 17:42               ` Jeff Xu
  1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2023-10-19 23:06 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Jeff Xu, Matthew Wilcox, jeffxu, akpm, keescook, sroettger,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm, jannh,
	surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, mawupeng1,
	linmiaohe, namit, peterx, peterz, ryan.roberts, shr, vbabka,
	xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > For mprotect()/mmap(), is the Linux implementation limited by POSIX?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).

We can in extreme circumstances actually go further than that, and not
only extend on POSIX requirements, but actively even violate them.

It does need a very good reason, though, but it has happened when
POSIX requirements were simply actively wrong.

For example, at one point POSIX required

     int accept(int s, struct sockaddr *addr, size_t *addrlen);

which was simply completely wrong. It's utter shite, and didn't
actually match any reality.

The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
it "size_t" was completely unacceptable.

So we ignored it, told POSIX people that they were full of sh*t, and
they eventually fixed it in the next version (by introducing a
"socklen_t" that had better be the same as "int").

So POSIX can simply be wrong.

Also, if it turns out that we had some ABI that wasn't
POSIX-compatible, the whole "don't break user space" will take
precedence over any POSIX concerns, and we will not "fix" our system
call if people already use our old semantics.

So in that case, we generally end up with a new system call (or new
flags) instead.

Or sometimes it just is such a small detail that nobody cares - POSIX
also has a notion of documenting areas of non-conformance, and people
who really care end up having notions like "conformance vs _strict_
conformance".

                 Linus

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-19  8:28                     ` Stephen Röttger
@ 2023-10-20 15:55                       ` Theo de Raadt
  0 siblings, 0 replies; 43+ messages in thread
From: Theo de Raadt @ 2023-10-20 15:55 UTC (permalink / raw)
  To: Stephen Röttger
  Cc: Jeff Xu, Matthew Wilcox, Linus Torvalds, jeffxu, akpm, keescook,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm, jannh,
	surenb, alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, mawupeng1,
	linmiaohe, namit, peterx, peterz, ryan.roberts, shr, vbabka,
	xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Stephen Röttger <sroettger@google.com> wrote:

> > > IMO: The approaches mimmutable() and mseal() took are different, but
> > > we all want to seal the memory from attackers and make the linux
> > > application safer.
> >
> > I think you are building mseal for chrome, and chrome alone.
> >
> > I do not think this will work out for the rest of the application space
> > because
> >
> > 1) it is too complicated
> > 2) experience with mimmutable() says that applications don't do any of it
> >    themselves, it is all in execve(), libc initialization, and ld.so.
> >    You don't strike me as an execve, libc, or ld.so developer.
> 
> We do want to build this in a way that it can be applied automatically by ld.so
> and we appreciate all your feedback on this.

Hi Stephen,

I am pretty sure your mechanism will be usable by ld.so.

What bothers me is that the complex many-bits approach may encourage
people to set only a subset of the bits, and then believe they have a
security primitive.

Partial sealing is not safe.  I define partial sealing as "blocking munmap,
but not mprotect".  Or "blocking mprotect, but not madvise or mmap".

In Message-id <ZS/3GCKvNn5qzhC4@casper.infradead.org> Matthew stated
that there are two aspects being locked: which object is mapped, and the
permission of that mapping.  When the additional system calls msync() and
madvise() are included in the picture, there are 3 actions being prevented:

 - Can someone replace the object
 - Can someone change the permission
 - Can someone throw away the cached pages, reverting to original
   content of the object (that is the madvise / msync)

In Message-id: <CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com>
Jan reminded us of this piece.  I'm taking this as a long-standing security
hole in some sub-operations of msync/madvise, which can write to data regions
that aren't actually writeable.  Sub-operations with this problem are MADV_FREE,
MADV_DONTNEED, POSIX_MADV_DONTNEED, MS_INVALIDATE, on Linux MADV_WIPEONFORK,
and probably a whole bunch of others.  I am testing OpenBSD changes which
demand PROT_WRITE permission for these sub-operations.  Perhaps some systems
are already careful.
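
The anonymous-memory case is easy to demonstrate from userland.  A
minimal sketch -- no special privileges, and no PROT_WRITE held at the
moment the contents change:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4096;

            /* Build "trusted" read-only data the usual way: map RW,
             * initialize, then drop the write permission. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            strcpy(p, "trusted");
            mprotect(p, len, PROT_READ);

            /* No write permission remains, yet the contents still change:
             * MADV_DONTNEED discards the private pages, and the next read
             * faults in fresh zero pages. */
            madvise(p, len, MADV_DONTNEED);
            printf("after MADV_DONTNEED: \"%s\"\n", p);  /* prints "" */
            return 0;
    }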

If you leave any of these operators available, the object is not actually sealed
against abuse.  I believe an attacker will simply switch to a different operator
(mmap, munmap, mprotect, madvise, msync) to achieve a similar objective of
damaging the permission or contents.

Since mseal() is designed to create partial sealings, the name of the proposed
system call really smells.

> The intention of splitting the sealing by syscall was to provide
> flexibility while still allowing ld.so to seal all operations.

Yes, you will have ld.so set all the bits, and the same in C runtime
initialization.  If you convince glibc to stop making the stack executable
in dlopen(), the kernel could automatically do it.  With Linux backwards
compat management, getting there would be an extremely long roadmap.  But
anyway, the idea would be "set all the bits", because otherwise the object
or data isn't safe.

> Does Linus' proposal to just split munmap / mprotect sealing address your
> complexity concerns? ld.so would always use both flags, which should then
> behave similarly to mimmutable().

No, I think it is weak, because it isn't sealed.

A separate mail in the thread from you says this is about chrome wanting
to use PKU on RWX objects.  I think that's the reason for wanting to
separate the sealing (I haven't heard of other applications wanting that).
How about we explore that in the other subthread.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-19  8:00           ` Stephen Röttger
@ 2023-10-20 16:27             ` Theo de Raadt
  2023-10-24 10:42               ` Stephen Röttger
  0 siblings, 1 reply; 43+ messages in thread
From: Theo de Raadt @ 2023-10-20 16:27 UTC (permalink / raw)
  To: Stephen Röttger
  Cc: Linus Torvalds, Jeff Xu, jeffxu, akpm, keescook, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

Stephen Röttger <sroettger@google.com> wrote:

> > I do like us starting with just "mimmutable()", since it already
> > exists. Particularly if chrome already knows how to use it.
> >
> > Maybe add a flag field (require it to be zero initially) just to allow
> > any future expansion. Maybe the chrome team has *wanted* to have some
> > finer granularity thing and currently doesn't use mimmutable() in some
> > case?
> 
> Yes, we do have a use case in Chrome to split the sealing into unmap and
> mprotect which will allow us to seal additional pages that we can't seal with
> pure mimmutable().

> For example, we have pkey-tagged RWX memory that we want to seal. Since
> the memory is already RWX and the pkey controls write access, we don't care
> about permission changes but sometimes we do need to mprotect data only
> pages.

Let me try to decompose this statement.

This is clearly for the JIT.  You can pivot a JIT-generated code mapping
between RW and RX (or X-only); the object alternates between W and X to
satisfy a W^X policy for safety.

I think you are talking about an RWX MAP_ANON object.

Then you use pkey_alloc() to get a PKEY.  pkey_mprotect() sets the PKEY on
the region.  I argue you can then make it entirely immutable / sealed.

Let's say it is fully immutable / sealed.

After which, you can change the in-processor PKU register (using pkey_set)
to toggle the Write-Inhibit and eXecute-Inhibit bits on that key.

The immutable object has a dangerous RWX permission.  But the specific
PKEY makes it either RX (or X-only) or RW depending upon your context.
The mapping is never exposed as RWX.  The PKU model reduces the effective
permission of the object below the immutable permission level.
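
Roughly, with the glibc pkey wrappers, the pivot would look like this
(a sketch only: error handling elided, and the sealing step is a comment
because its exact form is what this thread is debating; note x86 PKU has
no execute-inhibit bit, so the locked state there is RX, not X-only):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    static char *setup_jit(size_t len, int *pkey_out)
    {
            /* One RWX anonymous object... */
            char *jit = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* ...tagged with its own protection key... */
            int pkey = pkey_alloc(0, 0);
            pkey_mprotect(jit, len, PROT_READ | PROT_WRITE | PROT_EXEC, pkey);

            /* ...then sealed, so no later syscall can remap it or widen
             * the permission.  Pseudocode for this proposal: */
            /* mseal(jit, len, ...);  or  mimmutable(jit, len); */

            *pkey_out = pkey;
            return jit;
    }

    /* The per-thread pivot is a register write, not a syscall. */
    static void jit_begin_write(int pkey) { pkey_set(pkey, 0); }
    static void jit_end_write(int pkey)   { pkey_set(pkey, PKEY_DISABLE_WRITE); }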

The security depends on the PKEY WI/XI bits being difficult to control.

SADLY on x86, this is managed with a PKRU userland register which is
changeable without any supervisor control -- yes, it seems quite dangerous.
Changing it requires a complicated MSR dance.  It is unfortunate that
the pkey_set() library function is easily reachable in the PLT via ROP
methods.  On non-x86 cpus that have similar functionality, the register
is privileged, but operating systems supporting it generally change it
and return immediately.

The problem you seem to have with fully locked mseal() in chrome seems
to be here:

> about permission changes but sometimes we do need to mprotect data only
> pages.

Does that data have to be in the same region?  Can your allocator not
put the non-code pieces of the JIT elsewhere, with a different
permission, fully immutable / msealed -- and perhaps even managed with a
different PKEY if necessary?

Maybe that requires a huge rearchitecture.  But isn't the root problem here
that data and code are being handled in the same object with a shared
permission model?

> But the munmap sealing will provide protection against implicit changes of the
> pkey in this case which would happen if a page gets unmapped and another
> mapped in its place.

That primitive feels so weird, I have a difficult time believing it will
remain unattackable in the long term.


But what if you could replace mprotect() with pkey_mprotect() on a
different key?

--

A few more notes comparing what OpenBSD has done compared to Linux:


In OpenBSD, we do not have the pkey library.  We have stolen one of the
PKEYs and use it for kernel support of execute-only (xonly) kernel code
and userland code.  On x86 we recognize that userland can flip the
permission by whacking the PKRU register -- which would make xonly code
readable.  (The chrome data you are trying to guard faces the same problem).

To prevent that, a majority of traps in the kernel (page faults,
interrupts, etc) check if the PKRU register has been modified, and kill
the process.  It is statistically strong.

We are not making pkey available as a userland feature, but if we later
do so we would still have 15 pkeys to play with.  We would probably make
the pkey_set() operation a system call, so the trap handler can also
observe PKRU register modifications by the instruction.

Above, I mentioned pivoting between "RW or RX (or X-only)".  On OpenBSD,
chrome would be able to pivot between RW and X-only.

When it comes to Pkey utilization, we've ended up in a very different
place than Linux.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-19 22:47             ` Pedro Falcato
  2023-10-19 23:06               ` Linus Torvalds
@ 2023-10-23 17:42               ` Jeff Xu
  1 sibling, 0 replies; 43+ messages in thread
From: Jeff Xu @ 2023-10-23 17:42 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Jeff Xu, Matthew Wilcox, akpm, keescook, sroettger, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, torvalds, lstoakes,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote:
> >
> > Hi Pedro
> >
> > Some followup on mmap() + mprotect():
> >
> > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
> > >
> > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > > mmap() + mseal().
> > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > > definitely sounds like a performance win as we halve the number of
> > > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > > like common patterns.
> > > > > >
> > > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > > immutable from the beginning.
> > > > > This is orthogonal to mseal() though.
> > > >
> > > > I don't see how this can be orthogonal to mseal().
> > > > In the case we opt for adding PROT_ bits, we should more or less only
> > > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > > issues.
> > > > vma merging won't merge vmas with different prots. The current
> > > > interfaces (mmap and mprotect) would work just fine.
> > > > In this case, mseal() or mimmutable() would only be needed if you need
> > > > to set immutability over a range of VMAs with different permissions.
> > > >
> > > Agreed. By orthogonal, I meant we can have two APIs:
> > > mmap() and mseal()/mprotect();
> > > i.e. we can't rely on mmap() alone without mseal()/mprotect()/mimmutable().
> > > Sealing can be applied after initial memory creation.
> > >
> > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > > The only annoying wrench in my plans here is that we have effectively
> > > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > > I described is not compatible with 32-bit.
> > > >
> > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > > changed to RO, for example, during symbol resolution.
> > > > > The important point is that the application can decide what type of
> > > > > sealing it wants and when to apply it.  There needs to be an API;
> > > > > it can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > > important to me.
> > > > >
> > > > > mprotect() in Linux has the following signature:
> > > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > > the prot bitmasks are all taken here.
> > > > > I have not checked the prot field in mmap(); there might be bits left,
> > > > > and even if not, we could have mmap2(), so that is not an issue.
> > > >
> > > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > > and we seem to have around 8 different bits used).
> > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > > >
> > > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > > can probably be worked around if need be.
> > > >
> > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > > Apologies that I was wrong previously and caused confusion.
> > >
> > > There is a slight difference in the semantics of mprotect and mseal.
> > > Each time mprotect() is called, the kernel takes all of the RWX bits
> > > and updates vm_flags.
> > > In other words, the application sets/unsets each of RWX, and the kernel
> > > takes it.
> > >
> > > In the mseal() case, the kernel will remember which seal types were
> > > applied previously, and the application doesn’t need to repeat all
> > > existing seal types in the next mseal().  Once a seal type is applied,
> > > it can’t be unsealed.
> > >
> > > So if we want to use mprotect() for sealing, developers need to think
> > > of sealing bits differently than the rest of prot bits. It is a
> > > different programming model, which might or might not be an obvious
> > > concept to developers.
> > >
> > This probably doesn't matter much to developers.
> > We can enforce the sealing bit to be the same as the rest of PROT bits.
> > If mprotect() tries to unset sealing, it will fail.
>
> Yep. Erroneous or malicious mprotects would all be caught. However, if
> we add a PROT_DOWNGRADEABLE (that could let you, let's say, mprotect()
> to less permissions or even downright munmap()) you'd want some care
> to preserve that bit when setting permissions.
>
> >
> > > There is a difference in input checking and error handling as well.
> > > For mseal(), if a given address range has a gap (unallocated memory),
> > > or if one of the VMAs is sealed with the MM_SEAL_SEAL flag, none of the
> > > VMAs is updated.
> > > For mprotect(), some VMAs can be updated until an error happens on a VMA.
> > >
> > This difference doesn't matter much.
> >
> > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).
>
> > This can be made backward compatible.
> > If there is no objection to adding Linux-specific values in mmap() and
> > mprotect(), this works for me.
>
> Linux already has system-specific values for PROT_ (PROT_BTI,
> PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
> Whether this is the right interface is another question. I do like it
> a lot, but there's of course value in being compatible with existing
> solutions (like mimmutable()).
>

Thanks, Pedro, for providing examples of mm extensions to POSIX. This
opens up more design options for solving the sealing problem. I will take
a few days to research them.
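
One option I plan to look at is folding sealing into the PROT_ bits so
that mmap() can seal in one shot.  A rough sketch -- PROT_SEAL below is a
made-up value purely for illustration, no such flag exists today:

    #include <sys/mman.h>

    #define PROT_SEAL 0x04000000   /* hypothetical, illustration only */

    int main(void)
    {
            /* Map, set the final permission, and seal in a single call,
             * instead of mmap() followed by mseal()/mimmutable(). */
            void *ro = mmap(NULL, 4096, PROT_READ | PROT_SEAL,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* A sealing-aware kernel would from here on reject mprotect(),
             * munmap(), mremap(), etc. on this range. */
            (void)ro;
            return 0;
    }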

-Jeff


> --
> Pedro

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-19 23:06               ` Linus Torvalds
@ 2023-10-23 17:44                 ` Jeff Xu
  0 siblings, 0 replies; 43+ messages in thread
From: Jeff Xu @ 2023-10-23 17:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pedro Falcato, Jeff Xu, Matthew Wilcox, akpm, keescook,
	sroettger, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, jannh, surenb, alex.sierra, apopple, aneesh.kumar,
	axelrasmussen, ben, catalin.marinas, david, dwmw, ying.huang,
	hughd, joey.gouly, corbet, wangkefeng.wang, Liam.Howlett,
	lstoakes, mawupeng1, linmiaohe, namit, peterx, peterz,
	ryan.roberts, shr, vbabka, xiujianfeng, yu.ma, zhangpeng362,
	dave.hansen, luto, linux-hardening

On Thu, Oct 19, 2023 at 4:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
> >
> > No. POSIX works merely as a baseline that UNIX systems aim towards.
> > You can (and very frequently do) extend POSIX interfaces (in fact,
> > it's how most of POSIX was written, through sheer
> > "design-by-committee" on a bunch of UNIX systems' extensions).
>
> We can in extreme circumstances actually go further than that, and not
> only extend POSIX requirements, but even actively violate them.
>
> It does need a very good reason, but it has happened when
> POSIX requirements were simply actively wrong.
>
> For example, at one point POSIX required
>
>      int accept(int s, struct sockaddr *addr, size_t *addrlen);
>
> which was simply completely wrong. It's utter shite, and didn't
> actually match any reality.
>
> The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
> it "size_t" was completely unacceptable.
>
> So we ignored it, told POSIX people that they were full of sh*t, and
> they eventually fixed it in the next version (by introducing a
> "socklen_t" that had better be the same as "int").
>
> So POSIX can simply be wrong.
>
> Also, if it turns out that we had some ABI that wasn't
> POSIX-compatible, the whole "don't break user space" will take
> precedence over any POSIX concerns, and we will not "fix" our system
> call if people already use our old semantics.
>
> So in that case, we generally end up with a new system call (or new
> flags) instead.
>
> Or sometimes it just is such a small detail that nobody cares - POSIX
> also has a notion of documenting areas of non-conformance, and people
> who really care end up having notions like "conformance vs _strict_
> conformance".
>
>                  Linus
>

Thanks, Linus, for clarifying the guidelines on POSIX in Linux.

-Jeff

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH v1 0/8] Introduce mseal() syscall
  2023-10-20 16:27             ` Theo de Raadt
@ 2023-10-24 10:42               ` Stephen Röttger
  0 siblings, 0 replies; 43+ messages in thread
From: Stephen Röttger @ 2023-10-24 10:42 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Linus Torvalds, Jeff Xu, jeffxu, akpm, keescook, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, jannh, surenb,
	alex.sierra, apopple, aneesh.kumar, axelrasmussen, ben,
	catalin.marinas, david, dwmw, ying.huang, hughd, joey.gouly,
	corbet, wangkefeng.wang, Liam.Howlett, lstoakes, willy,
	mawupeng1, linmiaohe, namit, peterx, peterz, ryan.roberts, shr,
	vbabka, xiujianfeng, yu.ma, zhangpeng362, dave.hansen, luto,
	linux-hardening

> The problem you seem to have with fully locked mseal() in chrome seems
> to be here:
>
> > about permission changes but sometimes we do need to mprotect data only
> > pages.
>
> Does that data have to be in the same region?  Can your allocator not
> put the non-code pieces of the JIT elsewhere, with a different
> permission, fully immutable / msealed -- and perhaps even managed with a
> different PKEY if neccessary?

No, we can't. We investigated this extensively, since this also poses some
difficulties on macOS. We implemented different approaches, but any such
change to the allocator introduces too much of a performance impact.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2023-10-24 10:43 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-16 14:38 [RFC PATCH v1 0/8] Introduce mseal() syscall jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 1/8] Add mseal syscall jeffxu
2023-10-16 15:05   ` Greg KH
2023-10-17  6:50     ` Jeff Xu
2023-10-16 14:38 ` [RFC PATCH v1 2/8] Wire up " jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 3/8] mseal: add can_modify_mm and can_modify_vma jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 4/8] mseal: seal mprotect jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 5/8] mseal munmap jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 6/8] mseal mremap jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 7/8] mseal mmap jeffxu
2023-10-16 14:38 ` [RFC PATCH v1 8/8] selftest mm/mseal mprotect/munmap/mremap/mmap jeffxu
2023-10-16 15:18 ` [RFC PATCH v1 0/8] Introduce mseal() syscall Matthew Wilcox
2023-10-17  8:34   ` Jeff Xu
2023-10-17 12:56     ` Matthew Wilcox
2023-10-17 15:29   ` Pedro Falcato
2023-10-17 21:33     ` Jeff Xu
2023-10-17 22:35       ` Pedro Falcato
2023-10-18 18:20         ` Jeff Xu
2023-10-19 17:30           ` Jeff Xu
2023-10-19 22:47             ` Pedro Falcato
2023-10-19 23:06               ` Linus Torvalds
2023-10-23 17:44                 ` Jeff Xu
2023-10-23 17:42               ` Jeff Xu
2023-10-16 17:23 ` Linus Torvalds
2023-10-17  9:07   ` Jeff Xu
2023-10-17 17:22     ` Linus Torvalds
2023-10-17 18:20       ` Theo de Raadt
2023-10-17 18:38         ` Linus Torvalds
2023-10-17 18:55           ` Theo de Raadt
2023-10-19  8:00           ` Stephen Röttger
2023-10-20 16:27             ` Theo de Raadt
2023-10-24 10:42               ` Stephen Röttger
2023-10-17 23:01         ` Jeff Xu
2023-10-17 23:56           ` Theo de Raadt
2023-10-18  3:18             ` Jeff Xu
2023-10-18  3:37               ` Theo de Raadt
2023-10-18 15:17               ` Matthew Wilcox
2023-10-18 18:54                 ` Jeff Xu
2023-10-18 20:36                   ` Theo de Raadt
2023-10-19  8:28                     ` Stephen Röttger
2023-10-20 15:55                       ` Theo de Raadt
2023-10-16 17:34 ` Jann Horn
2023-10-17  8:42   ` Jeff Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).