linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 0/4] Introduce mseal
@ 2024-01-31 17:50 jeffxu
  2024-01-31 17:50 ` [PATCH v8 1/4] mseal: Wire up mseal syscall jeffxu
                   ` (4 more replies)
  0 siblings, 5 replies; 50+ messages in thread
From: jeffxu @ 2024-01-31 17:50 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.

Two system calls are involved in sealing the map:  mmap() and mseal().

The new mseal() is an syscall on 64 bit CPU, and with
following signature:

int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(), can leave an empty space, therefore can
   be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
   risks to sealed VMAs. It is included anyway because the use case is
   unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
   memory, when users don't have write permission to the memory. Those
   behaviors can alter region contents by discarding pages, effectively a
   memset(0) for anonymous memory.

In addition: mmap() has two related changes.

The PROT_SEAL bit in prot field of mmap(). When present, it marks
the map sealed since creation.

The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.

Applications that don't care about sealing will expect their behavior
unchanged. For those that need sealing support, opt-in by adding
MAP_SEALABLE in mmap().

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable, the lifetime of those mappings
are tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings are not tied to the lifetime of the process, therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory. For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
  destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from
  implementing mimmutable() in OpenBSD.

Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address) 
- Update mseal.rst to add note for MAP_SEALABLE.

V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.org/T/

V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024) 
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/

V5:
- fix build issue in mseal-Wire-up-mseal-syscall
  (Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r

V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/

V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.org/

v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/

v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/

Jeff Xu (4):
  mseal: Wire up mseal syscall
  mseal: add mseal syscall
  selftest mm/mseal memory sealing
  mseal:add documentation

 Documentation/userspace-api/index.rst       |    1 +
 Documentation/userspace-api/mseal.rst       |  215 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/mman-common.h      |    8 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 kernel/sys_ni.c                             |    1 +
 mm/Makefile                                 |    4 +
 mm/internal.h                               |   48 +
 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   35 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  343 ++++
 tools/testing/selftests/mm/.gitignore       |    1 +
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 2024 +++++++++++++++++++
 33 files changed, 2756 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

-- 
2.43.0.429.g432eaa2c6b-goog


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v8 1/4] mseal: Wire up mseal syscall
  2024-01-31 17:50 [PATCH v8 0/4] Introduce mseal jeffxu
@ 2024-01-31 17:50 ` jeffxu
  2024-01-31 17:50 ` [PATCH v8 2/4] mseal: add " jeffxu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 50+ messages in thread
From: jeffxu @ 2024-01-31 17:50 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Wire up mseal syscall for all architectures.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 kernel/sys_ni.c                             | 1 +
 19 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ff110826ce2..d8f96362e9f8 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -501,3 +501,4 @@
 569	common	lsm_get_self_attr		sys_lsm_get_self_attr
 570	common	lsm_set_self_attr		sys_lsm_set_self_attr
 571	common	lsm_list_modules		sys_lsm_list_modules
+572	common  mseal				sys_mseal
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index b6c9e01e14f5..2ed7d229c8f9 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -475,3 +475,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal				sys_mseal
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 491b2b9bd553..1346579f802f 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		462
+#define __NR_compat_syscalls		463
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 7118282d1c79..266b96acc014 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -929,6 +929,8 @@ __SYSCALL(__NR_lsm_get_self_attr, sys_lsm_get_self_attr)
 __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
+#define __NR_mseal 462
+__SYSCALL(__NR_mseal, sys_mseal)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7fd43fd4c9f2..22a3cbd4c602 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal				sys_mseal
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b00ab2cabab9..2b81a6bd78b2 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 83cfc9eb6b88..cc869f5d5693 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -400,3 +400,4 @@
 459	n32	lsm_get_self_attr		sys_lsm_get_self_attr
 460	n32	lsm_set_self_attr		sys_lsm_set_self_attr
 461	n32	lsm_list_modules		sys_lsm_list_modules
+462	n32	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 532b855df589..1464c6be6eb3 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -376,3 +376,4 @@
 459	n64	lsm_get_self_attr		sys_lsm_get_self_attr
 460	n64	lsm_set_self_attr		sys_lsm_set_self_attr
 461	n64	lsm_list_modules		sys_lsm_list_modules
+462	n64	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f45c9530ea93..008ebe60263e 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -449,3 +449,4 @@
 459	o32	lsm_get_self_attr		sys_lsm_get_self_attr
 460	o32	lsm_set_self_attr		sys_lsm_set_self_attr
 461	o32	lsm_list_modules		sys_lsm_list_modules
+462	o32	mseal				sys_mseal
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index b236a84c4e12..b13c21373974 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal				sys_mseal
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 17173b82ca21..3656f1ca7a21 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -548,3 +548,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal				sys_mseal
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 095bb86339a7..bd0fee24ad10 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -464,3 +464,4 @@
 459  common	lsm_get_self_attr	sys_lsm_get_self_attr		sys_lsm_get_self_attr
 460  common	lsm_set_self_attr	sys_lsm_set_self_attr		sys_lsm_set_self_attr
 461  common	lsm_list_modules	sys_lsm_list_modules		sys_lsm_list_modules
+462  common	mseal			sys_mseal			sys_mseal
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 86fe269f0220..bbf83a2db986 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -464,3 +464,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal				sys_mseal
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index b23d59313589..ac6c281ccfe0 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -507,3 +507,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal 				sys_mseal
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 5f8591ce7f25..7fd1f57ad3d3 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -466,3 +466,4 @@
 459	i386	lsm_get_self_attr	sys_lsm_get_self_attr
 460	i386	lsm_set_self_attr	sys_lsm_set_self_attr
 461	i386	lsm_list_modules	sys_lsm_list_modules
+462	i386	mseal 			sys_mseal
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..52df0dec70da 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462 	common  mseal			sys_mseal
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index dd116598fb25..67083fc1b2f5 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -432,3 +432,4 @@
 459	common	lsm_get_self_attr		sys_lsm_get_self_attr
 460	common	lsm_set_self_attr		sys_lsm_set_self_attr
 461	common	lsm_list_modules		sys_lsm_list_modules
+462	common	mseal 				sys_mseal
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..d983c48a3b6a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_mseal 462
+__SYSCALL(__NR_mseal, sys_mseal)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..d7eee421d4bc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -196,6 +196,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(mseal);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
-- 
2.43.0.429.g432eaa2c6b-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 2/4] mseal: add mseal syscall
  2024-01-31 17:50 [PATCH v8 0/4] Introduce mseal jeffxu
  2024-01-31 17:50 ` [PATCH v8 1/4] mseal: Wire up mseal syscall jeffxu
@ 2024-01-31 17:50 ` jeffxu
  2024-02-01 23:11   ` Eric Biggers
  2024-01-31 17:50 ` [PATCH v8 3/4] selftest mm/mseal memory sealing jeffxu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 50+ messages in thread
From: jeffxu @ 2024-01-31 17:50 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

The new mseal() is an syscall on 64 bit CPU, and with
following signature:

int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(), can leave an empty space, therefore can
   be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
   risks to sealed VMAs. It is included anyway because the use case is
   unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
   memory, when users don't have write permission to the memory. Those
   behaviors can alter region contents by discarding pages, effectively a
   memset(0) for anonymous memory.

In addition: mmap() has two related changes.

The PROT_SEAL bit in prot field of mmap(). When present, it marks
the map sealed since creation.

The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.

Applications that don't care about sealing will expect their behavior
unchanged. For those that need sealing support, opt-in by adding
MAP_SEALABLE in mmap().

Following input during RFC are incooperated into this patch:

Jann Horn: raising awareness and providing valuable insights on the
destructive madvise operations.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Liam R. Howlett: perf optimization.

Finally, the idea that inspired this patch comes from Stephen Röttger’s
work in Chrome V8 CFI.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/mman-common.h |   8 +
 mm/Makefile                            |   4 +
 mm/internal.h                          |  48 ++++
 mm/madvise.c                           |  12 +
 mm/mmap.c                              |  35 ++-
 mm/mprotect.c                          |  10 +
 mm/mremap.c                            |  31 +++
 mm/mseal.c                             | 343 +++++++++++++++++++++++++
 9 files changed, 491 insertions(+), 1 deletion(-)
 create mode 100644 mm/mseal.c

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cdba4d0c6d4a..2d44e0d99e37 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -820,6 +820,7 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long flags);
 asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				unsigned long mode,
 				const unsigned long __user *nmask,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..3ca4d694a621 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -17,6 +17,11 @@
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
 
+/*
+ * The PROT_SEAL defines memory sealing in the prot argument of mmap().
+ */
+#define PROT_SEAL	0x04000000	/* _BITUL(26) */
+
 /* 0x01 - 0x03 are defined in linux/mman.h */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
@@ -33,6 +38,9 @@
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
 
+/* map is sealable */
+#define MAP_SEALABLE	0x8000000	/* _BITUL(27) */
+
 /*
  * Flags for mlock
  */
diff --git a/mm/Makefile b/mm/Makefile
index e4b5b75aaec9..cbae83f74642 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -43,6 +43,10 @@ ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
 endif
 
+ifdef CONFIG_64BIT
+mmu-$(CONFIG_MMU)	+= mseal.o
+endif
+
 obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   maccess.o page-writeback.o folio-compat.o \
 			   readahead.o swap.o truncate.o vmscan.o shrinker.o \
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..00b45c8550c4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1221,6 +1221,54 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
 unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
 			  int priority);
 
+#ifdef CONFIG_64BIT
+/* VM is sealable, in vm_flags */
+#define VM_SEALABLE	_BITUL(63)
+
+/* VM is sealed, in vm_flags */
+#define VM_SEALED	_BITUL(62)
+#endif
+
+#ifdef CONFIG_64BIT
+static inline int can_do_mseal(unsigned long flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return 0;
+}
+
+bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end);
+bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior);
+unsigned long get_mmap_seals(unsigned long prot,
+		unsigned long flags);
+#else
+static inline int can_do_mseal(unsigned long flags)
+{
+	return -EPERM;
+}
+
+static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end)
+{
+	return true;
+}
+
+static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior)
+{
+	return true;
+}
+
+static inline unsigned long get_mmap_seals(unsigned long prot,
+	unsigned long flags)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_SHRINKER_DEBUG
 static inline __printf(2, 0) int shrinker_debugfs_name_alloc(
 			struct shrinker *shrinker, const char *fmt, va_list ap)
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..9c0761c68111 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1393,6 +1393,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EIO    - an I/O error occurred while paging in data.
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
+ *  -EPERM  - memory is sealed.
  */
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
@@ -1436,10 +1437,21 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	start = untagged_addr_remote(mm, start);
 	end = start + len;
 
+	/*
+	 * Check if the address range is sealed for do_madvise().
+	 * can_modify_mm_madv assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm_madv(mm, start, end, behavior)) {
+		error = -EPERM;
+		goto out;
+	}
+
 	blk_start_plug(&plug);
 	error = madvise_walk_vmas(mm, start, end, behavior,
 			madvise_vma_behavior);
 	blk_finish_plug(&plug);
+
+out:
 	if (write)
 		mmap_write_unlock(mm);
 	else
diff --git a/mm/mmap.c b/mm/mmap.c
index b78e83d351d2..4b3143044db4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1213,6 +1213,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
+	unsigned long vm_seals;
 
 	*populate = 0;
 
@@ -1233,6 +1234,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (flags & MAP_FIXED_NOREPLACE)
 		flags |= MAP_FIXED;
 
+	vm_seals = get_mmap_seals(prot, flags);
+
 	if (!(flags & MAP_FIXED))
 		addr = round_hint_to_min(addr);
 
@@ -1261,6 +1264,16 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			return -EEXIST;
 	}
 
+	/*
+	 * addr is returned from get_unmapped_area,
+	 * There are two cases:
+	 * 1> MAP_FIXED == false
+	 *	unallocated memory, no need to check sealing.
+	 * 1> MAP_FIXED == true
+	 *	sealing is checked inside mmap_region when
+	 *	do_vmi_munmap is called.
+	 */
+
 	if (prot == PROT_EXEC) {
 		pkey = execute_only_pkey(mm);
 		if (pkey < 0)
@@ -1376,6 +1389,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	vm_flags |= vm_seals;
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
@@ -2679,6 +2693,14 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
+	/*
+	 * Check if memory is sealed before arch_unmap.
+	 * Prevent unmapping a sealed VMA.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, start, end))
+		return -EPERM;
+
 	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
@@ -2741,7 +2763,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Unmap any existing mapping in the area */
-	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false))
+	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
+	if (error == -EPERM)
+		return error;
+	else if (error)
 		return -ENOMEM;
 
 	/*
@@ -3102,6 +3127,14 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
+	/*
+	 * Check if memory is sealed before arch_unmap.
+	 * Prevent unmapping a sealed VMA.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, start, end))
+		return -EPERM;
+
 	arch_unmap(mm, start, end);
 	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 81991102f785..5f0f716bf4ae 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/memory-tiers.h>
+#include <uapi/linux/mman.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -743,6 +744,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		}
 	}
 
+	/*
+	 * checking if memory is sealed.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(current->mm, start, end)) {
+		error = -EPERM;
+		goto out;
+	}
+
 	prev = vma_prev(&vmi);
 	if (start > vma->vm_start)
 		prev = vma;
diff --git a/mm/mremap.c b/mm/mremap.c
index 38d98465f3d8..d69b438dcf83 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -902,7 +902,25 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
 		return -ENOMEM;
 
+	/*
+	 * In mremap_to().
+	 * Move a VMA to another location, check if src addr is sealed.
+	 *
+	 * Place can_modify_mm here because mremap_to()
+	 * does its own checking for address range, and we only
+	 * check the sealing after passing those checks.
+	 *
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, addr, addr + old_len))
+		return -EPERM;
+
 	if (flags & MREMAP_FIXED) {
+		/*
+		 * In mremap_to().
+		 * VMA is moved to dst address, and munmap dst first.
+		 * do_munmap will check if dst is sealed.
+		 */
 		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
 		if (ret)
 			goto out;
@@ -1061,6 +1079,19 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		goto out;
 	}
 
+	/*
+	 * Below is shrink/expand case (not mremap_to())
+	 * Check if src address is sealed, if so, reject.
+	 * In other words, prevent shrinking or expanding a sealed VMA.
+	 *
+	 * Place can_modify_mm here so we can keep the logic related to
+	 * shrink/expand together.
+	 */
+	if (!can_modify_mm(mm, addr, addr + old_len)) {
+		ret = -EPERM;
+		goto out;
+	}
+
 	/*
 	 * Always allow a shrinking remap: that just unmaps
 	 * the unnecessary pages..
diff --git a/mm/mseal.c b/mm/mseal.c
new file mode 100644
index 000000000000..abc00c0b9895
--- /dev/null
+++ b/mm/mseal.c
@@ -0,0 +1,343 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  Implement mseal() syscall.
+ *
+ *  Copyright (c) 2023,2024 Google, Inc.
+ *
+ *  Author: Jeff Xu <jeffxu@chromium.org>
+ */
+
+#include <linux/mempolicy.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/mm_inline.h>
+#include <linux/mmu_context.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+static inline bool vma_is_sealed(struct vm_area_struct *vma)
+{
+	return (vma->vm_flags & VM_SEALED);
+}
+
+static inline bool vma_is_sealable(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_SEALABLE;
+}
+
+static inline void set_vma_sealed(struct vm_area_struct *vma)
+{
+	vm_flags_set(vma, VM_SEALED);
+}
+
+/*
+ * check if a vma is sealed for modification.
+ * return true, if modification is allowed.
+ */
+static bool can_modify_vma(struct vm_area_struct *vma)
+{
+	if (vma_is_sealed(vma))
+		return false;
+
+	return true;
+}
+
+static bool is_madv_discard(int behavior)
+{
+	return	behavior &
+		(MADV_FREE | MADV_DONTNEED | MADV_DONTNEED_LOCKED |
+		 MADV_REMOVE | MADV_DONTFORK | MADV_WIPEONFORK);
+}
+
+static bool is_ro_anon(struct vm_area_struct *vma)
+{
+	/* check anonymous mapping. */
+	if (vma->vm_file || vma->vm_flags & VM_SHARED)
+		return false;
+
+	/*
+	 * check for non-writable:
+	 * PROT=RO or PKRU is not writeable.
+	 */
+	if (!(vma->vm_flags & VM_WRITE) ||
+		!arch_vma_access_permitted(vma, true, false, false))
+		return true;
+
+	return false;
+}
+
+/*
+ * Check if the vmas of a memory range are allowed to be modified.
+ * the memory ranger can have a gap (unallocated memory).
+ * return true, if it is allowed.
+ */
+bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, mm, start);
+
+	/* going through each vma to check. */
+	for_each_vma_range(vmi, vma, end) {
+		if (!can_modify_vma(vma))
+			return false;
+	}
+
+	/* Allow by default. */
+	return true;
+}
+
+/*
+ * Check if the vmas of a memory range are allowed to be modified by madvise.
+ * the memory ranger can have a gap (unallocated memory).
+ * return true, if it is allowed.
+ */
+bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long end,
+		int behavior)
+{
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, mm, start);
+
+	if (!is_madv_discard(behavior))
+		return true;
+
+	/* going through each vma to check. */
+	for_each_vma_range(vmi, vma, end)
+		if (is_ro_anon(vma) && !can_modify_vma(vma))
+			return false;
+
+	/* Allow by default. */
+	return true;
+}
+
+unsigned long get_mmap_seals(unsigned long prot,
+		unsigned long flags)
+{
+	unsigned long vm_seals;
+
+	if (prot & PROT_SEAL)
+		vm_seals = VM_SEALED | VM_SEALABLE;
+	else
+		vm_seals = (flags & MAP_SEALABLE) ? VM_SEALABLE : 0;
+
+	return vm_seals;
+}
+
+/*
+ * Check if a seal type can be added to VMA.
+ */
+static bool can_add_vma_seal(struct vm_area_struct *vma)
+{
+	/* if map is not sealable, reject. */
+	if (!vma_is_sealable(vma))
+		return false;
+
+	return true;
+}
+
+static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, vm_flags_t newflags)
+{
+	int ret = 0;
+	vm_flags_t oldflags = vma->vm_flags;
+
+	if (newflags == oldflags)
+		goto out;
+
+	vma = vma_modify_flags(vmi, *prev, vma, start, end, newflags);
+	if (IS_ERR(vma)) {
+		ret = PTR_ERR(vma);
+		goto out;
+	}
+
+	set_vma_sealed(vma);
+out:
+	*prev = vma;
+	return ret;
+}
+
+/*
+ * Check for do_mseal:
+ * 1> start is part of a valid vma.
+ * 2> end is part of a valid vma.
+ * 3> No gap (unallocated address) between start and end.
+ * 4> map is sealable.
+ */
+static int check_mm_seal(unsigned long start, unsigned long end)
+{
+	struct vm_area_struct *vma;
+	unsigned long nstart = start;
+
+	VMA_ITERATOR(vmi, current->mm, start);
+
+	/* going through each vma to check. */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_start > nstart)
+			/* unallocated memory found. */
+			return -ENOMEM;
+
+		if (!can_add_vma_seal(vma))
+			return -EACCES;
+
+		if (vma->vm_end >= end)
+			return 0;
+
+		nstart = vma->vm_end;
+	}
+
+	return -ENOMEM;
+}
+
+/*
+ * Apply sealing.
+ */
+static int apply_mm_seal(unsigned long start, unsigned long end)
+{
+	unsigned long nstart;
+	struct vm_area_struct *vma, *prev;
+
+	VMA_ITERATOR(vmi, current->mm, start);
+
+	vma = vma_iter_load(&vmi);
+	/*
+	 * Note: check_mm_seal should already checked ENOMEM case.
+	 * so vma should not be null, same for the other ENOMEM cases.
+	 */
+	prev = vma_prev(&vmi);
+	if (start > vma->vm_start)
+		prev = vma;
+
+	nstart = start;
+	for_each_vma_range(vmi, vma, end) {
+		int error;
+		unsigned long tmp;
+		vm_flags_t newflags;
+
+		newflags = vma->vm_flags | VM_SEALED;
+		tmp = vma->vm_end;
+		if (tmp > end)
+			tmp = end;
+		error = mseal_fixup(&vmi, vma, &prev, nstart, tmp, newflags);
+		if (error)
+			return error;
+		tmp = vma_iter_end(&vmi);
+		nstart = tmp;
+	}
+
+	return 0;
+}
+
+/*
+ * mseal(2) seals the VM's meta data from
+ * selected syscalls.
+ *
+ * addr/len: VM address range.
+ *
+ *  The address range by addr/len must meet:
+ *   start (addr) must be in a valid VMA.
+ *   end (addr + len) must be in a valid VMA.
+ *   no gap (unallocated memory) between start and end.
+ *   start (addr) must be page aligned.
+ *
+ *  len: len will be page aligned implicitly.
+ *
+ *   Below VMA operations are blocked after sealing.
+ *   1> Unmapping, moving to another location, and shrinking
+ *	the size, via munmap() and mremap(), can leave an empty
+ *	space, therefore can be replaced with a VMA with a new
+ *	set of attributes.
+ *   2> Moving or expanding a different vma into the current location,
+ *	via mremap().
+ *   3> Modifying a VMA via mmap(MAP_FIXED).
+ *   4> Size expansion, via mremap(), does not appear to pose any
+ *	specific risks to sealed VMAs. It is included anyway because
+ *	the use case is unclear. In any case, users can rely on
+ *	merging to expand a sealed VMA.
+ *   5> mprotect and pkey_mprotect.
+ *   6> Some destructive madvice() behavior (e.g. MADV_DONTNEED)
+ *      for anonymous memory, when users don't have write permission to the
+ *	memory. Those behaviors can alter region contents by discarding pages,
+ *	effectively a memset(0) for anonymous memory.
+ *
+ *  flags: reserved.
+ *
+ * return values:
+ *  zero: success.
+ *  -EINVAL:
+ *   invalid input flags.
+ *   start address is not page aligned.
+ *   Address arange (start + len) overflow.
+ *  -ENOMEM:
+ *   addr is not a valid address (not allocated).
+ *   end (start + len) is not a valid address.
+ *   a gap (unallocated memory) between start and end.
+ *  -EACCES:
+ *   MAP_SEALABLE is not set.
+ *  -EPERM:
+ *  - In 32 bit architecture, sealing is not supported.
+ * Note:
+ *  user can call mseal(2) multiple times, adding a seal on an
+ *  already sealed memory is a no-action (no error).
+ *
+ *  unseal() is not supported.
+ */
+static int do_mseal(unsigned long start, size_t len_in, unsigned long flags)
+{
+	size_t len;
+	int ret = 0;
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+
+	ret = can_do_mseal(flags);
+	if (ret)
+		return ret;
+
+	start = untagged_addr(start);
+	if (!PAGE_ALIGNED(start))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len_in);
+	/* Check to see whether len was rounded up from small -ve to zero. */
+	if (len_in && !len)
+		return -EINVAL;
+
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	/*
+	 * First pass, this helps to avoid
+	 * partial sealing in case of error in input address range,
+	 * e.g. ENOMEM and EACCESS error.
+	 */
+	ret = check_mm_seal(start, end);
+	if (ret)
+		goto out;
+
+	/*
+	 * Second pass, this should success, unless there are errors
+	 * from vma_modify_flags, e.g. merge/split error, or process
+	 * reaching the max supported VMAs, however, those cases shall
+	 * be rare.
+	 */
+	ret = apply_mm_seal(start, end);
+
+out:
+	mmap_write_unlock(current->mm);
+	return ret;
+}
+
+SYSCALL_DEFINE3(mseal, unsigned long, start, size_t, len, unsigned long,
+		flags)
+{
+	return do_mseal(start, len, flags);
+}
-- 
2.43.0.429.g432eaa2c6b-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 3/4] selftest mm/mseal memory sealing
  2024-01-31 17:50 [PATCH v8 0/4] Introduce mseal jeffxu
  2024-01-31 17:50 ` [PATCH v8 1/4] mseal: Wire up mseal syscall jeffxu
  2024-01-31 17:50 ` [PATCH v8 2/4] mseal: add " jeffxu
@ 2024-01-31 17:50 ` jeffxu
  2024-01-31 17:50 ` [PATCH v8 4/4] mseal:add documentation jeffxu
  2024-01-31 19:34 ` [PATCH v8 0/4] Introduce mseal Liam R. Howlett
  4 siblings, 0 replies; 50+ messages in thread
From: jeffxu @ 2024-01-31 17:50 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

selftest for memory sealing change in mmap() and mseal().

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 tools/testing/selftests/mm/.gitignore   |    1 +
 tools/testing/selftests/mm/Makefile     |    1 +
 tools/testing/selftests/mm/mseal_test.c | 2024 +++++++++++++++++++++++
 3 files changed, 2026 insertions(+)
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 4ff10ea61461..76474c51c786 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -46,3 +46,4 @@ gup_longterm
 mkdirty
 va_high_addr_switch
 hugetlb_fault_after_madv
+mseal_test
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 2453add65d12..ba36a5c2b1fc 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mrelease_test
 TEST_GEN_FILES += mremap_dontunmap
 TEST_GEN_FILES += mremap_test
+TEST_GEN_FILES += mseal_test
 TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += pagemap_ioctl
 TEST_GEN_FILES += thuge-gen
diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
new file mode 100644
index 000000000000..746bb0f96fe4
--- /dev/null
+++ b/tools/testing/selftests/mm/mseal_test.c
@@ -0,0 +1,2024 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/mman.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <stdbool.h>
+#include "../kselftest.h"
+#include <syscall.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <sys/ioctl.h>
+#include <sys/vfs.h>
+#include <sys/stat.h>
+
+/*
+ * need those definition for manually build using gcc.
+ * gcc -I ../../../../usr/include   -DDEBUG -O3  -DDEBUG -O3 mseal_test.c -o mseal_test
+ */
+#ifndef MAP_SEALABLE
+#define MAP_SEALABLE 0x8000000
+#endif
+
+#ifndef PROT_SEAL
+#define PROT_SEAL 0x04000000
+#endif
+
+#ifndef PKEY_DISABLE_ACCESS
+# define PKEY_DISABLE_ACCESS    0x1
+#endif
+
+#ifndef PKEY_DISABLE_WRITE
+# define PKEY_DISABLE_WRITE     0x2
+#endif
+
+#ifndef PKEY_BITS_PER_KEY
+#define PKEY_BITS_PER_PKEY      2
+#endif
+
+#ifndef PKEY_MASK
+#define PKEY_MASK       (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
+#endif
+
+#define FAIL_TEST_IF_FALSE(c) do {\
+		if (!(c)) {\
+			ksft_test_result_fail("%s, line:%d\n", __func__, __LINE__);\
+			goto test_end;\
+		} \
+	} \
+	while (0)
+
+#define SKIP_TEST_IF_FALSE(c) do {\
+		if (!(c)) {\
+			ksft_test_result_skip("%s, line:%d\n", __func__, __LINE__);\
+			goto test_end;\
+		} \
+	} \
+	while (0)
+
+
+#define TEST_END_CHECK() {\
+		ksft_test_result_pass("%s\n", __func__);\
+		return;\
+test_end:\
+		return;\
+}
+
+#ifndef u64
+#define u64 unsigned long long
+#endif
+
+static unsigned long get_vma_size(void *addr)
+{
+	FILE *maps;
+	char line[256];
+	int size = 0;
+	uintptr_t  addr_start, addr_end;
+
+	maps = fopen("/proc/self/maps", "r");
+	if (!maps)
+		return 0;
+
+	while (fgets(line, sizeof(line), maps)) {
+		if (sscanf(line, "%lx-%lx", &addr_start, &addr_end) == 2) {
+			if (addr_start == (uintptr_t) addr) {
+				size = addr_end - addr_start;
+				break;
+			}
+		}
+	}
+	fclose(maps);
+	return size;
+}
+
+/*
+ * define sys_xyx to call syscall directly.
+ */
+static int sys_mseal(void *start, size_t len)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_mseal, start, len, 0);
+	return sret;
+}
+
+static int sys_mprotect(void *ptr, size_t size, unsigned long prot)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_mprotect, ptr, size, prot);
+	return sret;
+}
+
+static int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+		unsigned long pkey)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey);
+	return sret;
+}
+
+static void *sys_mmap(void *addr, unsigned long len, unsigned long prot,
+	unsigned long flags, unsigned long fd, unsigned long offset)
+{
+	void *sret;
+
+	errno = 0;
+	sret = (void *) syscall(__NR_mmap, addr, len, prot,
+		flags, fd, offset);
+	return sret;
+}
+
+static int sys_munmap(void *ptr, size_t size)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_munmap, ptr, size);
+	return sret;
+}
+
+static int sys_madvise(void *start, size_t len, int types)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_madvise, start, len, types);
+	return sret;
+}
+
+static int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
+{
+	int ret = syscall(__NR_pkey_alloc, flags, init_val);
+
+	return ret;
+}
+
+static unsigned int __read_pkey_reg(void)
+{
+	unsigned int eax, edx;
+	unsigned int ecx = 0;
+	unsigned int pkey_reg;
+
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+			: "=a" (eax), "=d" (edx)
+			: "c" (ecx));
+	pkey_reg = eax;
+	return pkey_reg;
+}
+
+static void __write_pkey_reg(u64 pkey_reg)
+{
+	unsigned int eax = pkey_reg;
+	unsigned int ecx = 0;
+	unsigned int edx = 0;
+
+	asm volatile(".byte 0x0f,0x01,0xef\n\t"
+			: : "a" (eax), "c" (ecx), "d" (edx));
+	assert(pkey_reg == __read_pkey_reg());
+}
+
+static unsigned long pkey_bit_position(int pkey)
+{
+	return pkey * PKEY_BITS_PER_PKEY;
+}
+
+static u64 set_pkey_bits(u64 reg, int pkey, u64 flags)
+{
+	unsigned long shift = pkey_bit_position(pkey);
+
+	/* mask out bits from pkey in old value */
+	reg &= ~((u64)PKEY_MASK << shift);
+	/* OR in new bits for pkey */
+	reg |= (flags & PKEY_MASK) << shift;
+	return reg;
+}
+
+static void set_pkey(int pkey, unsigned long pkey_value)
+{
+	unsigned long mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
+	u64 new_pkey_reg;
+
+	assert(!(pkey_value & ~mask));
+	new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value);
+	__write_pkey_reg(new_pkey_reg);
+}
+
+static void setup_single_address(int size, void **ptrOut)
+{
+	void *ptr;
+
+	ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	assert(ptr != (void *)-1);
+	*ptrOut = ptr;
+}
+
+static void setup_single_address_rw_sealable(int size, void **ptrOut, bool sealable)
+{
+	void *ptr;
+	unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
+
+	if (sealable)
+		mapflags |= MAP_SEALABLE;
+
+	ptr = sys_mmap(NULL, size, PROT_READ | PROT_WRITE, mapflags, -1, 0);
+	assert(ptr != (void *)-1);
+	*ptrOut = ptr;
+}
+
+static void clean_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+}
+
+static void seal_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = sys_mseal(ptr, size);
+	assert(!ret);
+}
+
+bool seal_support(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+
+	ptr = sys_mmap(NULL, page_size, PROT_READ | PROT_SEAL, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (ptr == (void *) -1)
+		return false;
+
+	ret = sys_mseal(ptr, page_size);
+	if (ret < 0)
+		return false;
+
+	return true;
+}
+
+bool pkey_supported(void)
+{
+	int pkey = sys_pkey_alloc(0, 0);
+
+	if (pkey > 0)
+		return true;
+	return false;
+}
+
+static void test_seal_addseal(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_unmapped_start(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* munmap 2 pages from ptr. */
+	ret = sys_munmap(ptr, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mprotect will fail because 2 pages from ptr are unmapped. */
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* mseal will fail because 2 pages from ptr are unmapped. */
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	ret = sys_mseal(ptr + 2 * page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_unmapped_middle(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* munmap 2 pages from ptr + page. */
+	ret = sys_munmap(ptr + page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mprotect will fail, since middle 2 pages are unmapped. */
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* mseal will fail as well. */
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* we still can add seal to the first page and last page*/
+	ret = sys_mseal(ptr, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_mseal(ptr + 3 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_unmapped_end(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* unmap last 2 pages. */
+	ret = sys_munmap(ptr + 2 * page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mprotect will fail since last 2 pages are unmapped. */
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* mseal will fail as well. */
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* The first 2 pages is not sealed, and can add seals */
+	ret = sys_mseal(ptr, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_multiple_vmas(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split the vma into 3. */
+	ret = sys_mprotect(ptr + page_size, 2 * page_size,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mprotect will get applied to all 4 pages - 3 VMAs. */
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* use mprotect to split the vma into 3. */
+	ret = sys_mprotect(ptr + page_size, 2 * page_size,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mseal get applied to all 4 pages - 3 VMAs. */
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_split_start(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split at middle */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal the first page, this will split the VMA */
+	ret = sys_mseal(ptr, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* add seal to the remain 3 pages */
+	ret = sys_mseal(ptr + page_size, 3 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_split_end(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split at middle */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal the last page */
+	ret = sys_mseal(ptr + 3 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* Adding seals to the first 3 pages */
+	ret = sys_mseal(ptr, 3 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_invalid_input(void)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(8 * page_size, &ptr);
+	clean_single_address(ptr + 4 * page_size, 4 * page_size);
+
+	/* invalid flag */
+	ret = syscall(__NR_mseal, ptr, size, 0x20);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* unaligned address */
+	ret = sys_mseal(ptr + 1, 2 * page_size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* length too big */
+	ret = sys_mseal(ptr, 5 * page_size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* length overflow */
+	ret = sys_mseal(ptr, UINT64_MAX/page_size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* start is not in a valid VMA */
+	ret = sys_mseal(ptr - page_size, 5 * page_size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_zero_length(void)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mprotect(ptr, 0, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal 0 length will be OK, same as mprotect */
+	ret = sys_mseal(ptr, 0);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* verify the 4 pages are not sealed by previous call. */
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_zero_address(void)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	/* use mmap to change protection. */
+	ptr = sys_mmap(0, size, PROT_NONE | PROT_SEAL,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr == 0);
+
+	size = get_vma_size(ptr);
+	FAIL_TEST_IF_FALSE(size == 4 * page_size);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* verify the 4 pages are sealed by previous call. */
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_twice(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* apply the same seal will be OK. idempotent. */
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_single_address(ptr, size);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_start_mprotect(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_single_address(ptr, page_size);
+
+	/* the first page is sealed. */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	/* pages after the first page is not sealed. */
+	ret = sys_mprotect(ptr + page_size, page_size * 3,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_end_mprotect(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_single_address(ptr + page_size, 3 * page_size);
+
+	/* first page is not sealed */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* last 3 page are sealed */
+	ret = sys_mprotect(ptr + page_size, page_size * 3,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_unalign_len(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_single_address(ptr, page_size * 2 - 1);
+
+	/* 2 pages are sealed. */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 2, page_size,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_unalign_len_variant_2(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	if (seal)
+		seal_single_address(ptr, page_size * 2 + 1);
+
+	/* 3 pages are sealed. */
+	ret = sys_mprotect(ptr, page_size * 3, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 3, page_size,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_two_vma(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	if (seal)
+		seal_single_address(ptr, page_size * 4);
+
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 2, page_size * 2,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_two_vma_with_split(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split as two vma. */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mseal can apply across 2 vma, also split them. */
+	if (seal)
+		seal_single_address(ptr + page_size, page_size * 2);
+
+	/* the first page is not sealed. */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* the second page is sealed. */
+	ret = sys_mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	/* the third page is sealed. */
+	ret = sys_mprotect(ptr + 2 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	/* the fouth page is not sealed. */
+	ret = sys_mprotect(ptr + 3 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_partial_mprotect(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* seal one page. */
+	if (seal)
+		seal_single_address(ptr, page_size);
+
+	/* mprotect first 2 page will fail, since the first page are sealed. */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_two_vma_with_gap(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split. */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* use mprotect to split. */
+	ret = sys_mprotect(ptr + 3 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* use munmap to free two pages in the middle */
+	ret = sys_munmap(ptr + page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* mprotect will fail, because there is a gap in the address. */
+	/* notes, internally mprotect still updated the first page. */
+	ret = sys_mprotect(ptr, 4 * page_size, PROT_READ);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* mseal will fail as well. */
+	ret = sys_mseal(ptr, 4 * page_size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* the first page is not sealed. */
+	ret = sys_mprotect(ptr, page_size, PROT_READ);
+	FAIL_TEST_IF_FALSE(ret == 0);
+
+	/* the last page is not sealed. */
+	ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ);
+	FAIL_TEST_IF_FALSE(ret == 0);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_split(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split. */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal all 4 pages. */
+	if (seal) {
+		ret = sys_mseal(ptr, 4 * page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* mprotect is sealed. */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+
+	ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mprotect_merge(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split one page. */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal first two pages. */
+	if (seal) {
+		ret = sys_mseal(ptr, 2 * page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* 2 pages are sealed. */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	/* last 2 pages are not sealed. */
+	ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ);
+	FAIL_TEST_IF_FALSE(ret == 0);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_munmap(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* 4 pages are sealed. */
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+/*
+ * allocate 4 pages,
+ * use mprotect to split it as two VMAs
+ * seal the whole range
+ * munmap will fail on both
+ */
+static void test_seal_munmap_two_vma(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	ret = sys_munmap(ptr, page_size * 2);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+/*
+ * allocate a VMA with 4 pages.
+ * munmap the middle 2 pages.
+ * seal the whole 4 pages, will fail.
+ * note: one of the pages are sealed
+ * munmap the first page will be OK.
+ * munmap the last page will be OK.
+ */
+static void test_seal_munmap_vma_with_gap(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	if (seal) {
+		/* can't have gap in the middle. */
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(ret < 0);
+	}
+
+	ret = sys_munmap(ptr, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr + page_size * 2, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_munmap_start_freed(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* unmap the first page. */
+	ret = sys_munmap(ptr, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal the last 3 pages. */
+	if (seal) {
+		ret = sys_mseal(ptr + page_size, 3 * page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* unmap from the first page. */
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		/* note: this will be OK, even the first page is */
+		/* already unmapped. */
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_munmap_end_freed(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	/* unmap last page. */
+	ret = sys_munmap(ptr + page_size * 3, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal the first 3 pages. */
+	if (seal) {
+		ret = sys_mseal(ptr, 3 * page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* unmap all pages. */
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_munmap_middle_freed(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	/* unmap 2 pages in the middle. */
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* seal the first page. */
+	if (seal) {
+		ret = sys_mseal(ptr, page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* munmap all 4 pages. */
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_shrink(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* shrink from 4 pages to 2 pages. */
+	ret2 = mremap(ptr, size, 2 * page_size, 0, 0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else {
+		FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED);
+
+	}
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_expand(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	/* ummap last 2 pages. */
+	ret = sys_munmap(ptr + 2 * page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, 2 * page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* expand from 2 page to 4 pages. */
+	ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else {
+		FAIL_TEST_IF_FALSE(ret2 == ptr);
+
+	}
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_move(bool seal)
+{
+	void *ptr, *newPtr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_address(size, &newPtr);
+	clean_single_address(newPtr, size);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* move from ptr to fixed address. */
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newPtr);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else {
+		FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED);
+
+	}
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mmap_overwrite_prot(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* use mmap to change protection. */
+	ret2 = sys_mmap(ptr, size, PROT_NONE,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else
+		FAIL_TEST_IF_FALSE(ret2 == ptr);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mmap_expand(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 12 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	/* ummap last 4 pages. */
+	ret = sys_munmap(ptr + 8 * page_size, 4 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, 8 * page_size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* use mmap to expand. */
+	ret2 = sys_mmap(ptr, size, PROT_READ,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else
+		FAIL_TEST_IF_FALSE(ret2 == ptr);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mmap_shrink(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 12 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* use mmap to shrink. */
+	ret2 = sys_mmap(ptr, 8 * page_size, PROT_READ,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else
+		FAIL_TEST_IF_FALSE(ret2 == ptr);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_shrink_fixed(bool seal)
+{
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* mremap to move and shrink to fixed address */
+	ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			newAddr);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else
+		FAIL_TEST_IF_FALSE(ret2 == newAddr);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_expand_fixed(bool seal)
+{
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(page_size, &ptr);
+	setup_single_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(newAddr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* mremap to move and expand to fixed address */
+	ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			newAddr);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else
+		FAIL_TEST_IF_FALSE(ret2 == newAddr);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_move_fixed(bool seal)
+{
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(newAddr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* mremap to move to fixed address */
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else
+		FAIL_TEST_IF_FALSE(ret2 == newAddr);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_move_fixed_zero(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/*
+	 * MREMAP_FIXED can move the mapping to zero address
+	 */
+	ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else {
+		FAIL_TEST_IF_FALSE(ret2 == 0);
+
+	}
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_move_dontunmap(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* mremap to move, and don't unmap src addr. */
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else {
+		FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED);
+
+	}
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mremap_move_dontunmap_anyaddr(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/*
+	 * The 0xdeaddead should not have effect on dest addr
+	 * when MREMAP_DONTUNMAP is set.
+	 */
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+			0xdeaddead);
+	if (seal) {
+		FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+		FAIL_TEST_IF_FALSE(errno == EPERM);
+	} else {
+		FAIL_TEST_IF_FALSE(ret2 != MAP_FAILED);
+		FAIL_TEST_IF_FALSE((long)ret2 != 0xdeaddead);
+
+	}
+
+	TEST_END_CHECK();
+}
+
+
+static void test_seal_mmap_seal(void)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	ptr = sys_mmap(NULL, size, PROT_READ | PROT_SEAL, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	ret = sys_munmap(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_merge_and_split(void)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size;
+	int ret;
+
+	/* (24 RO) */
+	setup_single_address(24 * page_size, &ptr);
+
+	/* use mprotect(NONE) to set out boundary */
+	/* (1 NONE) (22 RO) (1 NONE) */
+	ret = sys_mprotect(ptr, page_size, PROT_NONE);
+	FAIL_TEST_IF_FALSE(!ret);
+	ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 22 * page_size);
+
+	/* use mseal to split from beginning */
+	/* (1 NONE) (1 RO_SEAL) (21 RO) (1 NONE) */
+	ret = sys_mseal(ptr + page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == page_size);
+	size = get_vma_size(ptr + 2 * page_size);
+	FAIL_TEST_IF_FALSE(size == 21 * page_size);
+
+	/* use mseal to split from the end. */
+	/* (1 NONE) (1 RO_SEAL) (20 RO) (1 RO_SEAL) (1 NONE) */
+	ret = sys_mseal(ptr + 22 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + 22 * page_size);
+	FAIL_TEST_IF_FALSE(size == page_size);
+	size = get_vma_size(ptr + 2 * page_size);
+	FAIL_TEST_IF_FALSE(size == 20 * page_size);
+
+	/* merge with prev. */
+	/* (1 NONE) (2 RO_SEAL) (19 RO) (1 RO_SEAL) (1 NONE) */
+	ret = sys_mseal(ptr + 2 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr +  page_size);
+	FAIL_TEST_IF_FALSE(size ==  2 * page_size);
+
+	/* merge with after. */
+	/* (1 NONE) (2 RO_SEAL) (18 RO) (2 RO_SEALS) (1 NONE) */
+	ret = sys_mseal(ptr + 21 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr +  21 * page_size);
+	FAIL_TEST_IF_FALSE(size ==  2 * page_size);
+
+	/* split and merge from prev */
+	/* (1 NONE) (3 RO_SEAL) (17 RO) (2 RO_SEALS) (1 NONE) */
+	ret = sys_mseal(ptr + 2 * page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr +  1 * page_size);
+	FAIL_TEST_IF_FALSE(size ==  3 * page_size);
+	ret = sys_munmap(ptr + page_size,  page_size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+	ret = sys_mprotect(ptr + 2 * page_size, page_size,  PROT_NONE);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	/* split and merge from next */
+	/* (1 NONE) (3 RO_SEAL) (16 RO) (3 RO_SEALS) (1 NONE) */
+	ret = sys_mseal(ptr + 20 * page_size, 2 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr +  20 * page_size);
+	FAIL_TEST_IF_FALSE(size ==  3 * page_size);
+
+	/* merge from middle of prev and middle of next. */
+	/* (1 NONE) (22 RO_SEAL) (1 NONE) */
+	ret = sys_mseal(ptr + 2 * page_size, 20 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr +  page_size);
+	FAIL_TEST_IF_FALSE(size ==  22 * page_size);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_mmap_merge(void)
+{
+
+	void *ptr, *ptr2;
+	unsigned long page_size = getpagesize();
+	unsigned long size;
+	int ret;
+
+	/* (24 RO) */
+	setup_single_address(24 * page_size, &ptr);
+
+	/* use mprotect(NONE) to set out boundary */
+	/* (1 NONE) (22 RO) (1 NONE) */
+	ret = sys_mprotect(ptr, page_size, PROT_NONE);
+	FAIL_TEST_IF_FALSE(!ret);
+	ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 22 * page_size);
+
+	/* use munmap to free 2 segment of memory. */
+	/* (1 NONE) (1 free) (20 RO) (1 free) (1 NONE) */
+	ret = sys_munmap(ptr + page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr + 22 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* apply seal to the middle */
+	/* (1 NONE) (1 free) (20 RO_SEAL) (1 free) (1 NONE) */
+	ret = sys_mseal(ptr + 2 * page_size, 20 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + 2 * page_size);
+	FAIL_TEST_IF_FALSE(size == 20 * page_size);
+
+	/* allocate a mapping at beginning, and make sure it merges. */
+	/* (1 NONE) (21 RO_SEAL) (1 free) (1 NONE) */
+	ptr2 = sys_mmap(ptr + page_size, page_size, PROT_READ | PROT_SEAL,
+		MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 21 * page_size);
+
+	/* allocate a mapping at end, and make sure it merges. */
+	/* (1 NONE) (22 RO_SEAL) (1 NONE) */
+	ptr2 = sys_mmap(ptr + 22 * page_size, page_size, PROT_READ | PROT_SEAL,
+		MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 22 * page_size);
+
+	TEST_END_CHECK();
+}
+
+static void test_not_sealable(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	TEST_END_CHECK();
+}
+
+static void test_mmap_fixed_change_to_sealable(void)
+{
+	int ret;
+	void *ptr, *ptr2;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	ptr2 = sys_mmap(ptr, size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 == ptr);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_mmap_fixed_change_to_not_sealable(void)
+{
+	int ret;
+	void *ptr, *ptr2;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	ptr = sys_mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	ptr2 = sys_mmap(ptr, size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 == ptr);
+
+	ret = sys_mseal(ptr, size);
+	FAIL_TEST_IF_FALSE(ret < 0);
+
+	TEST_END_CHECK();
+}
+
+static void test_merge_sealable(void)
+{
+	int ret;
+	void *ptr, *ptr2;
+	unsigned long page_size = getpagesize();
+	unsigned long size;
+
+	/* (24 RO) */
+	setup_single_address(24 * page_size, &ptr);
+
+	/* use mprotect(NONE) to set out boundary */
+	/* (1 NONE) (22 RO) (1 NONE) */
+	ret = sys_mprotect(ptr, page_size, PROT_NONE);
+	FAIL_TEST_IF_FALSE(!ret);
+	ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 22 * page_size);
+
+	/* (1 NONE) (RO) (4 free) (17 RO) (1 NONE) */
+	ret = sys_munmap(ptr + 2 * page_size,  4 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 1 * page_size);
+	size = get_vma_size(ptr +  6 * page_size);
+	FAIL_TEST_IF_FALSE(size == 17 * page_size);
+
+	/* (1 NONE) (RO) (1 free) (2 RO) (1 free) (17 RO) (1 NONE) */
+	ptr2 = sys_mmap(ptr + 3 * page_size, 2 * page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	size = get_vma_size(ptr + 3 * page_size);
+	FAIL_TEST_IF_FALSE(size == 2 * page_size);
+
+	/* (1 NONE) (RO) (1 free) (20 RO) (1 NONE) */
+	ptr2 = sys_mmap(ptr + 5 * page_size, 1 * page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + 3 * page_size);
+	FAIL_TEST_IF_FALSE(size == 20 * page_size);
+
+	/* (1 NONE) (RO) (1 free) (19 RO) (1 RO_SEAL) (1 NONE) */
+	ret = sys_mseal(ptr + 22 * page_size, page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* (1 NONE) (RO) (not sealable) (19 RO) (1 RO_SEAL) (1 NONE) */
+	ptr2 = sys_mmap(ptr + 2 * page_size, page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == page_size);
+	size = get_vma_size(ptr + 2 * page_size);
+	FAIL_TEST_IF_FALSE(size == page_size);
+
+	/* (1 NONE) (1 free) (1 NOT_SEALABLE) (19 free) (1 RO_SEAL) (1 NONE) */
+	ret = sys_munmap(ptr + page_size,  page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+	ret = sys_munmap(ptr + 3 * page_size,  19 * page_size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* (1 NONE) (2 NOT_SEALABLE) (19 free) (1 RO_SEAL) (1 NONE) */
+	ptr2 = sys_mmap(ptr + page_size, page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 2 * page_size);
+
+	/* (1 NONE) (21 NOT_SEALABLE)(1 RO_SEAL) (1 NONE) */
+	ptr2 = sys_mmap(ptr + 3 * page_size, 19 * page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	FAIL_TEST_IF_FALSE(size == 21 * page_size);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_discard_ro_anon_on_rw(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address_rw_sealable(size, &ptr, seal);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* sealing doesn't take effect on RW memory. */
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* base seal still apply. */
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_discard_ro_anon_on_pkey(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	int pkey;
+
+	SKIP_TEST_IF_FALSE(pkey_supported());
+
+	setup_single_address_rw_sealable(size, &ptr, seal);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	pkey = sys_pkey_alloc(0, 0);
+	FAIL_TEST_IF_FALSE(pkey > 0);
+
+	ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ | PROT_WRITE, pkey);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* sealing doesn't take effect if PKRU allow write. */
+	set_pkey(pkey, 0);
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	/* sealing will take effect if PKRU deny write. */
+	set_pkey(pkey, PKEY_DISABLE_WRITE);
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	/* base seal still apply. */
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_discard_ro_anon_on_filebacked(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	int fd;
+	unsigned long mapflags = MAP_PRIVATE;
+
+	if (seal)
+		mapflags |= MAP_SEALABLE;
+
+	fd = memfd_create("test", 0);
+	FAIL_TEST_IF_FALSE(fd > 0);
+
+	ret = fallocate(fd, 0, 0, size);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ptr = sys_mmap(NULL, size, PROT_READ, mapflags, fd, 0);
+	FAIL_TEST_IF_FALSE(ptr != MAP_FAILED);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* sealing doesn't apply for file backed mapping. */
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+	close(fd);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_discard_ro_anon_on_shared(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	unsigned long mapflags = MAP_ANONYMOUS | MAP_SHARED;
+
+	if (seal)
+		mapflags |= MAP_SEALABLE;
+
+	ptr = sys_mmap(NULL, size, PROT_READ, mapflags, -1, 0);
+	FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size);
+		FAIL_TEST_IF_FALSE(!ret);
+	}
+
+	/* sealing doesn't apply for shared mapping. */
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+static void test_seal_discard_ro_anon(bool seal)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_single_address(ptr, size);
+
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		FAIL_TEST_IF_FALSE(ret < 0);
+	else
+		FAIL_TEST_IF_FALSE(!ret);
+
+	TEST_END_CHECK();
+}
+
+int main(int argc, char **argv)
+{
+	bool test_seal = seal_support();
+
+	ksft_print_header();
+
+	if (!test_seal)
+		ksft_exit_skip("sealing not supported, check CONFIG_64BIT\n");
+
+	if (!pkey_supported())
+		ksft_print_msg("PKEY not supported\n");
+
+	ksft_set_plan(86);
+
+	test_seal_addseal();
+	test_seal_unmapped_start();
+	test_seal_unmapped_middle();
+	test_seal_unmapped_end();
+	test_seal_multiple_vmas();
+	test_seal_split_start();
+	test_seal_split_end();
+	test_seal_invalid_input();
+	test_seal_zero_length();
+	test_seal_twice();
+
+	test_seal_mprotect(false);
+	test_seal_mprotect(true);
+
+	test_seal_start_mprotect(false);
+	test_seal_start_mprotect(true);
+
+	test_seal_end_mprotect(false);
+	test_seal_end_mprotect(true);
+
+	test_seal_mprotect_unalign_len(false);
+	test_seal_mprotect_unalign_len(true);
+
+	test_seal_mprotect_unalign_len_variant_2(false);
+	test_seal_mprotect_unalign_len_variant_2(true);
+
+	test_seal_mprotect_two_vma(false);
+	test_seal_mprotect_two_vma(true);
+
+	test_seal_mprotect_two_vma_with_split(false);
+	test_seal_mprotect_two_vma_with_split(true);
+
+	test_seal_mprotect_partial_mprotect(false);
+	test_seal_mprotect_partial_mprotect(true);
+
+	test_seal_mprotect_two_vma_with_gap(false);
+	test_seal_mprotect_two_vma_with_gap(true);
+
+	test_seal_mprotect_merge(false);
+	test_seal_mprotect_merge(true);
+
+	test_seal_mprotect_split(false);
+	test_seal_mprotect_split(true);
+
+	test_seal_munmap(false);
+	test_seal_munmap(true);
+	test_seal_munmap_two_vma(false);
+	test_seal_munmap_two_vma(true);
+	test_seal_munmap_vma_with_gap(false);
+	test_seal_munmap_vma_with_gap(true);
+
+	test_munmap_start_freed(false);
+	test_munmap_start_freed(true);
+	test_munmap_middle_freed(false);
+	test_munmap_middle_freed(true);
+	test_munmap_end_freed(false);
+	test_munmap_end_freed(true);
+
+	test_seal_mremap_shrink(false);
+	test_seal_mremap_shrink(true);
+	test_seal_mremap_expand(false);
+	test_seal_mremap_expand(true);
+	test_seal_mremap_move(false);
+	test_seal_mremap_move(true);
+
+	test_seal_mremap_shrink_fixed(false);
+	test_seal_mremap_shrink_fixed(true);
+	test_seal_mremap_expand_fixed(false);
+	test_seal_mremap_expand_fixed(true);
+	test_seal_mremap_move_fixed(false);
+	test_seal_mremap_move_fixed(true);
+	test_seal_mremap_move_dontunmap(false);
+	test_seal_mremap_move_dontunmap(true);
+	test_seal_mremap_move_fixed_zero(false);
+	test_seal_mremap_move_fixed_zero(true);
+	test_seal_mremap_move_dontunmap_anyaddr(false);
+	test_seal_mremap_move_dontunmap_anyaddr(true);
+	test_seal_discard_ro_anon(false);
+	test_seal_discard_ro_anon(true);
+	test_seal_discard_ro_anon_on_rw(false);
+	test_seal_discard_ro_anon_on_rw(true);
+	test_seal_discard_ro_anon_on_shared(false);
+	test_seal_discard_ro_anon_on_shared(true);
+	test_seal_discard_ro_anon_on_filebacked(false);
+	test_seal_discard_ro_anon_on_filebacked(true);
+	test_seal_mmap_overwrite_prot(false);
+	test_seal_mmap_overwrite_prot(true);
+	test_seal_mmap_expand(false);
+	test_seal_mmap_expand(true);
+	test_seal_mmap_shrink(false);
+	test_seal_mmap_shrink(true);
+
+	test_seal_mmap_seal();
+	test_seal_merge_and_split();
+	test_seal_mmap_merge();
+
+	test_not_sealable();
+	test_merge_sealable();
+	test_mmap_fixed_change_to_sealable();
+	test_mmap_fixed_change_to_not_sealable();
+
+	test_seal_zero_address();
+
+	test_seal_discard_ro_anon_on_pkey(false);
+	test_seal_discard_ro_anon_on_pkey(true);
+
+	ksft_finished();
+	return 0;
+}
-- 
2.43.0.429.g432eaa2c6b-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 4/4] mseal:add documentation
  2024-01-31 17:50 [PATCH v8 0/4] Introduce mseal jeffxu
                   ` (2 preceding siblings ...)
  2024-01-31 17:50 ` [PATCH v8 3/4] selftest mm/mseal memory sealing jeffxu
@ 2024-01-31 17:50 ` jeffxu
  2024-01-31 19:34 ` [PATCH v8 0/4] Introduce mseal Liam R. Howlett
  4 siblings, 0 replies; 50+ messages in thread
From: jeffxu @ 2024-01-31 17:50 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Add documentation for mseal().

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 Documentation/userspace-api/index.rst |   1 +
 Documentation/userspace-api/mseal.rst | 215 ++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)
 create mode 100644 Documentation/userspace-api/mseal.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 09f61bd2ac2e..178f6a1d79cb 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -26,6 +26,7 @@ place where this information is gathered.
    iommu
    iommufd
    media/index
+   mseal
    netlink/index
    sysfs-platform_profile
    vduse
diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
new file mode 100644
index 000000000000..6bfac0622178
--- /dev/null
+++ b/Documentation/userspace-api/mseal.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Introduction of mseal
+=====================
+
+:Author: Jeff Xu <jeffxu@chromium.org>
+
+Modern CPUs support memory permissions such as RW and NX bits. The memory
+permission feature improves security stance on memory corruption bugs, i.e.
+the attacker can’t just write to arbitrary memory and point the code to it,
+the memory has to be marked with X bit, or else an exception will happen.
+
+Memory sealing additionally protects the mapping itself against
+modifications. This is useful to mitigate memory corruption issues where a
+corrupted pointer is passed to a memory management system. For example,
+such an attacker primitive can break control-flow integrity guarantees
+since read-only memory that is supposed to be trusted can become writable
+or .text pages can get remapped. Memory sealing can automatically be
+applied by the runtime loader to seal .text and .rodata pages and
+applications can additionally seal security critical data at runtime.
+
+A similar feature already exists in the XNU kernel with the
+VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
+
+User API
+========
+Two system calls are involved in virtual memory sealing, mseal() and mmap().
+
+mseal()
+-----------
+The mseal() syscall has the following signature:
+
+``int mseal(void addr, size_t len, unsigned long flags)``
+
+**addr/len**: virtual memory address range.
+
+The address range set by ``addr``/``len`` must meet:
+   - The start address must be in an allocated VMA.
+   - The start address must be page aligned.
+   - The end address (``addr`` + ``len``) must be in an allocated VMA.
+   - no gap (unallocated memory) between start and end address.
+
+The ``len`` will be paged aligned implicitly by the kernel.
+
+**flags**: reserved for future use.
+
+**return values**:
+
+- ``0``: Success.
+
+- ``-EINVAL``:
+    - Invalid input ``flags``.
+    - The start address (``addr``) is not page aligned.
+    - Address range (``addr`` + ``len``) overflow.
+
+- ``-ENOMEM``:
+    - The start address (``addr``) is not allocated.
+    - The end address (``addr`` + ``len``) is not allocated.
+    - A gap (unallocated memory) between start and end address.
+
+- ``-EACCES``:
+    - ``MAP_SEALABLE`` is not set during mmap().
+
+- ``-EPERM``:
+    - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
+
+- For above error cases, users can expect the given memory range is
+  unmodified, i.e. no partial update.
+
+- There might be other internal errors/cases not listed here, e.g.
+  error during merging/splitting VMAs, or the process reaching the max
+  number of supported VMAs. In those cases, partial updates to the given
+  memory range could happen. However, those cases should be rare.
+
+**Blocked operations after sealing**:
+    Unmapping, moving to another location, and shrinking the size,
+    via munmap() and mremap(), can leave an empty space, therefore
+    can be replaced with a VMA with a new set of attributes.
+
+    Moving or expanding a different VMA into the current location,
+    via mremap().
+
+    Modifying a VMA via mmap(MAP_FIXED).
+
+    Size expansion, via mremap(), does not appear to pose any
+    specific risks to sealed VMAs. It is included anyway because
+    the use case is unclear. In any case, users can rely on
+    merging to expand a sealed VMA.
+
+    mprotect() and pkey_mprotect().
+
+    Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
+    for anonymous memory, when users don't have write permission to the
+    memory. Those behaviors can alter region contents by discarding pages,
+    effectively a memset(0) for anonymous memory.
+
+    Kernel will return -EPERM for blocked operations.
+
+**Note**:
+
+- mseal() only works on 64-bit CPUs, not 32-bit CPU.
+
+- users can call mseal() multiple times, mseal() on an already sealed memory
+  is a no-action (not error).
+
+- munseal() is not supported.
+
+mmap()
+----------
+``void *mmap(void* addr, size_t length, int prot, int flags, int fd,
+off_t offset);``
+
+We add two changes in ``prot`` and ``flags`` of  mmap() related to
+memory sealing.
+
+**prot**
+
+The ``PROT_SEAL`` bit in ``prot`` field of mmap().
+
+When present, it marks the memory is sealed since creation.
+
+This is useful as optimization because it avoids having to make two
+system calls: one for mmap() and one for mseal().
+
+It's worth noting that even though the sealing is set via the
+``prot`` field in mmap(), it can't be set in the ``prot``
+field in later mprotect(). This is unlike the ``PROT_READ``,
+``PROT_WRITE``, ``PROT_EXEC`` bits, e.g. if ``PROT_WRITE`` is not set in
+mprotect(), it means that the region is not writable.
+
+Setting ``PROT_SEAL`` implies setting ``MAP_SEALABLE`` below.
+
+**flags**
+
+The ``MAP_SEALABLE`` bit in the ``flags`` field of mmap().
+
+When present, it marks the map as sealable. A map created
+without ``MAP_SEALABLE`` will not support sealing. In other words,
+mseal() will fail for such a map.
+
+Applications that don't care about sealing will expect their
+behavior unchanged. For those that need sealing support, opt in
+by adding ``MAP_SEALABLE`` in mmap().
+
+Use Case:
+=========
+- glibc:
+  The dynamic linker, during loading ELF executables, can apply sealing to
+  non-writable memory segments.
+
+- Chrome browser: protect some security sensitive data-structures.
+
+Notes On MAP_SEALABLE
+=====================
+With the MAP_SEALABLE flag in mmap(), the memory must be mmap() with
+MAP_SEALABLE, otherwise mseal() will fail. This raises the bar of
+which memory can be sealed.
+
+Today, in linux, sealing have known side effects if applied in below
+two cases:
+
+- aio/shm
+
+  aio/shm can mmap/munmap on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime of those mapping are not tied to the lifetime of the process. If those memories are sealed from userspace, then unmap will fail, causing leaks in VMA address space during the lifetime of the process.
+
+- Brk (heap/stack)
+
+  Currently, userspace applications can seal parts of the heap by calling malloc() and mseal().
+  let's assume following calls from user space:
+
+  - ptr = malloc(size);
+  - mprotect(ptr, size, RO);
+  - mseal(ptr, size);
+  - free(ptr);
+
+  Technically, before mseal() is added, the user can change the protection of the heap by calling mprotect(RO). As long as the user changes the protection back to RW before free(), the memory can be reused.
+
+  Adding mseal() into the picture, however, the heap is then sealed partially, the user can still free it, but the memory remains to be RO. In addition, the result of brk-shrink is nondeterministic, depending on if munmap() will try to free the sealed memory.(brk uses munmap to shrink the heap).
+
+  Given the heap is not marked with MAP_SEALABLE (at the time of this document's writing), this might discourage the inadvertent sealing on the heap.
+
+  It is noteworthy, nonetheless, for mappings that were created without the MAP_SEALABLE flag, a knowledgeable developer who wants to assume ownership of the memory range still has the option of mmap(MAP_FIXED|MAP_SEALABLE), which is equivalent to invoking munmap() and then mmap(MAP_FIXED). Indeed, a "not-allow-sealing" feature is not possible without some level of baseline sealing support and is out-of-scope currently.
+
+  In summary, the considerations for having MAP_SEALABLE are as follows:
+
+- Grants software owners the ability to incrementally incorporate sealing support for their designated memory ranges, such as brk.
+- Raises the bar for which memory can be sealed, and discourages inadvertent sealing.
+- Such a decision is reversible. In other words, a sysctl could be implemented to render all memory sealable in the future. However, if all memory were allowed to be sealable from the beginning, reversing that decision would be problematic.
+
+Additional notes:
+=================
+As Jann Horn pointed out in [3], there are still a few ways to write
+to RO memory, which is, in a way, by design. Those cases are not covered
+by mseal(). If applications want to block such cases, sandbox tools (such as
+seccomp, LSM, etc) might be considered.
+
+Those cases are:
+
+- Write to read-only memory through /proc/self/mem interface.
+- Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
+- userfaultfd.
+
+The idea that inspired this patch comes from Stephen Röttger’s work in V8
+CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
+
+Reference:
+==========
+[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+
+[2] https://man.openbsd.org/mimmutable.2
+
+[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+
+[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
-- 
2.43.0.429.g432eaa2c6b-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-01-31 17:50 [PATCH v8 0/4] Introduce mseal jeffxu
                   ` (3 preceding siblings ...)
  2024-01-31 17:50 ` [PATCH v8 4/4] mseal:add documentation jeffxu
@ 2024-01-31 19:34 ` Liam R. Howlett
  2024-02-01  1:27   ` Jeff Xu
  4 siblings, 1 reply; 50+ messages in thread
From: Liam R. Howlett @ 2024-01-31 19:34 UTC (permalink / raw)
  To: jeffxu, Jonathan Corbet
  Cc: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening, deraadt

Please add me to the Cc list of these patches.

* jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> This patchset proposes a new mseal() syscall for the Linux kernel.
> 
> In a nutshell, mseal() protects the VMAs of a given virtual memory
> range against modifications, such as changes to their permission bits.
> 
> Modern CPUs support memory permissions, such as the read/write (RW)
> and no-execute (NX) bits. Linux has supported NX since the release of
> kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> improves the security stance on memory corruption bugs, as an attacker
> cannot simply write to arbitrary memory and point the code to it. The
> memory must be marked with the X bit, or else an exception will occur.
> Internally, the kernel maintains the memory permissions in a data
> structure called VMA (vm_area_struct). mseal() additionally protects
> the VMA itself against modifications of the selected seal type.

... The v8 cut Jonathan's email discussion [1] off and instead of
replying there, I'm going to add my question here.

The best plan to ensure it is a general safety measure for all of linux
is to work with the community before it lands upstream.  It's much
harder to change functionality provided to users after it is upstream.
I'm happy to hear google is super excited about sharing this, but so
far, the community isn't as excited.

It seems Theo has a lot of experience trying to add a feature very close
to what you are doing and has real data on how this went [2].  Can we
see if there is a solution that is, at least, different enough from what
he tried to do for a shot of success?  Do we have anyone in the
toolchain groups that sees this working well?  If this means Stephen
needs to do something, can we get that to happen please?

I mean, you specifically state that this is a 'very specific
requirement' in your cover letter.  Does this mean even other browsers
have no use for it?

I am very concerned this feature will land and have to be maintained by
the core mm people for the one user it was specifically targeting.

Can we also get some benchmarking on the impact of this feature?  I
believe my answer in v7 removed the worst offender, but since there is
no benchmarking we really are guessing (educated or not, hard data would
help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
(and mreamp syscall?).

You also did not clean up the loop you copied from mlock, which I
pointed out [3].  Stating that your copy/paste is easier to review is
not sufficient to keep unneeded assignments around.

[1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/
[2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/
[3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-01-31 19:34 ` [PATCH v8 0/4] Introduce mseal Liam R. Howlett
@ 2024-02-01  1:27   ` Jeff Xu
  2024-02-01  1:46     ` Theo de Raadt
                       ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-01  1:27 UTC (permalink / raw)
  To: Liam R. Howlett, jeffxu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt

On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> Please add me to the Cc list of these patches.
Ok.
>
> * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > In a nutshell, mseal() protects the VMAs of a given virtual memory
> > range against modifications, such as changes to their permission bits.
> >
> > Modern CPUs support memory permissions, such as the read/write (RW)
> > and no-execute (NX) bits. Linux has supported NX since the release of
> > kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> > improves the security stance on memory corruption bugs, as an attacker
> > cannot simply write to arbitrary memory and point the code to it. The
> > memory must be marked with the X bit, or else an exception will occur.
> > Internally, the kernel maintains the memory permissions in a data
> > structure called VMA (vm_area_struct). mseal() additionally protects
> > the VMA itself is against modifications of the selected seal type.
>
> ... The v8 cut Jonathan's email discussion [1] off and
> instead of
> replying there, I'm going to add my question here.
>
> The best plan to ensure it is a general safety measure for all of linux
> is to work with the community before it lands upstream.  It's much
> harder to change functionality provided to users after it is upstream.
> I'm happy to hear google is super excited about sharing this, but so
> far, the community isn't as excited.
>
> It seems Theo has a lot of experience trying to add a feature very close
> to what you are doing and has real data on how this went [2].  Can we
> see if there is a solution that is, at least, different enough from what
> he tried to do for a shot of success?  Do we have anyone in the
> toolchain groups that sees this working well?  If this means Stephen
> needs to do something, can we get that to happen please?
>
For Theo's input from OpenBSD's perspective;
IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same
scope on what operations to seal, e.g. considering the progress made
on both sides since the beginning of the RFC:
- mseal(Linux): dropped "multiple-bit" approach.
- mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).

The difference is in mmap(), i.e.
- mseal(Linux): support of PROT_SEAL in mmap().
- mseal(Linux): use of MAP_SEALABLE in mmap().

I considered Theo's inputs from OpenBSD's perspective regarding the
difference, and I wasn't convinced that Linux should remove these. In
my view, those are two different kernels code, and the difference in
Linux is not added without reasons (for MAP_SEALABLE, there is a note
in the documentation section with details).

I would love to hear more from Linux developers on this.

> I mean, you specifically state that this is a 'very specific
> requirement' in your cover letter.  Does this mean even other browsers
> have no use for it?
>
No, I don’t mean “other browsers have no use for it”.

About specific requirements from Chrome, that refers to "The lifetime
of those mappings are not tied to the lifetime of the process, which
is not the case of libc" as in the cover letter. This addition to the
cover letter was made in V3, thus, it might be beneficial to provide
additional context to help answer the question.

This patch series begins with multiple-bit approaches (v1,v2,v3), the
rationale for this is that I am uncertain if Chrome's specific needs
are common enough for other use cases.  Consequently, I am unable to
make this decision myself without input from the community. To
accommodate this, multiple bits are selected initially due to their
adaptability.

Since V1, after hearing from the community, Chrome has changed its
design (no longer relying on separating out mprotect), and Linus
acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
today mseal() has a simple design that:
 - meet Chrome's specific needs.
 - meet Libc's needs.
 - Chrome's specific need doesn't interfere with Libc's.

[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

> I am very concerned this feature will land and have to be maintained by
> the core mm people for the one user it was specifically targeting.
>
See above. This feature is not specifically targeting Chrome.

> Can we also get some benchmarking on the impact of this feature?  I
> believe my answer in v7 removed the worst offender, but since there is
> no benchmarking we really are guessing (educated or not, hard data would
> help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> (and mreamp syscall?).
>
Yes. There is an extra loop in mmap(FIXED), munmap(),
madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
address range. I suspect the impact would be low, but having some hard
data would be good. I will see what I can find to assist the perf
testing. If you have a specific test suite in mind, I can also try it.

> You also did not clean up the loop you copied from mlock, which I
> pointed out [3].  Stating that your copy/paste is easier to review is
> not sufficient to keep unneeded assignments around.
>
OK.

> [1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/
> [2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/
> [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01  1:27   ` Jeff Xu
@ 2024-02-01  1:46     ` Theo de Raadt
  2024-02-01 16:56       ` Bird, Tim
  2024-02-01  1:55     ` Theo de Raadt
  2024-02-01 20:45     ` Liam R. Howlett
  2 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-01  1:46 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

Jeff Xu <jeffxu@chromium.org> wrote:

> I considered Theo's inputs from OpenBSD's perspective regarding the
> difference, and I wasn't convinced that Linux should remove these. In
> my view, those are two different kernels code, and the difference in
> Linux is not added without reasons (for MAP_SEALABLE, there is a note
> in the documentation section with details).

That note is describing a fiction.

> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.

But I'll repeat for others to stop this train wreck:


1. When execve() maps a programs's .data section, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

   Does the kernel seal the .data section?  It cannot, because of RELRO
   and IFUNCS.  Do you know what those are?  (like in OpenBSD) the kernel
   cannot and will *not* seal the .data section, it lets later code do that.

2. When execve() maps a programs's .bss section, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

   Does the kernel seal the .bss section?  It cannot, because of RELRO
   and IFUNCS.  Do you know what those are?  (like in OpenBSD) the kernel
   cannot and will *not* seal the .bss section, it lets later code do that.

In the proposed diff, the kernel does not set MAP_SEALABLE on those
regions.

How does a userland program seal the .data and .bss regions?

It cannot.  It is too late to set the MAP_SEALABLE, because the kernel
already decided not do to it.

So those regions cannot be sealed.

3. When execve() maps a programs's stack, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

In the proposed diff, the kernel does not set MAP_SEALABLE.

You think you can seal the stack in the kernel??  Sorry to be the bearer
of bad news, but glibc has code which on occasion will mprotects the
stack executable.

But if userland decides that mprotect case won't occur -- how does a
userland program seal its stack?  It is now too late to set MAP_SEALABLE.

So the stack must remain unsealed.

4. What about the text segment?

5. Do you know what a text-relocation is?  They are now rare, but there
   are still compile/linker stages which will produce them, and there is
   software which requires that to work.  It means userland fixes it's
   own .text, then calls mprotect.  The kernel does not know if this will
   happen.

6. When execve() maps the .text segment, will it set MAP_SEALABLE?

If it doesn't set it, userland cannot seal it's text after it makes the
decision to do.


You can continue to extrapolate those same points for all other segments
of a static binary, all segments of a dynamic binary, all segments of the
shared library linker.

And then you can go further, and recognize the logic that will be needed
in the shared library linker to *make the same decisions*.

In each case, the *decision* to make a mapping happens in one piece of
code, and the decision to use and NOW SEAL THAT MAPPING, happens in a
different piece of code.


The only answer to these problems will be to always set MAP_SEALABLE.
To go through the entire Linux ecosystem, and change every call to mmap()
to use this new MAP_SEALABLE flag, and it will look something like this:

+#ifndef MAP_SEALABLE
+#define MAP_SEALABLE 0
+#endif
-	ptr = mmap(...., MAP...
-	ptr = mmap(...., MAP_SEALABLE | MAP...

Every single one of them, and you'll need to do it in the kernel.




If you had spent a second trying to make this work in a second piece of
software, you would have realized that the ONLY way this could work
is by adding a flag with the opposite meaning:

   MAP_NOTSEALABLE

But nothing will use that.  I promise you


> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01  1:27   ` Jeff Xu
  2024-02-01  1:46     ` Theo de Raadt
@ 2024-02-01  1:55     ` Theo de Raadt
  2024-02-01 20:45     ` Liam R. Howlett
  2 siblings, 0 replies; 50+ messages in thread
From: Theo de Raadt @ 2024-02-01  1:55 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

I'd like to propose a new flag to the Linux open() system call.

It is

   O_DUPABLE

You mix it with other O_* flags to the open call, everyone is familiar
with this, it is very easy to use.

If the O_DUPABLE flag is set, the file descriptor may be cloned with
dup(), dup2() or similar call.  If not set, those calls will return with
-1 EPERM.

I know it goes strongly against the grain of ancient assumptions that
file descriptors (just like memory) are fully mutable, and therefore
managed with care.  But in these trying times, we need protection against
file descriptor desecration.

It protects programmers from accidentally making clones of file
descriptors and leaking them out of programs, like I dunno, runc.
OK, besides this one very specific place that could (maybe) use
it today, there is other code which can use this but the margin is too narrow to contain.

The documentation can describe the behaviour as similar to MAP_SEALABLE,
so that noone is shocked.

/sarc

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v8 0/4] Introduce mseal
  2024-02-01  1:46     ` Theo de Raadt
@ 2024-02-01 16:56       ` Bird, Tim
  0 siblings, 0 replies; 50+ messages in thread
From: Bird, Tim @ 2024-02-01 16:56 UTC (permalink / raw)
  To: Theo de Raadt, Jeff Xu
  Cc: Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening



> -----Original Message-----
> From: Theo de Raadt <deraadt@openbsd.org>
> > I would love to hear more from Linux developers on this.
> 
> I'm not sure you are capable of listening.
> 

Theo,

It is possible to make your technical points, and even to express frustration that it has
been difficult to get them across, without resorting to personal attacks.

 -- Tim


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01  1:27   ` Jeff Xu
  2024-02-01  1:46     ` Theo de Raadt
  2024-02-01  1:55     ` Theo de Raadt
@ 2024-02-01 20:45     ` Liam R. Howlett
  2024-02-01 22:24       ` Theo de Raadt
                         ` (2 more replies)
  2 siblings, 3 replies; 50+ messages in thread
From: Liam R. Howlett @ 2024-02-01 20:45 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jonathan Corbet, akpm, keescook, jannh, sroettger, willy, gregkh,
	torvalds, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

* Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > Please add me to the Cc list of these patches.
> Ok.
> >
> > * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > This patchset proposes a new mseal() syscall for the Linux kernel.
> > >
> > > In a nutshell, mseal() protects the VMAs of a given virtual memory
> > > range against modifications, such as changes to their permission bits.
> > >
> > > Modern CPUs support memory permissions, such as the read/write (RW)
> > > and no-execute (NX) bits. Linux has supported NX since the release of
> > > kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> > > improves the security stance on memory corruption bugs, as an attacker
> > > cannot simply write to arbitrary memory and point the code to it. The
> > > memory must be marked with the X bit, or else an exception will occur.
> > > Internally, the kernel maintains the memory permissions in a data
> > > structure called VMA (vm_area_struct). mseal() additionally protects
> > > the VMA itself is against modifications of the selected seal type.
> >
> > ... The v8 cut Jonathan's email discussion [1] off and
> > instead of
> > replying there, I'm going to add my question here.
> >
> > The best plan to ensure it is a general safety measure for all of linux
> > is to work with the community before it lands upstream.  It's much
> > harder to change functionality provided to users after it is upstream.
> > I'm happy to hear google is super excited about sharing this, but so
> > far, the community isn't as excited.
> >
> > It seems Theo has a lot of experience trying to add a feature very close
> > to what you are doing and has real data on how this went [2].  Can we
> > see if there is a solution that is, at least, different enough from what
> > he tried to do for a shot of success?  Do we have anyone in the
> > toolchain groups that sees this working well?  If this means Stephen
> > needs to do something, can we get that to happen please?
> >
> For Theo's input from OpenBSD's perspective;
> IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same
> scope on what operations to seal, e.g. considering the progress made
> on both sides since the beginning of the RFC:
> - mseal(Linux): dropped "multiple-bit" approach.
> - mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).
> 
> The difference is in mmap(), i.e.
> - mseal(Linux): support of PROT_SEAL in mmap().
> - mseal(Linux): use of MAP_SEALABLE in mmap().
> 
> I considered Theo's inputs from OpenBSD's perspective regarding the
> difference, and I wasn't convinced that Linux should remove these. In
> my view, those are two different kernels code, and the difference in
> Linux is not added without reasons (for MAP_SEALABLE, there is a note
> in the documentation section with details).
> 
> I would love to hear more from Linux developers on this.

Linus said it was really important to get the semantics correct, but you
took his (unfinished) list and kept going.  I think there are some
unanswered questions and that's frustrating some people as you may not
be valuing the experience they have in this area.

You dropped the RFC from the topic and incremented the version numbering
on the patch set. I thought it was customary to restart counting after
the RFC was complete?  Maybe I'm wrong, but it seemed a bit odd to see
that happen.  The documentation also implies there are still questions
to be answered, so it seems this is still an RFC in some ways?


I'd like to talk about the design some more.

Having to opt-in to allowing mseal will probably not work well.

Initial library mappings happen in one huge chunk then it's cut up into
smaller VMAs, at least that's what I see with my maple tree tracing.  If
you opt-in, then the entire library will have to opt-in and so the
'discourage inadvertent sealing' argument is not very strong.

It also makes a somewhat messy tracking of inheritance of the attribute
across splitting, MAP_FIXED replacement, vma_move, vma_copy.  I think
most of this is forced on the user?

It makes your call less flexible, it means you have to hope that the VMA
origin was blessed before you decide you want to mseal it.

What if you want to ensure the library mapped by a parent or on launch
is mseal'ed?

What about the initial relocated VMA (expand/shrink of VMA)?

Creating something as "non-sealable" is pointless.  If you don't want it
sealed, then don't mseal() that region.

If your use case doesn't need it, then can we please drop the opt-in
behaviour and just have all VMAs treated the same?

If it does need it, can you explain why?

The glibc relocation/fixup will then work.  glibc could mseal once it is
complete - or an application could bypass glibc support and use the
feature itself.

If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
heap/stack concerns.  We can either let people shoot their own feet off
or try to protect them.

Right now, you seem to be trying to protect them.  Keeping with that, I
guess we could either get the kernel to mark those VMAs or tell some
other way?  I'd suggest a range, but people do very strange things with
these special VMAs [1].  I don't think you can predict enough crazy
actions to make a difference in trying to protect people.

There are far fewer VMAs that should not be allowed to be mseal'ed than
should be, and the kernel creates those so it seems logical to only let
the kernel opt-out on those ones.

I'd rather just let people shoot themselves and return an error.

I also hope it reduces the complexity of this code while increasing the
flexibility of the feature.  As stated before, we remove the dependency
of needing support from the initial loader.

Merging VMAs
I can see this going Very Bad with brk + mseal.  But, again, if someone
decides to mseal these VMAs then they should expect Bad Things to
happen (or maybe they know what they are doing even in some complex
situation?)

vma_merge() can also expand a VMA.  I think this is okay as it checks
for the same flags, so you will allow VMA expansion of two (or three)
vma areas to become one.  Is this okay in your model?

> 
> > I mean, you specifically state that this is a 'very specific
> > requirement' in your cover letter.  Does this mean even other browsers
> > have no use for it?
> >
> No, I don’t mean “other browsers have no use for it”.
> 
> About specific requirements from Chrome, that refers to "The lifetime
> of those mappings are not tied to the lifetime of the process, which
> is not the case of libc" as in the cover letter. This addition to the
> cover letter was made in V3, thus, it might be beneficial to provide
> additional context to help answer the question.
> 
> This patch series begins with multiple-bit approaches (v1,v2,v3), the
> rationale for this is that I am uncertain if Chrome's specific needs
> are common enough for other use cases.  Consequently, I am unable to
> make this decision myself without input from the community. To
> accommodate this, multiple bits are selected initially due to their
> adaptability.
> 
> Since V1, after hearing from the community, Chrome has changed its
> design (no longer relying on separating out mprotect), and Linus
> acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
> today mseal() has a simple design that:
>  - meet Chrome's specific needs.

How many VMAs will chrome have that are mseal'ed?  Is this a common
operation?

PROT_SEAL seems like an extra flag we could drop.  I don't expect we'll
be sealing enough VMAs that a hand full of extra syscalls would make a
difference?

>  - meet Libc's needs.

What needs of libc are you referring to?  I'm looking through the
version changelog and I guess you mean return EPERM?

>  - Chrome's specific need doesn't interfere with Libc's.
> 
> [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

Linus said he'd be happier if we made the change in general.

> 
> > I am very concerned this feature will land and have to be maintained by
> > the core mm people for the one user it was specifically targeting.
> >
> See above. This feature is not specifically targeting Chrome.
> 
> > Can we also get some benchmarking on the impact of this feature?  I
> > believe my answer in v7 removed the worst offender, but since there is
> > no benchmarking we really are guessing (educated or not, hard data would
> > help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > (and mreamp syscall?).
> >
> Yes. There is an extra loop in mmap(FIXED), munmap(),
> madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
> address range. I suspect the impact would be low, but having some hard
> data would be good. I will see what I can find to assist the perf
> testing. If you have a specific test suite in mind, I can also try it.

You should look at mmtests [2]. But since you are adding loops across
VMA ranges, you need to test loops across several ranges of VMAs.  That
is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
some subset of small and large numbers to get an idea of complexity we
are adding.  My hope is that the looping will be cache-hot in the maple
tree and have minimum effect.

In my personal testing, I've seen munmap often do a single VMA, or 3, or
more rare 7 on x86_64.  There should be some good starting points in
mmtests for the common operations.

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
[2] https://github.com/gormanm/mmtests

Thanks,
Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 20:45     ` Liam R. Howlett
@ 2024-02-01 22:24       ` Theo de Raadt
  2024-02-02  1:06         ` Greg KH
  2024-02-01 22:37       ` Jeff Xu
  2024-02-02  3:14       ` Jeff Xu
  2 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-01 22:24 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

There is another problem with adding PROT_SEAL to the mprotect()
call.

What are the precise semantics?

If one reviews how mprotect() behaves, it is quickly clear that
it is very sloppy specification.  We spent quite a bit of effort
making our manual page as clear as possible to the most it gaurantees,
in the standard, and in all the various Unix:

     Not all implementations will guarantee protection on a page basis; the
     granularity of protection changes may be as large as an entire region.
     Nor will all implementations guarantee to give exactly the requested
     permissions; more permissions may be granted than requested by prot.
     However, if PROT_WRITE was not specified then the page will not be
     writable.

Anything else is different.

That is the specification in case of PROT_READ, PROT_WRITE, and PROT_EXEC.

What happens if you add additional PROT_* flags?

Does mprotect still behave just as sloppy (as specified)?

Or it now return an error partway through an operation?

When it returns the error, does it skip doing the work on the remaining
region?

Or does it skip doing any protection operation at all? (That means the code
has to do two passes over the region; first one checks if it may proceed,
second pass performs the change.  I think I've reat PROT_SEAL was supposed
to try to do things as one pass; is that actually possible without requiring
a second pass in the kernel?

To wit, do these two sequences have _exactly_ the same behaviour in
all cases that we can think of
    - unmapped sub-regions
    - sealed sub-regions
    - and who knows what else mprotect() may encounter

a)

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

b)

    mprotect(addr, len, PROT_READ | PROT_SEAL);

Are they the same, or are they different?

Here's what I think: mprotect() behaves quite differently if you add
the PROT_SEAL flag, but I can't quite tell precisely what happens because
I don't understand the linux vm system enough.


(As an outsider, I have glanced at the new PROT_MTE flag changes; that
one seem to just "set a flag where possible", rather than performing
an action which could result in an error, and seems to not have this
problem).


As an outsider, Linux development is really strange:

Two sub-features are being pushed very hard, and the primary developer
doesn't have code which uses either of them.  And once it goes in, it
cannot be changed.

It's very different from my world, where the absolutely minimal
interface was written to apply to a whole operating system plus 10,000+
applications, and then took months of testing before it was approved for
inclusion.  And if it was subtly wrong, we would be able to change it.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 20:45     ` Liam R. Howlett
  2024-02-01 22:24       ` Theo de Raadt
@ 2024-02-01 22:37       ` Jeff Xu
  2024-02-01 22:54         ` Theo de Raadt
  2024-02-02  3:14       ` Jeff Xu
  2 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-01 22:37 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt

On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > I would love to hear more from Linux developers on this.
>
> Linus said it was really important to get the semantics correct, but you
> took his (unfinished) list and kept going.  I think there are some
> unanswered questions and that's frustrating some people as you may not
> be valuing the experience they have in this area.
>
Perhaps you didn't follow the discussions closely during the RFCs, so
I like to clarify the timeline:

- Dec.12:
RFC V3 was  out for comments: [1]
This version added MAP_SEALABLE and sealing type in mmap()
The sealing type in mmap() was suggested by  Pedro Falcato during V1. [2]
And MAP_SEALABLE is new to V3 and I added an open discussion in the
cover letter.

- Dec.14
Linus made a set of recommendations based on V3 [3], this is where
Linus mentioned the semantics.

Quoted below:
"Particularly for new system calls with fairly specialized use, I think
it's very important that the semantics are sensible on a conceptual
level, and that we do not add system calls that are based on "random
implementation issue of the day".

- Jan.4:
I sent out V4 of that patch for comments [5]
This version implements all of Linus's recommendations made in V3.

In V3, I didn't receive comments about MAP_SEALABLE, so I kept that as
an open discussion item in V4 and specifically mentioned it in the
first sentence of the V4 cover letter.

"This is V4 of the patch, the patch has improved significantly since V1,
thanks to diverse inputs, a few discussions remain, please read those
in the open discussion section of v4 of change history."

- Jan.4:
Linus  gave a comment on V4: [6]

Quoted below:
"Other than that, this seems all reasonable to me now."

To me, this means Linus is OK with the general signatures of the APIs.

-Jan.9
During comments for V5.
[7]  Kees suggested dropping RFC from subsequent versions, given
Linus's general approval
on the v4.

[1] https://lore.kernel.org/all/80897.1705769947@cvs.openbsd.org/T/#mbf4749d465b80a575e1eda3c6f0c66d995abfc39

[2]
https://lore.kernel.org/lkml/CAKbZUD2A+=bp_sd+Q0Yif7NJqMu8p__eb4yguq0agEcmLH8SDQ@mail.gmail.com/

[3]
https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

[4]
https://lore.kernel.org/all/CABi2SkUTdF6PHrudHTZZ0oWK-oU+T-5+7Eqnei4yCj2fsW2jHg@mail.gmail.com/#t

[5]
https://lore.kernel.org/lkml/796b6877-0548-4d2a-a484-ba4156104a20@infradead.org/T/#mb5c8bfe234759589cadf0bcee10eaa7e07b2301a

[6]
https://lore.kernel.org/lkml/CAHk-=wiy0nHG9+3rXzQa=W8gM8F6-MhsHrs_ZqWaHtjmPK4=FA@mail.gmail.com/

[7]
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/T/#m657fffd96ffff91902da53dc9dbc1bb093fe367c

> You dropped the RFC from the topic and increment the version numbering
> on the patch set. I thought it was customary to restart counting after
> the RFC was complete?  Maybe I'm wrong, but it seemed a bit odd to see
> that happen.  The documentation also implies there are still questions
> to be answered, so it seems this is still an RFC in some ways?
>
The RFC has been dropped since V6.
That said, I'm open to feedback from Linux developers.
I will respond to the rest of your email in seperate emails.

Best Regards.
-Jeff

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 22:37       ` Jeff Xu
@ 2024-02-01 22:54         ` Theo de Raadt
  2024-02-01 23:15           ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-01 22:54 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

> To me, this means Linus is OK with the general signatures of the APIs.


Linus, you are in for a shock when the proposal doesn't work for glibc
and all the applications!




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 2/4] mseal: add mseal syscall
  2024-01-31 17:50 ` [PATCH v8 2/4] mseal: add " jeffxu
@ 2024-02-01 23:11   ` Eric Biggers
  2024-02-02  3:30     ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Eric Biggers @ 2024-02-01 23:11 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening, deraadt

On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
> [PATCH v8 2/4] mseal: add mseal syscall
[...]
> +/*
> + * The PROT_SEAL defines memory sealing in the prot argument of mmap().
> + */
> +#define PROT_SEAL	0x04000000	/* _BITUL(26) */
> +
>  /* 0x01 - 0x03 are defined in linux/mman.h */
>  #define MAP_TYPE	0x0f		/* Mask for type of mapping */
>  #define MAP_FIXED	0x10		/* Interpret addr exactly */
> @@ -33,6 +38,9 @@
>  #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
>  					 * uninitialized */
>  
> +/* map is sealable */
> +#define MAP_SEALABLE	0x8000000	/* _BITUL(27) */

IMO this patch is misleading, as it claims to just be adding a new syscall, but
it actually adds three new UAPIs, only one of which is the new syscall.  The
other two new UAPIs are new flags to the mmap syscall.

Based on recent discussions, it seems the usefulness of the new mmap flags has
not yet been established.  Note also that there are only a limited number of
mmap flags remaining, so we should be careful about allocating them.

Therefore, why not start by just adding the mseal syscall, without the new mmap
flags alongside it?

I'll also note that the existing PROT_* flags seem to be conventionally used for
the CPU page protections, as opposed to kernel-specific properties of the VMA
object.  As such, PROT_SEAL feels a bit out of place anyway.  If it's added at
all it perhaps should be a MAP_* flag, not PROT_*.  I'm not sure this aspect has
been properly discussed yet, seeing as the patchset is presented as just adding
sys_mseal().  Some reviewers may not have noticed or considered the new flags.

- Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 22:54         ` Theo de Raadt
@ 2024-02-01 23:15           ` Linus Torvalds
  2024-02-01 23:43             ` Theo de Raadt
                               ` (3 more replies)
  0 siblings, 4 replies; 50+ messages in thread
From: Linus Torvalds @ 2024-02-01 23:15 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Jeff Xu, Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jeffxu, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus, you are in for a shock when the proposal doesn't work for glibc
> and all the applications!

Heh. I've enjoyed seeing your argumentative style that made you so
famous back in the days. Maybe it's always been there, but I haven't
seen the BSD people in so long that I'd forgotten all about it.

That said, famously argumentative or not, I think Theo is right, and I
do think the MAP_SEALABLE bit is nonsensical.

If somebody wants to mseal() a memory region, why would they need to
express that ahead of time?

So the part I think is sane is the mseal() system call itself, in that
it allows *potential* future expansion of the semantics.

But hopefully said future expansion isn't even needed, and all users
want the base experience, which is why I think PROT_SEAL (both to mmap
and to mprotect) makes sense as an alternative form.

So yes, to my mind

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

should basically give identical results to

    mprotect(addr, len, PROT_READ | PROT_SEAL);

and using PROT_SEAL at mmap() time is similarly the same obvious
notion of "map this, and then seal that mapping".

The reason for having "mseal()" as a separate call at all from the
PROT_SEAL bit is that it does allow possible future expansion (while
PROT_SEAL is just a single bit, and it won't change semantics) but
also so that you can do whatever prep-work in stages if you want to,
and then just go "now we seal it all".

          Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 23:15           ` Linus Torvalds
@ 2024-02-01 23:43             ` Theo de Raadt
  2024-02-02  0:26             ` Theo de Raadt
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 50+ messages in thread
From: Theo de Raadt @ 2024-02-01 23:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jeffxu, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So yes, to my mind
> 
>     mprotect(addr, len, PROT_READ);
>     mseal(addr, len, 0);
> 
> should basically give identical results to
> 
>     mprotect(addr, len, PROT_READ | PROT_SEAL);
> 
> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".

I think that isn't easy to do.  Let's expand it to show error checking.

    if (mprotect(addr, len, PROT_READ) == -1)
       react to the errno value
    if (mseal(addr, len, 0) == -1)
       react to the errno value

and

    if (mprotect(addr, len, PROT_READ | PROT_SEAL) == -1)
       react to the errno value

For current mprotect(), the errno values are mostly related to range
issues with the parameters.

After sealing a region, mprotect() also has the new errno EPERM.

But what is the return value supposed to be from "PROT_READ | PROT_SEAL"
over various sub-region types?

Say I have a region 3 pages long.  One page is unmapped, one page is
regular, and one page is sealed.  Re-arrange those 3 pages in all 6
permutations.  Try them all.

Does the returned errno change, based upon the order?
Does it do part of the operation, or all of the operation?

If the sealed page is first, the regular page is second, and the unmapped
page is 3rd, does it return an error or return 0?  Does it change the
permission on the 3rd page?  If it returns an error, has it changed any
permissions?

I don't think the diff follows the principle of

if an error is returned --> we know nothing was changed.
if success is returned --> we know all the requests were satisfied

> The reason for having "mseal()" as a separate call at all from the
> PROT_SEAL bit is that it does allow possible future expansion (while
> PROT_SEAL is just a single bit, and it won't change semantics) but
> also so that you can do whatever prep-work in stages if you want to,
> and then just go "now we seal it all".




How about you add basic mseal() that is maximum compatible with mimmutable(),
and then we can all talk about whether PROT_SEAL makes sense once there
are applications that demand it, and can prove they need it?


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 23:15           ` Linus Torvalds
  2024-02-01 23:43             ` Theo de Raadt
@ 2024-02-02  0:26             ` Theo de Raadt
  2024-02-02  3:20             ` Jeff Xu
  2024-02-02 17:05             ` Theo de Raadt
  3 siblings, 0 replies; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02  0:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jeffxu, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".

The usual way is:

    ptr = mmap(NULL, len PROT_READ|PROT_WRITE, ...)

    initialize region between ptr, ptr+len

    mprotect(ptr, len, PROT_READ)
    mseal(ptr, len, 0);


Our source tree contains one place where a locking happens very close
to a mmap().

It is the shared-library-linker 'hints file', this is a file that gets
mapped PROT_READ and then we lock it.

It feels like that could be one operation?  It can't be.

        addr = (void *)mmap(0, hsize, PROT_READ, MAP_PRIVATE, hfd, 0);
        if (_dl_mmap_error(addr))
                goto bad_hints;

        hheader = (struct hints_header *)addr;
        if (HH_BADMAG(*hheader) || hheader->hh_ehints > hsize)
                goto bad_hints;

	/* couple more error checks */

	mimmutable(addr, hsize);
	close(hfd);
	return (0);
bad_hints:
	munmap(addr, hsize);
	...

See the problem?  It unmaps it if the contents are broken.  So even that
case cannot use something like "PROT_SEAL".

These are not hypotheticals.  I'm grepping an entire Unix kernel and
userland source tree, and I know what 100,000+ applications do.  I found
piece of code that could almost use it, but upon inspection it can't,
and it is obvious why: it is best idiom to allow a programmer to insert
an inspection operation between two disctinct operations, and especially
critical if the 2nd operation cannot be reversed.

Noone needs PROT_SEAL as a shortcut operation in mmap() or mprotect().

Throwing around ideas without proving their use in practice is very
unscientific.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 22:24       ` Theo de Raadt
@ 2024-02-02  1:06         ` Greg KH
  2024-02-02  3:24           ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Greg KH @ 2024-02-02  1:06 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
> As an outsider, Linux development is really strange:
> 
> Two sub-features are being pushed very hard, and the primary developer
> doesn't have code which uses either of them.  And once it goes in, it
> cannot be changed.
> 
> It's very different from my world, where the absolutely minimal
> interface was written to apply to a whole operating system plus 10,000+
> applications, and then took months of testing before it was approved for
> inclusion.  And if it was subtly wrong, we would be able to change it.

No, it's this "feature" submission that is strange to think that we
don't need that.  We do need, and will require, an actual working
userspace something to use it, otherwise as you say, there's no way to
actually know if it works properly or not and we can't change it once we
accept it.

So along those lines, Jeff, do you have a pointer to the Chrome patches,
or glibc patches, that use this new interface that proves that it
actually works?  Those would be great to see to at least verify it's
been tested in a real-world situation and actually works for your use
case.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 20:45     ` Liam R. Howlett
  2024-02-01 22:24       ` Theo de Raadt
  2024-02-01 22:37       ` Jeff Xu
@ 2024-02-02  3:14       ` Jeff Xu
  2024-02-02 15:13         ` Liam R. Howlett
  2 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  3:14 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt

On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
>
> Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.

> Initial library mappings happen in one huge chunk then it's cut up into
> smaller VMAs, at least that's what I see with my maple tree tracing.  If
> you opt-in, then the entire library will have to opt-in and so the
> 'discourage inadvertent sealing' argument is not very strong.
>
Regarding "The initial library mappings happen in one huge chunk then
it is cut up into smaller VMAS", this is not a problem.

As example of elf loading (fs/binfmt_elf.c), there is just a few
places to pass in what type of memory to be allocated, e.g.
MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can  add MAP_SEALABLE at those
places.
If glic does additional splitting on the memory range, by using
mprotect(), then the MAP_SEALABLE is automatically applied after
splitting.
If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).

> It also makes a somewhat messy tracking of inheritance of the attribute
> across splitting, MAP_FIXED replacement, vma_move, vma_copy.  I think
> most of this is forced on the user?
>
The inheritance is the same as other VMA flags.

> It makes your call less flexible, it means you have to hope that the VMA
> origin was blessed before you decide you want to mseal it.
>
> What if you want to ensure the library mapped by a parent or on launch
> is mseal'ed?
>
> What about the initial relocated VMA (expand/shrink of VMA)?
>
> Creating something as "non-sealable" is pointless.  If you don't want it
> sealed, then don't mseal() that region.
>
> If your use case doesn't need it, then can we please drop the opt-in
> behaviour and just have all VMAs treated the same?
>
> If it does need it, can you explain why?
>
> The glibc relocation/fixup will then work.  glibc could mseal once it is
> complete - or an application could bypass glibc support and use the
> feature itself.

Yes. That is the idea.

>
> If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
> heap/stack concerns.  We can either let people shoot their own feet off
> or try to protect them.
>
> Right now, you seem to be trying to protect them.  Keeping with that, I
> guess we could either get the kernel to mark those VMAs or tell some
> other way?  I'd suggest a range, but people do very strange things with
> these special VMAs [1].  I don't think you can predict enough crazy
> actions to make a difference in trying to protect people.
>
> There are far fewer VMAs that should not be allowed to be mseal'ed than
> should be, and the kernel creates those so it seems logical to only let
> the kernel opt-out on those ones.
>
> I'd rather just let people shoot themselves and return an error.
>
> I also hope it reduces the complexity of this code while increasing the
> flexibility of the feature.  As stated before, we remove the dependency
> of needing support from the initial loader.
>
> Merging VMAs
> I can see this going Very Bad with brk + mseal.  But, again, if someone
> decides to mseal these VMAs then they should expect Bad Things to
> happen (or maybe they know what they are doing even in some complex
> situation?)
>
> vma_merge() can also expand a VMA.  I think this is okay as it checks
> for the same flags, so you will allow VMA expansion of two (or three)
> vma areas to become one.  Is this okay in your model?
>
> >
> > > I mean, you specifically state that this is a 'very specific
> > > requirement' in your cover letter.  Does this mean even other browsers
> > > have no use for it?
> > >
> > No, I don’t mean “other browsers have no use for it”.
> >
> > About specific requirements from Chrome, that refers to "The lifetime
> > of those mappings are not tied to the lifetime of the process, which
> > is not the case of libc" as in the cover letter. This addition to the
> > cover letter was made in V3, thus, it might be beneficial to provide
> > additional context to help answer the question.
> >
> > This patch series begins with multiple-bit approaches (v1,v2,v3), the
> > rationale for this is that I am uncertain if Chrome's specific needs
> > are common enough for other use cases.  Consequently, I am unable to
> > make this decision myself without input from the community. To
> > accommodate this, multiple bits are selected initially due to their
> > adaptability.
> >
> > Since V1, after hearing from the community, Chrome has changed its
> > design (no longer relying on separating out mprotect), and Linus
> > acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
> > today mseal() has a simple design that:
> >  - meet Chrome's specific needs.
>
> How many VMAs will chrome have that are mseal'ed?  Is this a common
> operation?
>
> PROT_SEAL seems like an extra flag we could drop.  I don't expect we'll
> be sealing enough VMAs that a hand full of extra syscalls would make a
> difference?
>
> >  - meet Libc's needs.
>
> What needs of libc are you referring to?  I'm looking through the
> version changelog and I guess you mean return EPERM?
>
I meant libc's sealing RO part of the elf binary, those memory's
lifetime are associated with the lifetime of the process.

> >  - Chrome's specific need doesn't interfere with Libc's.
> >
> > [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
>
> Linus said he'd be happier if we made the change in general.
>
> >
> > > I am very concerned this feature will land and have to be maintained by
> > > the core mm people for the one user it was specifically targeting.
> > >
> > See above. This feature is not specifically targeting Chrome.
> >
> > > Can we also get some benchmarking on the impact of this feature?  I
> > > believe my answer in v7 removed the worst offender, but since there is
> > > no benchmarking we really are guessing (educated or not, hard data would
> > > help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > > (and mreamp syscall?).
> > >
> > Yes. There is an extra loop in mmap(FIXED), munmap(),
> > madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
> > address range. I suspect the impact would be low, but having some hard
> > data would be good. I will see what I can find to assist the perf
> > testing. If you have a specific test suite in mind, I can also try it.
>
> You should look at mmtests [2]. But since you are adding loops across
> VMA ranges, you need to test loops across several ranges of VMAs.  That
> is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
> some subset of small and large numbers to get an idea of complexity we
> are adding.  My hope is that the looping will be cache-hot in the maple
> tree and have minimum effect.
>
> In my personal testing, I've seen munmap often do a single VMA, or 3, or
> more rare 7 on x86_64.  There should be some good starting points in
> mmtests for the common operations.
>
Thanks. Will do.


> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
> [2] https://github.com/gormanm/mmtests
>
> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 23:15           ` Linus Torvalds
  2024-02-01 23:43             ` Theo de Raadt
  2024-02-02  0:26             ` Theo de Raadt
@ 2024-02-02  3:20             ` Jeff Xu
  2024-02-02  4:05               ` Theo de Raadt
  2024-02-02 17:05             ` Theo de Raadt
  3 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  3:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theo de Raadt, Jeff Xu, Liam R. Howlett, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, usama.anjum, rdunlap,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 3:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Linus, you are in for a shock when the proposal doesn't work for glibc
> > and all the applications!
>
> Heh. I've enjoyed seeing your argumentative style that made you so
> famous back in the days. Maybe it's always been there, but I haven't
> seen the BSD people in so long that I'd forgotten all about it.
>
> That said, famously argumentative or not, I think Theo is right, and I
> do think the MAP_SEALABLE bit is nonsensical.
>
> If somebody wants to mseal() a memory region, why would they need to
> express that ahead of time?
>
I like to look at things from the point of view of average Linux
userspace developers,  they might not have the same level of expertise
as the other folks on this email list or they might not have time and
mileage for those details.

To me, the most important thing is to deliver a feature that's easy to
use and works well. I don't want users to mess things up, so if I'm
the one giving them the tools, I'm going to make sure they have all
the information they need and that there are safeguards in place.

e.g. considering the following user case:
1> a security sensitive data is allocated from heap, using malloc,
from the software component A, and filled with information.
2> software component B then uses mprotect to change it to RO, and
seal it using mseal().

Yes. we could choose to allow it. But there are complications:

1> Is this the right pattern ? why don't component A already seal it
if they think it is important ?
2> Why  heap, why not mmap() a new memory mapping for that security data ?
3>  free() will not respect the situation of whether the memory is
sealed or not. How would a new developer know they probably shall
never free the sealed memory ?
4>  brk-shrink will never be able to pass the VMA that gets splited
out by mseal(), there are memory footprint implications to the
process.
5>  what if the security sensitive data happens to be the first VMA or
last VMA of the heap, will sealing the first VMA/last VMA cause any
issue there ? since they might carry important VMA flags ? ( I don't
know enough about brk.)
6> If we ever support sealing the heap for its entirety (make it not
executable), and still want to support other brk behaviors, such as
shrink/grow, would that conflict with current mseal(), if we allow it
on heap from beginning ?

Questions like that, without clear answers, to me it is premature to
already let developers start using  mseal() for heap.

And even if we have all the answers for heap, how about stack, or
other types of virtual memory ?

Again, I don't have enough knowledge to get a complete list that
shouldn't be sealed,  the input from Theo is none should I worry
about.  However it  is clearly not none to me, besides heap mentioned,
there is also aio/shm.

So MAP_SEALABLE is a conservative approach to limit the scope to ***
two known use cases *** that  I want to work on (libc and chrome) and
give  time needed to answer those questions. It is like a claim: only
those marked by MAP_SEALABLE support the sealing at this point of
time.

And MAP_SEALABLE is reversible, e.g. a sysctl could be added to make
all memory sealable in the future, or we could obsoleted it entirely
when time comes, an application that already passes MAP_SEALABLE can
be treated as noop. However, if all memory were allowed to be sealable
from the beginning, reversing that decision would be hard.

After those considerations, if MAP_SEALABLE is still not preferred by
you. Then I have the following options for you to choose:

1. MAP_NOT_SEALABLE in the mmap(). And I will use them for the
heap/aio/shm case.
This basically says Linux does not officially support sealing on
those,  until we support them, we discourage the sealing on those
mappings.

2. make MAP_NOT_SEALABLE only a kernel visible flag. So application
space won't be able to use it.

3. open for all, and list as much as details in the documentation.
 If we choose this route, I would like to have more discussion on the
heap/stack, at least the Linux developers will learn from those
discussions.

> So the part I think is sane is the mseal() system call itself, in that
> it allows *potential* future expansion of the semantics.
>
> But hopefully said future expansion isn't even needed, and all users
> want the base experience, which is why I think PROT_SEAL (both to mmap
> and to mprotect) makes sense as an alternative form.
>
> So yes, to my mind
>
>     mprotect(addr, len, PROT_READ);
>     mseal(addr, len, 0);
>
> should basically give identical results to
>
>     mprotect(addr, len, PROT_READ | PROT_SEAL);
>
> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".
>
> The reason for having "mseal()" as a separate call at all from the
> PROT_SEAL bit is that it does allow possible future expansion (while
> PROT_SEAL is just a single bit, and it won't change semantics) but
> also so that you can do whatever prep-work in stages if you want to,
> and then just go "now we seal it all".
>

To clarify: do you mean to have the following ?

mmap(PROT_READ|PROT_SEAL)
mseal(addr,len,0)
mprotect(addr,len,PROT_READ|PROT_SEAL) ?

I have to think about the mprotect() case.

For mmap(PROT_READ|PROT_SEAL),  I might  have a use case already:

fs/binfmt_elf.c
if (current->personality & MMAP_PAGE_ZERO) {
                /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
                   and some applications "depend" upon this behavior.
                   Since we do not have the power to recompile these, we
                   emulate the SVr4 behavior. Sigh. */

                error = vm_mmap(NULL, 0, PAGE_SIZE,
                                PROT_READ | PROT_EXEC,   <-- add PROT_SEAL
                                MAP_FIXED | MAP_PRIVATE, 0);
        }

I don't see the benefit of RWX page 0, which might make a null
pointers error to become executable for some code.

Best Regards,
-Jeff

>           Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  1:06         ` Greg KH
@ 2024-02-02  3:24           ` Jeff Xu
  2024-02-02  3:29             ` Linus Torvalds
  2024-02-02 15:18             ` Greg KH
  0 siblings, 2 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  3:24 UTC (permalink / raw)
  To: Greg KH
  Cc: Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 5:06 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
> > As an outsider, Linux development is really strange:
> >
> > Two sub-features are being pushed very hard, and the primary developer
> > doesn't have code which uses either of them.  And once it goes in, it
> > cannot be changed.
> >
> > It's very different from my world, where the absolutely minimal
> > interface was written to apply to a whole operating system plus 10,000+
> > applications, and then took months of testing before it was approved for
> > inclusion.  And if it was subtly wrong, we would be able to change it.
>
> No, it's this "feature" submission that is strange to think that we
> don't need that.  We do need, and will require, an actual working
> userspace something to use it, otherwise as you say, there's no way to
> actually know if it works properly or not and we can't change it once we
> accept it.
>
> So along those lines, Jeff, do you have a pointer to the Chrome patches,
> or glibc patches, that use this new interface that proves that it
> actually works?  Those would be great to see to at least verify it's
> been tested in a real-world situation and actually works for your use
> case.
>
The MAP_SEALABLE is raised because of other concerns not related to libc.

The patch Stephan developed was based on V1 of the patch, IIRC, which
is really ancient, and it is not based on MAP_SEALABLE, which is a
more recent development entirely from me.

I don't see unresolvable problems  with glibc though,  E.g. For the
elf case (binfmt_elf.c), there are two places I need to add
MAP_SEALABLE, then the memory  to user space is marked with sealable.
There might be cases where glibc needs to add MAP_SEALABLE it uses
mmap(FIXED) to split the memory.

If the decision of MAP_SELABLE depends on the glibc case being able to
use it, we can develop such a patch, but it will take a while, say a
few weeks to months, due to vacation, work load, etc.

Best Regards,
-Jeff

> thanks,
>
> greg k-h

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  3:24           ` Jeff Xu
@ 2024-02-02  3:29             ` Linus Torvalds
  2024-02-02  3:46               ` Jeff Xu
  2024-02-02 15:18             ` Greg KH
  1 sibling, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2024-02-02  3:29 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Greg KH, Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

On Thu, 1 Feb 2024 at 19:24, Jeff Xu <jeffxu@chromium.org> wrote:
>
> The patch Stephan developed was based on V1 of the patch, IIRC, which
> is really ancient, and it is not based on MAP_SEALABLE, which is a
> more recent development entirely from me.

So the problem with this whole patch series from the very beginning
was that it was very specialized, and COMPLETELY OVER-ENGINEERED.

It got simpler at one point. And then you started adding these
features that have absolutely no reason for them. Again.

It's frustrating. And it's not making it more likely to be ever merged.

               Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 2/4] mseal: add mseal syscall
  2024-02-01 23:11   ` Eric Biggers
@ 2024-02-02  3:30     ` Jeff Xu
  2024-02-02  3:54       ` Theo de Raadt
  0 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  3:30 UTC (permalink / raw)
  To: Eric Biggers
  Cc: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds,
	usama.anjum, rdunlap, jeffxu, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening, deraadt

On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
> > [PATCH v8 2/4] mseal: add mseal syscall
> [...]
> > +/*
> > + * The PROT_SEAL defines memory sealing in the prot argument of mmap().
> > + */
> > +#define PROT_SEAL    0x04000000      /* _BITUL(26) */
> > +
> >  /* 0x01 - 0x03 are defined in linux/mman.h */
> >  #define MAP_TYPE     0x0f            /* Mask for type of mapping */
> >  #define MAP_FIXED    0x10            /* Interpret addr exactly */
> > @@ -33,6 +38,9 @@
> >  #define MAP_UNINITIALIZED 0x4000000  /* For anonymous mmap, memory could be
> >                                        * uninitialized */
> >
> > +/* map is sealable */
> > +#define MAP_SEALABLE 0x8000000       /* _BITUL(27) */
>
> IMO this patch is misleading, as it claims to just be adding a new syscall, but
> it actually adds three new UAPIs, only one of which is the new syscall.  The
> other two new UAPIs are new flags to the mmap syscall.
>
The description does include all three. I could update the patch title.

> Based on recent discussions, it seems the usefulness of the new mmap flags has
> not yet been established.  Note also that there are only a limited number of
> mmap flags remaining, so we should be careful about allocating them.
>
> Therefore, why not start by just adding the mseal syscall, without the new mmap
> flags alongside it?
>
> I'll also note that the existing PROT_* flags seem to be conventionally used for
> the CPU page protections, as opposed to kernel-specific properties of the VMA
> object.  As such, PROT_SEAL feels a bit out of place anyway.  If it's added at
> all it perhaps should be a MAP_* flag, not PROT_*.  I'm not sure this aspect has
> been properly discussed yet, seeing as the patchset is presented as just adding
> sys_mseal().  Some reviewers may not have noticed or considered the new flags.
>
MAP_ flags is more used for type of mapping, such as MAP_FIXED_NOREPLACE.

The PROT_SEAL might make more sense because sealing the protection bit
is the main functionality of the sealing at this moment.

Thanks
-Jeff




> - Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  3:29             ` Linus Torvalds
@ 2024-02-02  3:46               ` Jeff Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  3:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 7:29 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 1 Feb 2024 at 19:24, Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > The patch Stephan developed was based on V1 of the patch, IIRC, which
> > is really ancient, and it is not based on MAP_SEALABLE, which is a
> > more recent development entirely from me.
>
> So the problem with this whole patch series from the very beginning
> was that it was very specialized, and COMPLETELY OVER-ENGINEERED.
>
> It got simpler at one point. And then you started adding these
> features that have absolutely no reason for them. Again.
>
> It's frustrating. And it's not making it more likely to be ever merged.
>
I'm sorry for over-thinking.
Remove the MAP_SEALABLE it is then.

Keep with mseal(addr,len,0) only ?

-Jeff
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 2/4] mseal: add mseal syscall
  2024-02-02  3:30     ` Jeff Xu
@ 2024-02-02  3:54       ` Theo de Raadt
  2024-02-02  4:03         ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02  3:54 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Eric Biggers, akpm, keescook, jannh, sroettger, willy, gregkh,
	torvalds, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

Jeff Xu <jeffxu@chromium.org> wrote:

> On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
> > > [PATCH v8 2/4] mseal: add mseal syscall
> > [...]
> > > +/*
> > > + * The PROT_SEAL defines memory sealing in the prot argument of mmap().
> > > + */
> > > +#define PROT_SEAL    0x04000000      /* _BITUL(26) */
> > > +
> > >  /* 0x01 - 0x03 are defined in linux/mman.h */
> > >  #define MAP_TYPE     0x0f            /* Mask for type of mapping */
> > >  #define MAP_FIXED    0x10            /* Interpret addr exactly */
> > > @@ -33,6 +38,9 @@
> > >  #define MAP_UNINITIALIZED 0x4000000  /* For anonymous mmap, memory could be
> > >                                        * uninitialized */
> > >
> > > +/* map is sealable */
> > > +#define MAP_SEALABLE 0x8000000       /* _BITUL(27) */
> >
> > IMO this patch is misleading, as it claims to just be adding a new syscall, but
> > it actually adds three new UAPIs, only one of which is the new syscall.  The
> > other two new UAPIs are new flags to the mmap syscall.
> >
> The description does include all three. I could update the patch title.
> 
> > Based on recent discussions, it seems the usefulness of the new mmap flags has
> > not yet been established.  Note also that there are only a limited number of
> > mmap flags remaining, so we should be careful about allocating them.
> >
> > Therefore, why not start by just adding the mseal syscall, without the new mmap
> > flags alongside it?
> >
> > I'll also note that the existing PROT_* flags seem to be conventionally used for
> > the CPU page protections, as opposed to kernel-specific properties of the VMA
> > object.  As such, PROT_SEAL feels a bit out of place anyway.  If it's added at
> > all it perhaps should be a MAP_* flag, not PROT_*.  I'm not sure this aspect has
> > been properly discussed yet, seeing as the patchset is presented as just adding
> > sys_mseal().  Some reviewers may not have noticed or considered the new flags.
> >
> MAP_ flags is more used for type of mapping, such as MAP_FIXED_NOREPLACE.
> 
> The PROT_SEAL might make more sense because sealing the protection bit
> is the main functionality of the sealing at this moment.

Jeff, please show a piece of software that needs to do PROT_SEAL as
mprotect() or mmap() argument.

Please don't write it as a vague essay.

Instead, take a piece of existing code, write a diff, and show your work.

Then explain that diff, justify why doing the PROT_SEAL as an argument
of mprotect() or mmap() is a required improvement, and show your Linux
developer peers that you can do computer science.

I did the same work in OpenBSD, at least 25% time over 2 years, and I
had to prove my work inside my development community.  I had to prove
that it worked system wide, not in 1 program, with hand-waving for the
rest.  If I had said "Looks, it works in ssh, trust me it works in other
programs", it would not have gone further.

glibc is the best example to demonstrate, but smaller examples might
convince.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 2/4] mseal: add mseal syscall
  2024-02-02  3:54       ` Theo de Raadt
@ 2024-02-02  4:03         ` Jeff Xu
  2024-02-02  4:10           ` Theo de Raadt
  0 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  4:03 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Eric Biggers, akpm, keescook, jannh, sroettger, willy, gregkh,
	torvalds, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 7:54 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@chromium.org> wrote:
>
> > On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > >
> > > On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
> > > > [PATCH v8 2/4] mseal: add mseal syscall
> > > [...]
> > > > +/*
> > > > + * The PROT_SEAL defines memory sealing in the prot argument of mmap().
> > > > + */
> > > > +#define PROT_SEAL    0x04000000      /* _BITUL(26) */
> > > > +
> > > >  /* 0x01 - 0x03 are defined in linux/mman.h */
> > > >  #define MAP_TYPE     0x0f            /* Mask for type of mapping */
> > > >  #define MAP_FIXED    0x10            /* Interpret addr exactly */
> > > > @@ -33,6 +38,9 @@
> > > >  #define MAP_UNINITIALIZED 0x4000000  /* For anonymous mmap, memory could be
> > > >                                        * uninitialized */
> > > >
> > > > +/* map is sealable */
> > > > +#define MAP_SEALABLE 0x8000000       /* _BITUL(27) */
> > >
> > > IMO this patch is misleading, as it claims to just be adding a new syscall, but
> > > it actually adds three new UAPIs, only one of which is the new syscall.  The
> > > other two new UAPIs are new flags to the mmap syscall.
> > >
> > The description does include all three. I could update the patch title.
> >
> > > Based on recent discussions, it seems the usefulness of the new mmap flags has
> > > not yet been established.  Note also that there are only a limited number of
> > > mmap flags remaining, so we should be careful about allocating them.
> > >
> > > Therefore, why not start by just adding the mseal syscall, without the new mmap
> > > flags alongside it?
> > >
> > > I'll also note that the existing PROT_* flags seem to be conventionally used for
> > > the CPU page protections, as opposed to kernel-specific properties of the VMA
> > > object.  As such, PROT_SEAL feels a bit out of place anyway.  If it's added at
> > > all it perhaps should be a MAP_* flag, not PROT_*.  I'm not sure this aspect has
> > > been properly discussed yet, seeing as the patchset is presented as just adding
> > > sys_mseal().  Some reviewers may not have noticed or considered the new flags.
> > >
> > MAP_ flags is more used for type of mapping, such as MAP_FIXED_NOREPLACE.
> >
> > The PROT_SEAL might make more sense because sealing the protection bit
> > is the main functionality of the sealing at this moment.
>
> Jeff, please show a piece of software that needs to do PROT_SEAL as
> mprotect() or mmap() argument.
>
I didn't propose mprotect().

for mmap() here is a potential use case:

fs/binfmt_elf.c
if (current->personality & MMAP_PAGE_ZERO) {
                /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
                   and some applications "depend" upon this behavior.
                   Since we do not have the power to recompile these, we
                   emulate the SVr4 behavior. Sigh. */

                error = vm_mmap(NULL, 0, PAGE_SIZE,
                                PROT_READ | PROT_EXEC,   <-- add PROT_SEAL
                                MAP_FIXED | MAP_PRIVATE, 0);
        }

I don't see the benefit of RWX page 0, which might make a null
pointers error to become executable for some code.


> Please don't write it as a vague essay.
>
> Instead, take a piece of existing code, write a diff, and show your work.
>
> Then explain that diff, justify why doing the PROT_SEAL as an argument
> of mprotect() or mmap() is a required improvement, and show your Linux
> developer peers that you can do computer science.
>
> I did the same work in OpenBSD, at least 25% time over 2 years, and I
> had to prove my work inside my development community.  I had to prove
> that it worked system wide, not in 1 program, with hand-waving for the
> rest.  If I had said "Looks, it works in ssh, trust me it works in other
> programs", it would not have gone further.
>
> glibc is the best example to demonstrate, but smaller examples might
> convince.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  3:20             ` Jeff Xu
@ 2024-02-02  4:05               ` Theo de Raadt
  2024-02-02  4:54                 ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02  4:05 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, Jeff Xu, Liam R. Howlett, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, usama.anjum, rdunlap,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

Jeff Xu <jeffxu@google.com> wrote:

> To me, the most important thing is to deliver a feature that's easy to
> use and works well. I don't want users to mess things up, so if I'm
> the one giving them the tools, I'm going to make sure they have all
> the information they need and that there are safeguards in place.
> 
> e.g. considering the following user case:
> 1> a security sensitive data is allocated from heap, using malloc,
> from the software component A, and filled with information.
> 2> software component B then uses mprotect to change it to RO, and
> seal it using mseal().

  p = malloc(80);
  mprotect(p & ~4095, 4096, PROT_NONE);
  free(p);

Will you save such a developer also?  No.

Since the same problem you describe already exists with mprotect() what
does mseal() even have to do with your proposal?

What about this?

  p = malloc(80);
  munmap(p & ~4095, 4096);
  free(p);

And since it is not sealed, how about madvise operations on a proper
non-malloc memory allocation?  Well, the process smashes it's own
memory.  And why is it not sealed?  You make it harder to seal memory!

How about this?

  p = malloc(80);
  bzero(p, 100000;

Yes it is a buffer overflow.  But this is all the same class of software
problem:

Memory belongs to processes, which belongs to the program, which is coded
by the programmer, who has to learn to be careful and handle the memory correctly.

mseal() / mimmutable() add *no new expectation* to a careful programmer,
because they expected to only use it on memory that they *promise will never
be de-allocated or re-permissioned*.

What you are proposing is not a "mitigation", it entirely cripples the
proposed subsystem because you are afraid of it; because you have cloned a
memory subsystem primitive you don't fully understand; and this is because
you've not seen a complete operating system using it.

When was the last time you developed outside of Chrome?

This is systems programming.  The kernel supports all the programs, not
just the one holy program from god.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 2/4] mseal: add mseal syscall
  2024-02-02  4:03         ` Jeff Xu
@ 2024-02-02  4:10           ` Theo de Raadt
  2024-02-02  4:22             ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02  4:10 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Eric Biggers, akpm, keescook, jannh, sroettger, willy, gregkh,
	torvalds, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

Jeff Xu <jeffxu@chromium.org> wrote:

> On Thu, Feb 1, 2024 at 7:54 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > > On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > >
> > > > On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
> > > > > [PATCH v8 2/4] mseal: add mseal syscall
> > > > [...]
> > > > > +/*
> > > > > + * The PROT_SEAL defines memory sealing in the prot argument of mmap().
> > > > > + */
> > > > > +#define PROT_SEAL    0x04000000      /* _BITUL(26) */
> > > > > +
> > > > >  /* 0x01 - 0x03 are defined in linux/mman.h */
> > > > >  #define MAP_TYPE     0x0f            /* Mask for type of mapping */
> > > > >  #define MAP_FIXED    0x10            /* Interpret addr exactly */
> > > > > @@ -33,6 +38,9 @@
> > > > >  #define MAP_UNINITIALIZED 0x4000000  /* For anonymous mmap, memory could be
> > > > >                                        * uninitialized */
> > > > >
> > > > > +/* map is sealable */
> > > > > +#define MAP_SEALABLE 0x8000000       /* _BITUL(27) */
> > > >
> > > > IMO this patch is misleading, as it claims to just be adding a new syscall, but
> > > > it actually adds three new UAPIs, only one of which is the new syscall.  The
> > > > other two new UAPIs are new flags to the mmap syscall.
> > > >
> > > The description does include all three. I could update the patch title.
> > >
> > > > Based on recent discussions, it seems the usefulness of the new mmap flags has
> > > > not yet been established.  Note also that there are only a limited number of
> > > > mmap flags remaining, so we should be careful about allocating them.
> > > >
> > > > Therefore, why not start by just adding the mseal syscall, without the new mmap
> > > > flags alongside it?
> > > >
> > > > I'll also note that the existing PROT_* flags seem to be conventionally used for
> > > > the CPU page protections, as opposed to kernel-specific properties of the VMA
> > > > object.  As such, PROT_SEAL feels a bit out of place anyway.  If it's added at
> > > > all it perhaps should be a MAP_* flag, not PROT_*.  I'm not sure this aspect has
> > > > been properly discussed yet, seeing as the patchset is presented as just adding
> > > > sys_mseal().  Some reviewers may not have noticed or considered the new flags.
> > > >
> > > MAP_ flags is more used for type of mapping, such as MAP_FIXED_NOREPLACE.
> > >
> > > The PROT_SEAL might make more sense because sealing the protection bit
> > > is the main functionality of the sealing at this moment.
> >
> > Jeff, please show a piece of software that needs to do PROT_SEAL as
> > mprotect() or mmap() argument.
> >
> I didn't propose mprotect().
> 
> for mmap() here is a potential use case:
> 
> fs/binfmt_elf.c
> if (current->personality & MMAP_PAGE_ZERO) {
>                 /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
>                    and some applications "depend" upon this behavior.
>                    Since we do not have the power to recompile these, we
>                    emulate the SVr4 behavior. Sigh. */
> 
>                 error = vm_mmap(NULL, 0, PAGE_SIZE,
>                                 PROT_READ | PROT_EXEC,   <-- add PROT_SEAL
>                                 MAP_FIXED | MAP_PRIVATE, 0);
>         }
> 
> I don't see the benefit of RWX page 0, which might make a null
> pointers error to become executable for some code.



And this is a lot faster than doing the operation as a second step?


But anyways, that's kernel code.  It is not userland exposed API used
by programs.

The question is the damage you create by adding API exposed to
userland (since this is Linux: forever).

I should be the first person thrilled to see Linux make API/ABI mistakes
they have to support forever, but I can't be that person.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 2/4] mseal: add mseal syscall
  2024-02-02  4:10           ` Theo de Raadt
@ 2024-02-02  4:22             ` Jeff Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  4:22 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Eric Biggers, akpm, keescook, jannh, sroettger, willy, gregkh,
	torvalds, usama.anjum, rdunlap, jeffxu, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 8:10 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@chromium.org> wrote:
>
> > On Thu, Feb 1, 2024 at 7:54 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> > >
> > > Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > > On Thu, Feb 1, 2024 at 3:11 PM Eric Biggers <ebiggers@kernel.org> wrote:
> > > > >
> > > > > On Wed, Jan 31, 2024 at 05:50:24PM +0000, jeffxu@chromium.org wrote:
> > > > > > [PATCH v8 2/4] mseal: add mseal syscall
> > > > > [...]
> > > > > > +/*
> > > > > > + * The PROT_SEAL defines memory sealing in the prot argument of mmap().
> > > > > > + */
> > > > > > +#define PROT_SEAL    0x04000000      /* _BITUL(26) */
> > > > > > +
> > > > > >  /* 0x01 - 0x03 are defined in linux/mman.h */
> > > > > >  #define MAP_TYPE     0x0f            /* Mask for type of mapping */
> > > > > >  #define MAP_FIXED    0x10            /* Interpret addr exactly */
> > > > > > @@ -33,6 +38,9 @@
> > > > > >  #define MAP_UNINITIALIZED 0x4000000  /* For anonymous mmap, memory could be
> > > > > >                                        * uninitialized */
> > > > > >
> > > > > > +/* map is sealable */
> > > > > > +#define MAP_SEALABLE 0x8000000       /* _BITUL(27) */
> > > > >
> > > > > IMO this patch is misleading, as it claims to just be adding a new syscall, but
> > > > > it actually adds three new UAPIs, only one of which is the new syscall.  The
> > > > > other two new UAPIs are new flags to the mmap syscall.
> > > > >
> > > > The description does include all three. I could update the patch title.
> > > >
> > > > > Based on recent discussions, it seems the usefulness of the new mmap flags has
> > > > > not yet been established.  Note also that there are only a limited number of
> > > > > mmap flags remaining, so we should be careful about allocating them.
> > > > >
> > > > > Therefore, why not start by just adding the mseal syscall, without the new mmap
> > > > > flags alongside it?
> > > > >
> > > > > I'll also note that the existing PROT_* flags seem to be conventionally used for
> > > > > the CPU page protections, as opposed to kernel-specific properties of the VMA
> > > > > object.  As such, PROT_SEAL feels a bit out of place anyway.  If it's added at
> > > > > all it perhaps should be a MAP_* flag, not PROT_*.  I'm not sure this aspect has
> > > > > been properly discussed yet, seeing as the patchset is presented as just adding
> > > > > sys_mseal().  Some reviewers may not have noticed or considered the new flags.
> > > > >
> > > > MAP_ flags is more used for type of mapping, such as MAP_FIXED_NOREPLACE.
> > > >
> > > > The PROT_SEAL might make more sense because sealing the protection bit
> > > > is the main functionality of the sealing at this moment.
> > >
> > > Jeff, please show a piece of software that needs to do PROT_SEAL as
> > > mprotect() or mmap() argument.
> > >
> > I didn't propose mprotect().
> >
> > for mmap() here is a potential use case:
> >
> > fs/binfmt_elf.c
> > if (current->personality & MMAP_PAGE_ZERO) {
> >                 /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
> >                    and some applications "depend" upon this behavior.
> >                    Since we do not have the power to recompile these, we
> >                    emulate the SVr4 behavior. Sigh. */
> >
> >                 error = vm_mmap(NULL, 0, PAGE_SIZE,
> >                                 PROT_READ | PROT_EXEC,   <-- add PROT_SEAL
> >                                 MAP_FIXED | MAP_PRIVATE, 0);
> >         }
> >
> > I don't see the benefit of RWX page 0, which might make a null
> > pointers error to become executable for some code.
>
>
>
> And this is a lot faster than doing the operation as a second step?
>
>
> But anyways, that's kernel code.  It is not userland exposed API used
> by programs.
>
> The question is the damage you create by adding API exposed to
> userland (since this is Linux: forever).
>
> I should be the first person thrilled to see Linux make API/ABI mistakes
> they have to support forever, but I can't be that person.
>
Point taken.
I can remove PROT_SEAL.

>
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  4:05               ` Theo de Raadt
@ 2024-02-02  4:54                 ` Jeff Xu
  2024-02-02  5:00                   ` Theo de Raadt
  0 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02  4:54 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Jeff Xu, Linus Torvalds, Liam R. Howlett, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, usama.anjum, rdunlap,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 8:05 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@google.com> wrote:
>
> > To me, the most important thing is to deliver a feature that's easy to
> > use and works well. I don't want users to mess things up, so if I'm
> > the one giving them the tools, I'm going to make sure they have all
> > the information they need and that there are safeguards in place.
> >
> > e.g. considering the following user case:
> > 1> a security sensitive data is allocated from heap, using malloc,
> > from the software component A, and filled with information.
> > 2> software component B then uses mprotect to change it to RO, and
> > seal it using mseal().
>
>   p = malloc(80);
>   mprotect(p & ~4095, 4096, PROT_NONE);
>   free(p);
>
> Will you save such a developer also?  No.
>
> Since the same problem you describe already exists with mprotect() what
> does mseal() even have to do with your proposal?
>
> What about this?
>
>   p = malloc(80);
>   munmap(p & ~4095, 4096);
>   free(p);
>
> And since it is not sealed, how about madvise operations on a proper
> non-malloc memory allocation?  Well, the process smashes it's own
> memory.  And why is it not sealed?  You make it harder to seal memory!
>
> How about this?
>
>   p = malloc(80);
>   bzero(p, 100000;
>
> Yes it is a buffer overflow.  But this is all the same class of software
> problem:
>
> Memory belongs to processes, which belongs to the program, which is coded
> by the programmer, who has to learn to be careful and handle the memory correctly.
>
> mseal() / mimmutable() add *no new expectation* to a careful programmer,
> because they expected to only use it on memory that they *promise will never
> be de-allocated or re-permissioned*.
>
> What you are proposing is not a "mitigation", it entirely cripples the
> proposed subsystem because you are afraid of it; because you have cloned a
> memory subsystem primitive you don't fully understand; and this is because
> you've not seen a complete operating system using it.
>
> When was the last time you developed outside of Chrome?
>
> This is systems programming.  The kernel supports all the programs, not
> just the one holy program from god.
>
Even without free.
I personally do not like the heap getting sealed like that.

Component A.
p=malloc(4096);
writing something to p.

Component B:
mprotect(p,4096, RO)
mseal(p,4096)

This will split the heap VMA, and prevent the heap from shrinking, if
this is in a frequent code path, then it might hurt the process's
memory usage.

The existing code is more likely to use malloc than mmap(), so it is
easier for dev to seal a piece of data belonging to another component.
I hope this pattern is not wide-spreading.

The ideal way will be just changing the library A to use mmap.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  4:54                 ` Jeff Xu
@ 2024-02-02  5:00                   ` Theo de Raadt
  2024-02-02 17:58                     ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02  5:00 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jeff Xu, Linus Torvalds, Liam R. Howlett, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, usama.anjum, rdunlap,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

Jeff Xu <jeffxu@chromium.org> wrote:

> Even without free.
> I personally do not like the heap getting sealed like that.
> 
> Component A.
> p=malloc(4096);
> writing something to p.
> 
> Component B:
> mprotect(p,4096, RO)
> mseal(p,4096)
> 
> This will split the heap VMA, and prevent the heap from shrinking, if
> this is in a frequent code path, then it might hurt the process's
> memory usage.
> 
> The existing code is more likely to use malloc than mmap(), so it is
> easier for dev to seal a piece of data belonging to another component.
> I hope this pattern is not wide-spreading.
> 
> The ideal way will be just changing the library A to use mmap.

I think you are lacking some test programs to see how it actually
behaves; the effect is worse than you think, and the impact is immediately
visible to the programmer, and the lesson is clear:

	you can only seal objects which you gaurantee never get recycled.

	Pushing a sealed object back into reuse is a disasterous bug.

	Noone should call this interface, unless they understand that.

I'll say again, you don't have a test program for various allocators to
understand how it behaves.  The failure modes described in your docuemnts
are not correct.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  3:14       ` Jeff Xu
@ 2024-02-02 15:13         ` Liam R. Howlett
  2024-02-02 17:24           ` Jeff Xu
  0 siblings, 1 reply; 50+ messages in thread
From: Liam R. Howlett @ 2024-02-02 15:13 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jeff Xu, Jonathan Corbet, akpm, keescook, jannh, sroettger,
	willy, gregkh, torvalds, usama.anjum, rdunlap, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

* Jeff Xu <jeffxu@google.com> [240201 22:15]:
> On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > > <Liam.Howlett@oracle.com> wrote:
> > > >
> >
> > Having to opt-in to allowing mseal will probably not work well.
> I'm leaving the opt-in discussion in Linus's thread.
> 
> > Initial library mappings happen in one huge chunk then it's cut up into
> > smaller VMAs, at least that's what I see with my maple tree tracing.  If
> > you opt-in, then the entire library will have to opt-in and so the
> > 'discourage inadvertent sealing' argument is not very strong.
> >
> Regarding "The initial library mappings happen in one huge chunk then
> it is cut up into smaller VMAS", this is not a problem.
> 
> As example of elf loading (fs/binfmt_elf.c), there is just a few
> places to pass in what type of memory to be allocated, e.g.
> MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can  add MAP_SEALABLE at those
> places.
> If glic does additional splitting on the memory range, by using
> mprotect(), then the MAP_SEALABLE is automatically applied after
> splitting.
> If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).

You are adding a flag that requires a new glibc.  When I try to point
out how this is unnecessary and excessive, you tell me it's fine and
probably not a whole lot of work.

This isn't working with developers, you are dismissing the developers
who are trying to help you.

Can you please:

Provide code that uses this feature.

Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
32 VMAs.

Provide code that tests and checks the failure paths.  Failures at the
start, middle, and end of the modifications.

Document what happens in those failure paths.

And, most importantly: keep an open mind and allow your opinion to
change when presented with new information.

All of these things are to help you.  We need to know what needs fixing
so you can be successful.


Thanks,
Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  3:24           ` Jeff Xu
  2024-02-02  3:29             ` Linus Torvalds
@ 2024-02-02 15:18             ` Greg KH
  1 sibling, 0 replies; 50+ messages in thread
From: Greg KH @ 2024-02-02 15:18 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, torvalds, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Thu, Feb 01, 2024 at 07:24:02PM -0800, Jeff Xu wrote:
> On Thu, Feb 1, 2024 at 5:06 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
> > > As an outsider, Linux development is really strange:
> > >
> > > Two sub-features are being pushed very hard, and the primary developer
> > > doesn't have code which uses either of them.  And once it goes in, it
> > > cannot be changed.
> > >
> > > It's very different from my world, where the absolutely minimal
> > > interface was written to apply to a whole operating system plus 10,000+
> > > applications, and then took months of testing before it was approved for
> > > inclusion.  And if it was subtly wrong, we would be able to change it.
> >
> > No, it's this "feature" submission that is strange to think that we
> > don't need that.  We do need, and will require, an actual working
> > userspace something to use it, otherwise as you say, there's no way to
> > actually know if it works properly or not and we can't change it once we
> > accept it.
> >
> > So along those lines, Jeff, do you have a pointer to the Chrome patches,
> > or glibc patches, that use this new interface that proves that it
> > actually works?  Those would be great to see to at least verify it's
> > been tested in a real-world situation and actually works for your use
> > case.
> >
> The MAP_SEALABLE is raised because of other concerns not related to libc.
> 
> The patch Stephan developed was based on V1 of the patch, IIRC, which
> is really ancient, and it is not based on MAP_SEALABLE, which is a
> more recent development entirely from me.
> 
> I don't see unresolvable problems  with glibc though,  E.g. For the
> elf case (binfmt_elf.c), there are two places I need to add
> MAP_SEALABLE, then the memory  to user space is marked with sealable.
> There might be cases where glibc needs to add MAP_SEALABLE it uses
> mmap(FIXED) to split the memory.
> 
> If the decision of MAP_SELABLE depends on the glibc case being able to
> use it, we can develop such a patch, but it will take a while, say a
> few weeks to months, due to vacation, work load, etc.

There's no rush here, and no deadlines in kernel development.  If you
don't have a working userspace user for your new feature(s), there is no
way we can accept the changes to the kernel (and hint, you don't want us
to either...)

good luck!

greg k-h

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-01 23:15           ` Linus Torvalds
                               ` (2 preceding siblings ...)
  2024-02-02  3:20             ` Jeff Xu
@ 2024-02-02 17:05             ` Theo de Raadt
  2024-02-02 21:02               ` Jeff Xu
  3 siblings, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02 17:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, Liam R. Howlett, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jeffxu, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

Another interaction to consider is sigaltstack().

In OpenBSD, sigaltstack() forces MAP_STACK onto the specified
(pre-allocated) region, because on kernel-entry we require the "sp"
register to point to a MAP_STACK region (this severely damages ROP pivot
methods).  Linux does not have MAP_STACK enforcement (yet), but one day
someone may try to do that work.

This interacted poorly with mimmutable() because some applications
allocate the memory being provided poorly.  I won't get into the details
unless pushed, because what we found makes me upset.  Over the years,
we've upstreamed diffs to applications to resolve all the nasty
allocation patterns.  I think the software ecosystem is now mostly
clean.

I suggest someone in Linux look into whether sigaltstack() is a mseal()
bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the
correct strategy.

This is our documented strategy:

     On OpenBSD some additional restrictions prevent dangerous address space
     modifications.  The proposed space at ss_sp is verified to be
     contiguously mapped for read-write permissions (no execute) and incapable
     of syscall entry (see msyscall(2)).  If those conditions are met, a page-
     aligned inner region will be freshly mapped (all zero) with MAP_STACK
     (see mmap(2)), destroying the pre-existing data in the region.  Once the
     sigaltstack is disabled, the MAP_STACK attribute remains on the memory,
     so it is best to deallocate the memory via a method that results in
     munmap(2).

OK, I better provide the details of what people were doing.
sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the
stack, we even found one creating a sigaltstack inside a buffer on a
pthread stack.  We told everyone to use mmap() and munmap(), with MAP_STACK
if #ifdef MAP_STACK finds a definition.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 15:13         ` Liam R. Howlett
@ 2024-02-02 17:24           ` Jeff Xu
  2024-02-02 19:21             ` Liam R. Howlett
  0 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02 17:24 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jeff Xu, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, torvalds, usama.anjum,
	rdunlap, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, pedro.falcato, dave.hansen, linux-hardening, deraadt

On Fri, Feb 2, 2024 at 7:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@google.com> [240201 22:15]:
> > On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > > > <Liam.Howlett@oracle.com> wrote:
> > > > >
> > >
> > > Having to opt-in to allowing mseal will probably not work well.
> > I'm leaving the opt-in discussion in Linus's thread.
> >
> > > Initial library mappings happen in one huge chunk then it's cut up into
> > > smaller VMAs, at least that's what I see with my maple tree tracing.  If
> > > you opt-in, then the entire library will have to opt-in and so the
> > > 'discourage inadvertent sealing' argument is not very strong.
> > >
> > Regarding "The initial library mappings happen in one huge chunk then
> > it is cut up into smaller VMAS", this is not a problem.
> >
> > As example of elf loading (fs/binfmt_elf.c), there is just a few
> > places to pass in what type of memory to be allocated, e.g.
> > MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can  add MAP_SEALABLE at those
> > places.
> > If glic does additional splitting on the memory range, by using
> > mprotect(), then the MAP_SEALABLE is automatically applied after
> > splitting.
> > If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
>
> You are adding a flag that requires a new glibc.  When I try to point
> out how this is unnecessary and excessive, you tell me it's fine and
> probably not a whole lot of work.
>
> This isn't working with developers, you are dismissing the developers
> who are trying to help you.
>
> Can you please:
>
> Provide code that uses this feature.
>
> Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
> 32 VMAs.
>
I will prepare for the benchmark tests.

> Provide code that tests and checks the failure paths.  Failures at the
> start, middle, and end of the modifications.
>
Regarding, "Failures at the start, middle, and end of the modifications."

With the current implementation, e.g. it checks if the sealing is
applied before actual modification of VMAs, so partial modifications
are avoided in mprotect, mremap, munmap.

There are test cases in the selftests to cover the failure path,
including the beginning, middle and end of VMAs.
test_seal_unmapped_start
test_seal_unmapped_middle
test_seal_unmapped_end
test_seal_invalid_input
test_seal_start_mprotect
test_seal_end_mprotect
etc.

Are those what you are looking for ?

> Document what happens in those failure paths.
>
> And, most importantly: keep an open mind and allow your opinion to
> change when presented with new information.
>
> All of these things are to help you.  We need to know what needs fixing
> so you can be successful.
>
Thanks for those feedbacks.

I sincerely wish for more of those help so this syscall can be useful.

Thanks.
Best Regards,
-Jeff

>
> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02  5:00                   ` Theo de Raadt
@ 2024-02-02 17:58                     ` Jeff Xu
  2024-02-02 18:51                       ` Pedro Falcato
  0 siblings, 1 reply; 50+ messages in thread
From: Jeff Xu @ 2024-02-02 17:58 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Jeff Xu, Linus Torvalds, Liam R. Howlett, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, usama.anjum, rdunlap,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@chromium.org> wrote:
>
> > Even without free.
> > I personally do not like the heap getting sealed like that.
> >
> > Component A.
> > p=malloc(4096);
> > writing something to p.
> >
> > Component B:
> > mprotect(p,4096, RO)
> > mseal(p,4096)
> >
> > This will split the heap VMA, and prevent the heap from shrinking, if
> > this is in a frequent code path, then it might hurt the process's
> > memory usage.
> >
> > The existing code is more likely to use malloc than mmap(), so it is
> > easier for dev to seal a piece of data belonging to another component.
> > I hope this pattern is not wide-spreading.
> >
> > The ideal way will be just changing the library A to use mmap.
>
> I think you are lacking some test programs to see how it actually
> behaves; the effect is worse than you think, and the impact is immediately
> visible to the programmer, and the lesson is clear:
>
>         you can only seal objects which you gaurantee never get recycled.
>
>         Pushing a sealed object back into reuse is a disasterous bug.
>
>         Noone should call this interface, unless they understand that.
>
> I'll say again, you don't have a test program for various allocators to
> understand how it behaves.  The failure modes described in your docuemnts
> are not correct.
>
I understand what you mean: I will add that part to the document:
Try to recycle a sealed memory is disastrous, e.g.
p=malloc(4096);
mprotect(p,4096,RO)
mseal(p,4096)
free(p);

My point is:
I think sealing an object from the heap is a bad pattern in general,
even dev doesn't free it. That was one of the reasons for the sealable
flag, I hope saying this doesn't be perceived as looking for excuses.

>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 17:58                     ` Jeff Xu
@ 2024-02-02 18:51                       ` Pedro Falcato
  2024-02-02 21:20                         ` Jeff Xu
  2024-02-04 19:39                         ` David Laight
  0 siblings, 2 replies; 50+ messages in thread
From: Pedro Falcato @ 2024-02-02 18:51 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Theo de Raadt, Jeff Xu, Linus Torvalds, Liam R. Howlett,
	Jonathan Corbet, akpm, keescook, jannh, sroettger, willy, gregkh,
	usama.anjum, rdunlap, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, dave.hansen, linux-hardening

On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > > Even without free.
> > > I personally do not like the heap getting sealed like that.
> > >
> > > Component A.
> > > p=malloc(4096);
> > > writing something to p.
> > >
> > > Compohave nent B:
> > > mprotect(p,4096, RO)
> > > mseal(p,4096)
> > >
> > > This will split the heap VMA, and prevent the heap from shrinking, if
> > > this is in a frequent code path, then it might hurt the process's
> > > memory usage.
> > >
> > > The existing code is more likely to use malloc than mmap(), so it is
> > > easier for dev to seal a piece of data belonging to another component.
> > > I hope this pattern is not wide-spreading.
> > >
> > > The ideal way will be just changing the library A to use mmap.
> >
> > I think you are lacking some test programs to see how it actually
> > behaves; the effect is worse than you think, and the impact is immediately
> > visible to the programmer, and the lesson is clear:
> >
> >         you can only seal objects which you gaurantee never get recycled.
> >
> >         Pushing a sealed object back into reuse is a disasterous bug.
> >
> >         Noone should call this interface, unless they understand that.
> >
> > I'll say again, you don't have a test program for various allocators to
> > understand how it behaves.  The failure modes described in your docuemnts
> > are not correct.
> >
> I understand what you mean: I will add that part to the document:
> Try to recycle a sealed memory is disastrous, e.g.
> p=malloc(4096);
> mprotect(p,4096,RO)
> mseal(p,4096)
> free(p);
>
> My point is:
> I think sealing an object from the heap is a bad pattern in general,
> even dev doesn't free it. That was one of the reasons for the sealable
> flag, I hope saying this doesn't be perceived as looking for excuses.

The point you're missing is that adding MAP_SEALABLE reduces
composability. With MAP_SEALABLE, everything that mmaps some part of
the address space that may ever be sealed will need to be modified to
know about MAP_SEALABLE.

Say you did the same thing for mprotect. MAP_PROTECT would control the
mprotectability of the map. You'd stop:

p = malloc(4096);
mprotect(p, 4096, PROT_READ);
free(p);

! But you'd need to change every spot that mmap()'s something to know
about and use MAP_PROTECT: all "producers" of mmap memory would need
to know about the consumers doing mprotect(). So now either all mmap()
callers mindlessly add MAP_PROTECT out of fear the consumers do
mprotect (and you gain nothing from MAP_PROTECT), or the mmap()
callers need to know the consumers call mprotect(), and thus you
introduce a huge layering violation (and you actually lose from having
MAP_PROTECT).

Hopefully you can map the above to MAP_SEALABLE. Or to any other m*()
operation. For example, if chrome runs on an older glibc that does not
know about MAP_SEALABLE, it will not be able to mseal() its own shared
libraries' .text (even if, yes, that should ideally be left to ld.so).

IMO, UNIX API design has historically mostly been "play stupid games,
win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
work. If you close stdout (and don't dup/reopen something to stdout)
and printf(), things will break, and you get to keep both pieces.
There's no O_CLOSEABLE, just as there's no O_DUPABLE.

-- 
Pedro

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 17:24           ` Jeff Xu
@ 2024-02-02 19:21             ` Liam R. Howlett
  2024-02-02 19:32               ` Theo de Raadt
  2024-02-02 20:14               ` Jeff Xu
  0 siblings, 2 replies; 50+ messages in thread
From: Liam R. Howlett @ 2024-02-02 19:21 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jeff Xu, Jonathan Corbet, akpm, keescook, jannh, sroettger,
	willy, gregkh, torvalds, usama.anjum, rdunlap, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

* Jeff Xu <jeffxu@chromium.org> [240202 12:24]:

...

> > Provide code that uses this feature.

Please do this too :)

> >
> > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
> > 32 VMAs.
> >
> I will prepare for the benchmark tests.

Thank you, please also include runs of calls that you are modifying for
checking for mseal() as we are adding loops there.

> 
> > Provide code that tests and checks the failure paths.  Failures at the
> > start, middle, and end of the modifications.
> >
> Regarding, "Failures at the start, middle, and end of the modifications."
> 
> With the current implementation, e.g. it checks if the sealing is
> applied before actual modification of VMAs, so partial modifications
> are avoided in mprotect, mremap, munmap.
> 
> There are test cases in the selftests to cover the failure path,
> including the beginning, middle and end of VMAs.
> test_seal_unmapped_start
> test_seal_unmapped_middle
> test_seal_unmapped_end
> test_seal_invalid_input
> test_seal_start_mprotect
> test_seal_end_mprotect
> etc.
> 
> Are those what you are looking for ?

Those are certainly good, but we need more checking in there.  You have
a seal_split test that splits the vma by mseal but you don't check the
flags on the VMAs.

What I'm more concerned about is what happens if you call mseal() on a
range and it can mseal a portion.  Like, what happens to the first vma
in your test_seal_unmapped_middle case?  I see it returns an error, but
is the first VMA mseal()'ed? (no it's not, but test that)

What about the other system calls that will be denied on an mseal() VMA?
Do they still behave the same?  do_mprotect_pkey() will break out of the
loop on the first error it sees - but it has modified some VMAs up to
that point, I believe?  You have changed this to abort before anything
is modified.  This is probably acceptable because it won't affect
existing applications unless they start using mseal(), but that's just
my opinion.

It would be good to state the change in behaviour because it is changing
the fundamental model of changing mprotect/madvise until an issue is
hit.  I think you are covering this by "it blocks X" but it's doing more
than, say, a flag verification.  One could reasonably assume this is
just another flag verification.

> 
> > Document what happens in those failure paths.

I'd like to know how this affects other system calls in the partial
success cases/return error cases.  Some will now return new error codes
and some may change the behaviour.

It may even be okay to allow munmap() to split VMAs at the start/end of
the region and fail to munmap because some VMA in the middle is
mseal()'ed - but maybe not?  I haven't put a whole lot of thought into
it.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 19:21             ` Liam R. Howlett
@ 2024-02-02 19:32               ` Theo de Raadt
  2024-02-02 20:36                 ` Linus Torvalds
  2024-02-02 20:14               ` Jeff Xu
  1 sibling, 1 reply; 50+ messages in thread
From: Theo de Raadt @ 2024-02-02 19:32 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jeff Xu, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, torvalds, usama.anjum,
	rdunlap, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, pedro.falcato, dave.hansen, linux-hardening

> What I'm more concerned about is what happens if you call mseal() on a
> range and it can mseal a portion.  Like, what happens to the first vma
> in your test_seal_unmapped_middle case?  I see it returns an error, but
> is the first VMA mseal()'ed? (no it's not, but test that)

That is correct, Liam.

Unix system calls must be atomic.

They either return an error, and that is a promise they made no changes.

Or they do the work required, and then return success.

In OpenBSD, all mimmutable() aspects were carefully studied to gaurantee
this behaviour.

I am not an expert in the Linux kernel to make the assessment; someone
who is qualified must make that assessment.  Fuzzing with tests is a good
way to judge it simpler.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 19:21             ` Liam R. Howlett
  2024-02-02 19:32               ` Theo de Raadt
@ 2024-02-02 20:14               ` Jeff Xu
  1 sibling, 0 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02 20:14 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jeff Xu, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, torvalds, usama.anjum,
	rdunlap, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, pedro.falcato, dave.hansen, linux-hardening, deraadt

On Fri, Feb 2, 2024 at 11:21 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [240202 12:24]:
>
> ...
>
> > > Provide code that uses this feature.
>
> Please do this too :)
>
Yes. Will do.


> > >
> > > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
> > > 32 VMAs.
> > >
> > I will prepare for the benchmark tests.
>
> Thank you, please also include runs of calls that you are modifying for
> checking for mseal() as we are adding loops there.
>
It will includes mmap/mremap/mprotect/munmap

> >
> > > Provide code that tests and checks the failure paths.  Failures at the
> > > start, middle, and end of the modifications.
> > >
> > Regarding, "Failures at the start, middle, and end of the modifications."
> >
> > With the current implementation, e.g. it checks if the sealing is
> > applied before actual modification of VMAs, so partial modifications
> > are avoided in mprotect, mremap, munmap.
> >
> > There are test cases in the selftests to cover the failure path,
> > including the beginning, middle and end of VMAs.
> > test_seal_unmapped_start
> > test_seal_unmapped_middle
> > test_seal_unmapped_end
> > test_seal_invalid_input
> > test_seal_start_mprotect
> > test_seal_end_mprotect
> > etc.
> >
> > Are those what you are looking for ?
>
> Those are certainly good, but we need more checking in there.  You have
> a seal_split test that splits the vma by mseal but you don't check the
> flags on the VMAs.
>
I can add the flag check.

> What I'm more concerned about is what happens if you call mseal() on a
> range and it can mseal a portion.  Like, what happens to the first vma
> in your test_seal_unmapped_middle case?  I see it returns an error, but
> is the first VMA mseal()'ed? (no it's not, but test that)
>
The first VMA is not sealed.
That was covered by test_seal_mprotect_two_vma_with_gap.

> What about the other system calls that will be denied on an mseal() VMA?
The other system call's behavior is kept as is, if the memory is not sealed.

> Do they still behave the same?  do_mprotect_pkey() will break out of the
> loop on the first error it sees - but it has modified some VMAs up to
> that point, I believe?
Yes. The description about do_mprotect_pkey() is correct.

> You have changed this to abort before anything
> is modified.  This is probably acceptable because it won't affect
> existing applications unless they start using mseal(), but that's just
> my opinion.
>
To make sure this, the test was written with sealing=false, those
tests are passed in the main (before applying my patch) to make sure
the test is correct.

> It would be good to state the change in behaviour because it is changing
> the fundamental model of changing mprotect/madvise until an issue is
> hit.  I think you are covering this by "it blocks X" but it's doing more
> than, say, a flag verification.  One could reasonably assume this is
> just another flag verification.
>
Will add more in documentation.

> >
> > > Document what happens in those failure paths.
>
> I'd like to know how this affects other system calls in the partial
> success cases/return error cases.  Some will now return new error codes
> and some may change the behaviour.
>
For the mapping that is not sealed, all remain unchanged, including
the error handling path.
For the mapping that is sealed, EPREM is returned if the sealing check
fails, and all of VMAs remain unchanged.

> It may even be okay to allow munmap() to split VMAs at the start/end of
> the region and fail to munmap because some VMA in the middle is
> mseal()'ed - but maybe not?  I haven't put a whole lot of thought into
> it.
If you are referring to something like below
[unmapped][map1][unmapped][map2][unmapped][map3][unmapped]
and map2 is sealed.

unmap(start of map1,end of map3) will fail.
mmap/mremap/unmap/mprotect on an address range that includes map2 will
fail with EPERM, with map1/map2/map3 unchanged.

Thanks
-Jeff

>
> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 19:32               ` Theo de Raadt
@ 2024-02-02 20:36                 ` Linus Torvalds
  2024-02-02 20:57                   ` Jeff Xu
  2024-02-02 21:18                   ` Liam R. Howlett
  0 siblings, 2 replies; 50+ messages in thread
From: Linus Torvalds @ 2024-02-02 20:36 UTC (permalink / raw)
  To: Liam R. Howlett, Jeff Xu, Jeff Xu, Jonathan Corbet, akpm,
	keescook, jannh, sroettger, willy, gregkh, torvalds, usama.anjum,
	rdunlap, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, pedro.falcato, dave.hansen, linux-hardening

On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Unix system calls must be atomic.
>
> They either return an error, and that is a promise they made no changes.

That's actually not true, and never has been.

It's a good thing to aim for, but several errors means "some or all
may have been done".

EFAULT (for various system calls), ENOMEM and other errors are all
things that can happen after some of the system call has already been
done, and the rest failed.

There are lots of examples, but to pick one obvious VM example,
something like mlock() may well return an error after the area has
been successfully locked, but then the population of said pages failed
for some reason.

Of course, implementations can differ, and POSIX sometimes has insane
language that is actively incorrect.

Furthermore, the definition of "atomic" is unclear. For example, POSIX
claims that a "write()" system call is one atomic thing for regular
files, and some people think that means that you see all or nothing.
That's simply not true, and you'll see the write progress in various
indirect ways (look at intermediate file size with 'stat', look at
intermediate contents with 'mmap' etc etc).

So I agree that atomicity is something that people should always
*strive* for, but it's not some kind of final truth or absolute
requirement.

In the specific case of mseal(), I suspect there are very few reasons
ever *not* to be atomic, so in this particular context atomicity is
likely always something that should be guaranteed. But I just wanted
to point out that it's most definitely not a black-and-white issue in
the general case.

             Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 20:36                 ` Linus Torvalds
@ 2024-02-02 20:57                   ` Jeff Xu
  2024-02-02 21:18                   ` Liam R. Howlett
  1 sibling, 0 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Liam R. Howlett, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

On Fri, Feb 2, 2024 at 12:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Unix system calls must be atomic.
> >
> > They either return an error, and that is a promise they made no changes.
>
> That's actually not true, and never has been.
>
> It's a good thing to aim for, but several errors means "some or all
> may have been done".
>
> EFAULT (for various system calls), ENOMEM and other errors are all
> things that can happen after some of the system call has already been
> done, and the rest failed.
>
> There are lots of examples, but to pick one obvious VM example,
> something like mlock() may well return an error after the area has
> been successfully locked, but then the population of said pages failed
> for some reason.
>
> Of course, implementations can differ, and POSIX sometimes has insane
> language that is actively incorrect.
>
> Furthermore, the definition of "atomic" is unclear. For example, POSIX
> claims that a "write()" system call is one atomic thing for regular
> files, and some people think that means that you see all or nothing.
> That's simply not true, and you'll see the write progress in various
> indirect ways (look at intermediate file size with 'stat', look at
> intermediate contents with 'mmap' etc etc).
>
> So I agree that atomicity is something that people should always
> *strive* for, but it's not some kind of final truth or absolute
> requirement.
>
> In the specific case of mseal(), I suspect there are very few reasons
> ever *not* to be atomic, so in this particular context atomicity is
> likely always something that should be guaranteed. But I just wanted
> to point out that it's most definitely not a black-and-white issue in
> the general case.
>
Thanks.
At least I got this part done right for mseal() :-)

-Jeff


>              Linus
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 17:05             ` Theo de Raadt
@ 2024-02-02 21:02               ` Jeff Xu
  0 siblings, 0 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02 21:02 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Linus Torvalds, Liam R. Howlett, Jonathan Corbet, akpm, keescook,
	jannh, sroettger, willy, gregkh, usama.anjum, rdunlap, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

On Fri, Feb 2, 2024 at 9:05 AM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Another interaction to consider is sigaltstack().
>
> In OpenBSD, sigaltstack() forces MAP_STACK onto the specified
> (pre-allocated) region, because on kernel-entry we require the "sp"
> register to point to a MAP_STACK region (this severely damages ROP pivot
> methods).  Linux does not have MAP_STACK enforcement (yet), but one day
> someone may try to do that work.
>
> This interacted poorly with mimmutable() because some applications
> allocate the memory being provided poorly.  I won't get into the details
> unless pushed, because what we found makes me upset.  Over the years,
> we've upstreamed diffs to applications to resolve all the nasty
> allocation patterns.  I think the software ecosystem is now mostly
> clean.
>
> I suggest someone in Linux look into whether sigaltstack() is a mseal()
> bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the
> correct strategy.
>

Thanks for bringing this up. I will follow up on sigaltstack() in Linux.

> This is our documented strategy:
>
>      On OpenBSD some additional restrictions prevent dangerous address space
>      modifications.  The proposed space at ss_sp is verified to be
>      contiguously mapped for read-write permissions (no execute) and incapable
>      of syscall entry (see msyscall(2)).  If those conditions are met, a page-
>      aligned inner region will be freshly mapped (all zero) with MAP_STACK
>      (see mmap(2)), destroying the pre-existing data in the region.  Once the
>      sigaltstack is disabled, the MAP_STACK attribute remains on the memory,
>      so it is best to deallocate the memory via a method that results in
>      munmap(2).
>
> OK, I better provide the details of what people were doing.
> sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the
> stack, we even found one creating a sigaltstack inside a buffer on a
> pthread stack.  We told everyone to use mmap() and munmap(), with MAP_STACK
> if #ifdef MAP_STACK finds a definition.
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 20:36                 ` Linus Torvalds
  2024-02-02 20:57                   ` Jeff Xu
@ 2024-02-02 21:18                   ` Liam R. Howlett
  2024-02-02 23:36                     ` Linus Torvalds
  1 sibling, 1 reply; 50+ messages in thread
From: Liam R. Howlett @ 2024-02-02 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

* Linus Torvalds <torvalds@linux-foundation.org> [240202 15:37]:
> On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Unix system calls must be atomic.
> >
> > They either return an error, and that is a promise they made no changes.
> 
> That's actually not true, and never has been.

...

> 
> In the specific case of mseal(), I suspect there are very few reasons
> ever *not* to be atomic, so in this particular context atomicity is
> likely always something that should be guaranteed. But I just wanted
> to point out that it's most definitely not a black-and-white issue in
> the general case.

There will be a larger performance cost to checking up front without
allowing the partial completion.  I don't expect these to be high, but
it's something to keep in mind if we are okay with the flexibility and
less atomic operation.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 18:51                       ` Pedro Falcato
@ 2024-02-02 21:20                         ` Jeff Xu
  2024-02-04 19:39                         ` David Laight
  1 sibling, 0 replies; 50+ messages in thread
From: Jeff Xu @ 2024-02-02 21:20 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Theo de Raadt, Jeff Xu, Linus Torvalds, Liam R. Howlett,
	Jonathan Corbet, akpm, keescook, jannh, sroettger, willy, gregkh,
	usama.anjum, rdunlap, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, dave.hansen, linux-hardening

On Fri, Feb 2, 2024 at 10:52 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> > >
> > > Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > > Even without free.
> > > > I personally do not like the heap getting sealed like that.
> > > >
> > > > Component A.
> > > > p=malloc(4096);
> > > > writing something to p.
> > > >
> > > > Compohave nent B:
> > > > mprotect(p,4096, RO)
> > > > mseal(p,4096)
> > > >
> > > > This will split the heap VMA, and prevent the heap from shrinking, if
> > > > this is in a frequent code path, then it might hurt the process's
> > > > memory usage.
> > > >
> > > > The existing code is more likely to use malloc than mmap(), so it is
> > > > easier for dev to seal a piece of data belonging to another component.
> > > > I hope this pattern is not wide-spreading.
> > > >
> > > > The ideal way will be just changing the library A to use mmap.
> > >
> > > I think you are lacking some test programs to see how it actually
> > > behaves; the effect is worse than you think, and the impact is immediately
> > > visible to the programmer, and the lesson is clear:
> > >
> > >         you can only seal objects which you gaurantee never get recycled.
> > >
> > >         Pushing a sealed object back into reuse is a disasterous bug.
> > >
> > >         Noone should call this interface, unless they understand that.
> > >
> > > I'll say again, you don't have a test program for various allocators to
> > > understand how it behaves.  The failure modes described in your docuemnts
> > > are not correct.
> > >
> > I understand what you mean: I will add that part to the document:
> > Try to recycle a sealed memory is disastrous, e.g.
> > p=malloc(4096);
> > mprotect(p,4096,RO)
> > mseal(p,4096)
> > free(p);
> >
> > My point is:
> > I think sealing an object from the heap is a bad pattern in general,
> > even dev doesn't free it. That was one of the reasons for the sealable
> > flag, I hope saying this doesn't be perceived as looking for excuses.
>
> The point you're missing is that adding MAP_SEALABLE reduces
> composability. With MAP_SEALABLE, everything that mmaps some part of
> the address space that may ever be sealed will need to be modified to
> know about MAP_SEALABLE.
>
> Say you did the same thing for mprotect. MAP_PROTECT would control the
> mprotectability of the map. You'd stop:
>
> p = malloc(4096);
> mprotect(p, 4096, PROT_READ);
> free(p);
>
> ! But you'd need to change every spot that mmap()'s something to know
> about and use MAP_PROTECT: all "producers" of mmap memory would need
> to know about the consumers doing mprotect(). So now either all mmap()
> callers mindlessly add MAP_PROTECT out of fear the consumers do
> mprotect (and you gain nothing from MAP_PROTECT), or the mmap()
> callers need to know the consumers call mprotect(), and thus you
> introduce a huge layering violation (and you actually lose from having
> MAP_PROTECT).
>
> Hopefully you can map the above to MAP_SEALABLE. Or to any other m*()
> operation. For example, if chrome runs on an older glibc that does not
> know about MAP_SEALABLE, it will not be able to mseal() its own shared
> libraries' .text (even if, yes, that should ideally be left to ld.so).
>
I think I have heard enough complaints about MAP_SEALABLE from Linux
developers and Linus in the last two days to convince myself that it
is a bad idea :)

For the last time, I was trying to limit the scope of mseal() limited
to two known cases. And MAP_SEALABLE is a reversible decision, a
system ctrl can turn it off, or we can obsolete it in future. (this
was mentioned in the document of V8).

I will rest my case. Obviously from the feedback,  it is loud and
clear that we want to be able to seal all the memory.

> IMO, UNIX API design has historically mostly been "play stupid games,
> win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
> work. If you close stdout (and don't dup/reopen something to stdout)
> and printf(), things will break, and you get to keep both pieces.
> There's no O_CLOSEABLE, just as there's no O_DUPABLE.
>
> --
> Pedro

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 21:18                   ` Liam R. Howlett
@ 2024-02-02 23:36                     ` Linus Torvalds
  2024-02-03  4:45                       ` Liam R. Howlett
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2024-02-02 23:36 UTC (permalink / raw)
  To: Liam R. Howlett, Linus Torvalds, Jeff Xu, Jeff Xu,
	Jonathan Corbet, akpm, keescook, jannh, sroettger, willy, gregkh,
	usama.anjum, rdunlap, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening

On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> There will be a larger performance cost to checking up front without
> allowing the partial completion.

I suspect that for mseal(), the only half-way common case will be
sealing an area that is entirely contained within one vma.

So the cost will be the vma splitting (if it's not the whole vma), and
very unlikely to be any kind of "walk the vma's to check that they can
all be sealed" loop up-front.

We'll see, but that's my gut feel, at least.

               Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-02 23:36                     ` Linus Torvalds
@ 2024-02-03  4:45                       ` Liam R. Howlett
  2024-02-05 22:13                         ` Suren Baghdasaryan
  0 siblings, 1 reply; 50+ messages in thread
From: Liam R. Howlett @ 2024-02-03  4:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, Jeff Xu, Jonathan Corbet, akpm, keescook, jannh,
	sroettger, willy, gregkh, usama.anjum, rdunlap, jorgelo, groeck,
	linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening

* Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]:
> On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > There will be a larger performance cost to checking up front without
> > allowing the partial completion.
> 
> I suspect that for mseal(), the only half-way common case will be
> sealing an area that is entirely contained within one vma.

Agreed.

> 
> So the cost will be the vma splitting (if it's not the whole vma), and
> very unlikely to be any kind of "walk the vma's to check that they can
> all be sealed" loop up-front.

That's the cost of calling mseal(), and I think that will be totally
reasonable.

I'm more concerned with the other calls that do affect more than one vma
that will now have to ensure there is not an mseal'ed vma among the
affected area.

As you pointed out, we don't do atomic updates and so we have to add a
loop at the beginning to check this new special case, which is what this
patch set does today.  That means we're going to be looping through
twice for any call that could fail if one is mseal'ed. This includes
munmap() and mprotect().

The impact will vary based on how many vma's are handled. I'd like some
numbers on this so we can see if it is a concern, which Jeff has agreed
to provide in the future - Thank you, Jeff.

It also means we're modifying the behaviour of those calls so they could
fail before anything changes (regardless of where the failure would
occur), and we could still fail later when another aspect of a vma would
cause a failure as we do today.  We are paying the price for a more
atomic update, but we aren't trying very hard to be atomic with our
updates - we don't have many (virtually no) vma checks before
modifications start.

For instance, we could move the mprotect check for map_deny_write_exec()
to the pre-update loop to make it more atomic in nature.  This one seems
somewhat related to mseal, so it would be better if they were both
checked atomic(ish) together.  Although, I wonder if the user visible
changes would be acceptable and worth the risk.

We will have two classes of updates to vma's: the more atomic view and
the legacy view.  The question of what happens when the two mix, or
where a specific check should go will get (more) confusing.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [PATCH v8 0/4] Introduce mseal
  2024-02-02 18:51                       ` Pedro Falcato
  2024-02-02 21:20                         ` Jeff Xu
@ 2024-02-04 19:39                         ` David Laight
  1 sibling, 0 replies; 50+ messages in thread
From: David Laight @ 2024-02-04 19:39 UTC (permalink / raw)
  To: 'Pedro Falcato', Jeff Xu
  Cc: Theo de Raadt, Jeff Xu, Linus Torvalds, Liam R. Howlett,
	Jonathan Corbet, akpm, keescook, jannh, sroettger, willy, gregkh,
	usama.anjum, rdunlap, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, dave.hansen, linux-hardening

...
> IMO, UNIX API design has historically mostly been "play stupid games,
> win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
> work. If you close stdout (and don't dup/reopen something to stdout)
> and printf(), things will break, and you get to keep both pieces.

That is pretty much why libraries must never use printf().
(Try telling that to people at work!)

In the days when processes could only have 20 files open
it was a much bigger problem.
You couldn't afford to not use 0, 1 and 2.
A certain daemon ended up using fd 1 as a pipe to another daemon.
Someone accidentally used printf() instead of fprintf() for a trace.
When the 10k stdio buffer filled the text got written to the pipe.
The expected fixed size message had a 32bit 'trailer' size.
Although no defined messages supported trailers the second daemon
synchronously discarded the trailer - with the expected side effect.

Wasn't my bug, and someone else found it, but I'd read the broken
code a few times without seeing the fubar.

Trouble is it all worked for quite a long time...

	David
 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 0/4] Introduce mseal
  2024-02-03  4:45                       ` Liam R. Howlett
@ 2024-02-05 22:13                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 50+ messages in thread
From: Suren Baghdasaryan @ 2024-02-05 22:13 UTC (permalink / raw)
  To: Liam R. Howlett, Linus Torvalds, Jeff Xu, Jeff Xu,
	Jonathan Corbet, akpm, keescook, jannh, sroettger, willy, gregkh,
	usama.anjum, rdunlap, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening

On Fri, Feb 2, 2024 at 8:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]:
> > On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > There will be a larger performance cost to checking up front without
> > > allowing the partial completion.
> >
> > I suspect that for mseal(), the only half-way common case will be
> > sealing an area that is entirely contained within one vma.
>
> Agreed.
>
> >
> > So the cost will be the vma splitting (if it's not the whole vma), and
> > very unlikely to be any kind of "walk the vma's to check that they can
> > all be sealed" loop up-front.
>
> That's the cost of calling mseal(), and I think that will be totally
> reasonable.
>
> I'm more concerned with the other calls that do affect more than one vma
> that will now have to ensure there is not an mseal'ed vma among the
> affected area.
>
> As you pointed out, we don't do atomic updates and so we have to add a
> loop at the beginning to check this new special case, which is what this
> patch set does today.  That means we're going to be looping through
> twice for any call that could fail if one is mseal'ed. This includes
> munmap() and mprotect().
>
> The impact will vary based on how many vma's are handled. I'd like some
> numbers on this so we can see if it is a concern, which Jeff has agreed
> to provide in the future - Thank you, Jeff.

Yes please. The additional walk Liam points to seems to be happening
even if we don't use mseal at all. Android apps often create thousands
of VMAs, so a small regression to a syscall like mprotect might cause
a very visible regression to app launch times (one of the key metrics
for Android). Having performance impact numbers here would be very
helpful.

>
> It also means we're modifying the behaviour of those calls so they could
> fail before anything changes (regardless of where the failure would
> occur), and we could still fail later when another aspect of a vma would
> cause a failure as we do today.  We are paying the price for a more
> atomic update, but we aren't trying very hard to be atomic with our
> updates - we don't have many (virtually no) vma checks before
> modifications start.
>
> For instance, we could move the mprotect check for map_deny_write_exec()
> to the pre-update loop to make it more atomic in nature.  This one seems
> somewhat related to mseal, so it would be better if they were both
> checked atomic(ish) together.  Although, I wonder if the user visible
> changes would be acceptable and worth the risk.
>
> We will have two classes of updates to vma's: the more atomic view and
> the legacy view.  The question of what happens when the two mix, or
> where a specific check should go will get (more) confusing.
>
> Thanks,
> Liam
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-02-05 22:13 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-31 17:50 [PATCH v8 0/4] Introduce mseal jeffxu
2024-01-31 17:50 ` [PATCH v8 1/4] mseal: Wire up mseal syscall jeffxu
2024-01-31 17:50 ` [PATCH v8 2/4] mseal: add " jeffxu
2024-02-01 23:11   ` Eric Biggers
2024-02-02  3:30     ` Jeff Xu
2024-02-02  3:54       ` Theo de Raadt
2024-02-02  4:03         ` Jeff Xu
2024-02-02  4:10           ` Theo de Raadt
2024-02-02  4:22             ` Jeff Xu
2024-01-31 17:50 ` [PATCH v8 3/4] selftest mm/mseal memory sealing jeffxu
2024-01-31 17:50 ` [PATCH v8 4/4] mseal:add documentation jeffxu
2024-01-31 19:34 ` [PATCH v8 0/4] Introduce mseal Liam R. Howlett
2024-02-01  1:27   ` Jeff Xu
2024-02-01  1:46     ` Theo de Raadt
2024-02-01 16:56       ` Bird, Tim
2024-02-01  1:55     ` Theo de Raadt
2024-02-01 20:45     ` Liam R. Howlett
2024-02-01 22:24       ` Theo de Raadt
2024-02-02  1:06         ` Greg KH
2024-02-02  3:24           ` Jeff Xu
2024-02-02  3:29             ` Linus Torvalds
2024-02-02  3:46               ` Jeff Xu
2024-02-02 15:18             ` Greg KH
2024-02-01 22:37       ` Jeff Xu
2024-02-01 22:54         ` Theo de Raadt
2024-02-01 23:15           ` Linus Torvalds
2024-02-01 23:43             ` Theo de Raadt
2024-02-02  0:26             ` Theo de Raadt
2024-02-02  3:20             ` Jeff Xu
2024-02-02  4:05               ` Theo de Raadt
2024-02-02  4:54                 ` Jeff Xu
2024-02-02  5:00                   ` Theo de Raadt
2024-02-02 17:58                     ` Jeff Xu
2024-02-02 18:51                       ` Pedro Falcato
2024-02-02 21:20                         ` Jeff Xu
2024-02-04 19:39                         ` David Laight
2024-02-02 17:05             ` Theo de Raadt
2024-02-02 21:02               ` Jeff Xu
2024-02-02  3:14       ` Jeff Xu
2024-02-02 15:13         ` Liam R. Howlett
2024-02-02 17:24           ` Jeff Xu
2024-02-02 19:21             ` Liam R. Howlett
2024-02-02 19:32               ` Theo de Raadt
2024-02-02 20:36                 ` Linus Torvalds
2024-02-02 20:57                   ` Jeff Xu
2024-02-02 21:18                   ` Liam R. Howlett
2024-02-02 23:36                     ` Linus Torvalds
2024-02-03  4:45                       ` Liam R. Howlett
2024-02-05 22:13                         ` Suren Baghdasaryan
2024-02-02 20:14               ` Jeff Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).