linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 00/11] Introduce mseal()
@ 2023-12-12 23:16 jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 01/11] mseal: Add mseal syscall jeffxu
                   ` (10 more replies)
  0 siblings, 11 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:16 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.

The new mseal() is an architecture independent syscall, and with
following signature:

mseal(void addr, size_t len, unsigned long types, unsigned long flags)

addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to documentation patch (mseal.rst) of this
patch set. Those are also fully covered by the selftest.

types: bit mask to specify the sealing types.
MM_SEAL_BASE
MM_SEAL_PROT_PKEY
MM_SEAL_DISCARD_RO_ANON
MM_SEAL_SEAL

The MM_SEAL_BASE:
The base package includes the features common to all VMA sealing
types. It prevents sealed VMAs from:
1> Unmapping, moving to another location, and shrinking the size, via
munmap() and mremap(), can leave an empty space, therefore can be
replaced with a VMA with a new set of attributes.
2> Move or expand a different vma into the current location, via mremap().
3> Modifying sealed VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed
VMA.

We consider the MM_SEAL_BASE feature, on which other sealing features
will depend. For instance, it probably does not make sense to seal
PROT_PKEY without sealing the BASE, and the kernel will implicitly add
SEAL_BASE for SEAL_PROT_PKEY.

The MM_SEAL_PROT_PKEY:
Seal PROT and PKEY of the address range, i.e. mprotect() and
pkey_mprotect() will be denied if the memory is sealed with
MM_SEAL_PROT_PKEY.

The MM_SEAL_DISCARD_RO_ANON
Certain types of madvise() operations are destructive [6], such as
MADV_DONTNEED, which can effectively alter region contents by
discarding pages, especially when memory is anonymous. This blocks
such operations for anonymous memory which is not writable to the
user.

The MM_SEAL_SEAL
MM_SEAL_SEAL denies adding a new seal for an VMA.
This is similar to F_SEAL_SEAL in fcntl.


The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable, the lifetime of those mappings
are tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings are not tied to the lifetime of the process, therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory. For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

--------------------------------------------------------------------
Change history:
===============
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap() (Detail see new discussions)
- Add documentation - mseal.rst

Work in progress:
=================
- update man page for mseal() and mmap()

Open discussions:
=================
Several open discussions from V1/V2 were not incorporated into V3. I
believe these are worth more discussion for future versions of this
patch series.

1> mseal() vs mimmutable()
mseal(bitmasks for multiple seal types)
	BASE + PROT_PKEY+ MM_SEAL_DISCARD_RO_ANON
        Apply PROT_PKEY implies BASE, same for DISCARD_RO_ANON

mimmutable() (openBSD)
        This is equal to SEAL_BASE + SEAL_PROT_PKEY in mseal()
        Plus allowing downgrade from W=>NW (OpenBSD)
        Doesn’t have MM_SEAL_DISCARD_RO_ANON

mimmutable() is designed for memory sealing in libc, and mseal()
is designed for both Chrome browser and libc.

For the two memory ranges that Chrome browser wants to seal, as
discussed previously, the allocator still needs to free or discard
memory for the sealed memory. For performance reasons, we have
explored two solutions in the past: first, using PKEY-based munmap()
[7], and second, separating SEAL_MPROTECT (v1 of this patch set).
Recently, we have experimented with an alternative approach that
allows us to remove the separation of SEAL_MPROTECT. For those two
memory ranges, Chrome browser will use BASE + PROT_PKEY +
DISCARD_RO_ANON for all its sealing needs.

In the case of libc, the .text segment can be sealed with the BASE and
PROT_PKEY, and the RO data segments can be sealed with the BASE +
PROT_PKEY + DISCARD_RO_ANON.

From a flexibility standpoint, separating BASE out could be beneficial
for future extensions of sealing features. For instance, applications
might desire downgradable "prot" permissions (X=>NX, W=>NW, R=>NR),
which would conflict with SEAL_PROT_PKEY.

The more sealing features integrated into a single sealing type, the
fewer applications can utilize these features. For example, some
applications might programmatically require DISCARD_RO_ANON memory,
which would render such VMA unsuitable for sealing.

I'd like to get the community's input on this. From Chrome's
perspective, the separation isn't as important anymore, at least in
the short term. However, I prefer the multiple bits approach because
it's more extensible in the long term.

2> mseal() vs mprotect() vs madvise() for setting the seal.

mprotect():
Using prot field, but prot supports unset. It's workable, i.e. let
applications carry the sealing type and set in all subsequent calls to
mprotect(), but it feels like this is an extra thing to care about.

madvise():
uses enum, multiple sealing types might require multiple roundtrips.

IMO: sealing is a major departure from other memory syscalls because
it takes away capabilities. The other memory APIs add behaviors or
change attributes, but sealing does the opposite: it reduces
capabilities. The name of the syscall, mseal(), can help emphasize the
"taking away" part.

My second choice would be madvise().

3> Other:

There is also a topic about ptrace/, /proc/self/mem, Userfaultfd,
which I think can be followed up using v1 thread, where it has the
most context.

New discussions topics:
=======================
During the development of V3, I had new questions and thoughts and
wished to discuss.

1> shm/aio
From reading the code, it seems to me that aio/shm can mmap/munmap
maps on behalf of userspace, e.g. ksys_shmdt() in shm.c. The lifetime
of those mapping are not tied to the lifetime of the process. If those
memories are sealed from userspace, then unmap will fail. This isn’t a
huge problem, since the memory will eventually be freed at exit or
exec. However, it feels like the solution is not complete, because of
the leaks in VMA address space during the lifetime of the process.
There could be two solutions to address this, which I will discuss
later.

2> Brk (heap/stack)
Currently, userspace applications can seal parts of the heap by
calling malloc() and mseal(). This raises the question of what the
expected behavior is when sealing the heap is attempted.

let's assume following calls from user space:

ptr = malloc(size);
mprotect(ptr, size, RO);
mseal(ptr, size, SEAL_PROT_PKEY);
free(ptr);

Technically, before mseal() is added, the user can change the
protection of the heap by calling mprotect(RO). As long as the user
changes the protection back to RW before free(), the memory can be
reused.

Adding mseal() into picture, however, the heap is then sealed
partially, user can still free it, but the memory remains to be RO,
and the result of brk-shrink is nondeterministic, depending on if
munmap() will try to free the sealed memory.(brk uses munmap to shrink
the heap).

3> Above two cases led to the third topic:
There are two options to address the problem mentioned above.
Option 1:  A “MAP_SEALABLE” flag in mmap().
If a map is created without this flag, the mseal() operation will
fail. Applications that are not concerned with sealing will expect
their behavior to be unchanged. For those that are concerned, adding a
flag at mmap time to opt in is not difficult. For the short term, this
solves problems 1 and 2 above. The memory in shm/aio/brk will not have
the MAP_SEALABLE flag at mmap(), and the same is true for the heap.

Option 2: Add MM_SEAL_SEAL during mmap()
It is possible to defensively set MM_SEAL_SEAL for the selected mappings at
creation time. Specifically, we can find all the mmaps that we do not want to
seal, and add the MM_SEAL_SEAL flag in the mmap() call. The difference
between MAP_SEALABLE and MM_SEAL_SEAL is that the first option starts from a
small size and incrementally increases, while the second option is the
opposite.

In my opinion, MAP_SEALABLE is the preferred option. Only a limited set of
mappings need to be sealed, and these are typically created by the runtime. For
the few dedicated applications that manage their own mappings, such as Chrome,
adding an extra flag at mmap() is not a difficult task. It is also a safer
option in terms of regression risk. This is the option included in this
version.

4>
I think it might be possible to seal the stack or other special
mappings created at runtime (vdso, vsyscall, vvar). This means we can
enforce and seal W^X for certain types of application. For instance,
the stack is typically used in read-write mode, but in some cases, it
can become executable. To defend against unintented addition of executable
bit to stack, we could let the application to seal it. 

Sealing the heap (for adding X) requires special handling, since the
heap can shrink, and shrink is implemented through munmap().

Indeed, it might be possible that all virtual memory accessible to user
space, regardless of its usage pattern, could be sealed. However, this
would require additional research and development work.

------------------------------------------------------------------------
v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/

v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/

Jeff Xu (11):
  mseal: Add mseal syscall.
  mseal: Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: add MM_SEAL_BASE
  mseal: add MM_SEAL_PROT_PKEY
  mseal: add sealing support for mmap
  mseal: make sealed VMA mergeable.
  mseal: add MM_SEAL_DISCARD_RO_ANON
  mseal: add MAP_SEALABLE to mmap()
  selftest mm/mseal memory sealing
  mseal:add documentation

 Documentation/userspace-api/mseal.rst       |  189 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/mips/kernel/vdso.c                     |   10 +-
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 fs/userfaultfd.c                            |    8 +-
 include/linux/mm.h                          |  178 +-
 include/linux/mm_types.h                    |    8 +
 include/linux/syscalls.h                    |    2 +
 include/uapi/asm-generic/mman-common.h      |   16 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 include/uapi/linux/mman.h                   |    5 +
 kernel/sys_ni.c                             |    1 +
 mm/Kconfig                                  |    9 +
 mm/Makefile                                 |    1 +
 mm/madvise.c                                |   14 +-
 mm/mempolicy.c                              |    2 +-
 mm/mlock.c                                  |    2 +-
 mm/mmap.c                                   |   77 +-
 mm/mprotect.c                               |   12 +-
 mm/mremap.c                                 |   44 +-
 mm/mseal.c                                  |  376 ++++
 tools/testing/selftests/mm/.gitignore       |    1 +
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/config           |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 2141 +++++++++++++++++++
 41 files changed, 3091 insertions(+), 32 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 01/11] mseal: Add mseal syscall.
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
@ 2023-12-12 23:16 ` jeffxu
  2023-12-13  7:24   ` Greg KH
  2023-12-12 23:16 ` [RFC PATCH v3 02/11] mseal: Wire up " jeffxu
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:16 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

The new mseal() is an architecture independent syscall, and with
following signature:

mseal(void addr, size_t len, unsigned long types, unsigned long flags)

addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also covered
by the selftest.

This CL adds three sealing types.
MM_SEAL_BASE
MM_SEAL_PROT_PKEY
MM_SEAL_SEAL

The MM_SEAL_BASE:
The base package includes the features common to all VMA sealing
types. It prevents sealed VMAs from:
1> Unmapping, moving to another location, and shrinking the size, via
munmap() and mremap(), can leave an empty space, therefore can be
replaced with a VMA with a new set of attributes.
2> Move or expand a different vma into the current location, via mremap().
3> Modifying sealed VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed
VMA.

We consider the MM_SEAL_BASE feature, on which other sealing features
will depend. For instance, it probably does not make sense to seal
PROT_PKEY without sealing the BASE, and the kernel will implicitly add
SEAL_BASE for SEAL_PROT_PKEY. (If the application wants to relax this
in future, we could use the “flags” field  in mseal() to overwrite
this the behavior of implicitly adding SEAL_BASE.)

The MM_SEAL_PROT_PKEY:
Seal PROT and PKEY of the address range, in other words, mprotect()
and pkey_mprotect() will be denied if the memory is sealed with
MM_SEAL_PROT_PKEY.

The MM_SEAL_SEAL
MM_SEAL_SEAL denies adding a new seal for an VMA.
The kernel will remember which seal types are applied, and the
application doesn’t need to repeat all existing seal types in the next
mseal().  Once a seal type is applied, it can’t be unsealed. Call
mseal() on an existing seal type is a no-action, not a failure.

Data structure:
Internally, the vm_area_struct adds a new field, vm_seals, to store
the bit masks. The vm_seals field is added because the existing
vm_flags field is full in 32-bit CPUs. The vm_seals field can be
merged into vm_flags in the future if the size of vm_flags is ever
expanded.

TODO: Sealed VMA won't merge with other VMA in this patch, merging
support will be added in later patch.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/mm.h        |  45 ++++++-
 include/linux/mm_types.h  |   7 ++
 include/linux/syscalls.h  |   2 +
 include/uapi/linux/mman.h |   4 +
 kernel/sys_ni.c           |   1 +
 mm/Kconfig                |   9 ++
 mm/Makefile               |   1 +
 mm/mmap.c                 |   3 +
 mm/mseal.c                | 257 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 mm/mseal.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 19fc73b02c9f..3d1120570de5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -30,6 +30,7 @@
 #include <linux/kasan.h>
 #include <linux/memremap.h>
 #include <linux/slab.h>
+#include <uapi/linux/mman.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -257,9 +258,17 @@ extern struct rw_semaphore nommu_region_sem;
 extern unsigned int kobjsize(const void *objp);
 #endif
 
+/*
+ * MM_SEAL_ALL is all supported flags in mseal().
+ */
+#define MM_SEAL_ALL ( \
+	MM_SEAL_SEAL | \
+	MM_SEAL_BASE | \
+	MM_SEAL_PROT_PKEY)
+
 /*
  * vm_flags in vm_area_struct, see mm_types.h.
- * When changing, update also include/trace/events/mmflags.h
+ * When changing, update also include/trace/events/mmflags.h.
  */
 #define VM_NONE		0x00000000
 
@@ -3308,6 +3317,40 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
 static inline void mm_populate(unsigned long addr, unsigned long len) {}
 #endif
 
+#ifdef CONFIG_MSEAL
+static inline bool check_vma_seals_mergeable(unsigned long vm_seals)
+{
+	/*
+	 * Set sealed VMA not mergeable with another VMA for now.
+	 * This will be changed in later commit to make sealed
+	 * VMA also mergeable.
+	 */
+	if (vm_seals & MM_SEAL_ALL)
+		return false;
+
+	return true;
+}
+
+/*
+ * return the valid sealing (after mask).
+ */
+static inline unsigned long vma_seals(struct vm_area_struct *vma)
+{
+	return (vma->vm_seals & MM_SEAL_ALL);
+}
+
+#else
+static inline bool check_vma_seals_mergeable(unsigned long vm_seals1)
+{
+	return true;
+}
+
+static inline unsigned long vma_seals(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif
+
 /* These take the mm semaphore themselves */
 extern int __must_check vm_brk(unsigned long, unsigned long);
 extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 589f31ef2e84..052799173c86 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -687,6 +687,13 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_MSEAL
+	/*
+	 * bit masks for seal.
+	 * need this since vm_flags is full.
+	 */
+	unsigned long vm_seals;		/* seal flags, see mm.h. */
+#endif
 } __randomize_layout;
 
 #ifdef CONFIG_SCHED_MM_CID
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0901af60d971..b1c766b74765 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -812,6 +812,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
 asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
 			unsigned long prot, unsigned long pgoff,
 			unsigned long flags);
+asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned long types,
+			  unsigned long flags);
 asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				unsigned long mode,
 				const unsigned long __user *nmask,
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index a246e11988d5..f561652886c4 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -55,4 +55,8 @@ struct cachestat {
 	__u64 nr_recently_evicted;
 };
 
+#define MM_SEAL_SEAL		_BITUL(0)
+#define MM_SEAL_BASE		_BITUL(1)
+#define MM_SEAL_PROT_PKEY	_BITUL(2)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 9db51ea373b0..716d64df522d 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -195,6 +195,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
 COND_SYSCALL(cachestat);
+COND_SYSCALL(mseal);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..63972d476d19 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1258,6 +1258,15 @@ config LOCK_MM_AND_FIND_VMA
 	bool
 	depends on !STACK_GROWSUP
 
+config MSEAL
+	default n
+	bool "Enable mseal() system call"
+	depends on MMU
+	help
+	  Enable the virtual memory sealing.
+	  This feature allows sealing each virtual memory area separately with
+	  multiple sealing types.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index ec65984e2ade..643d8518dac0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,6 +120,7 @@ obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
 obj-$(CONFIG_PAGE_TABLE_CHECK) += page_table_check.o
 obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
 obj-$(CONFIG_SECRETMEM) += secretmem.o
+obj-$(CONFIG_MSEAL) += mseal.o
 obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
diff --git a/mm/mmap.c b/mm/mmap.c
index 9e018d8dd7d6..42462c2a0c35 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -740,6 +740,9 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma,
 		return false;
 	if (!anon_vma_name_eq(anon_vma_name(vma), anon_name))
 		return false;
+	if (!check_vma_seals_mergeable(vma_seals(vma)))
+		return false;
+
 	return true;
 }
 
diff --git a/mm/mseal.c b/mm/mseal.c
new file mode 100644
index 000000000000..13bbe9ef5883
--- /dev/null
+++ b/mm/mseal.c
@@ -0,0 +1,257 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  Implement mseal() syscall.
+ *
+ *  Copyright (c) 2023 Google, Inc.
+ *
+ *  Author: Jeff Xu <jeffxu@chromium.org>
+ */
+
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+static bool can_do_mseal(unsigned long types, unsigned long flags)
+{
+	/* check types is a valid bitmap. */
+	if (types & ~MM_SEAL_ALL)
+		return false;
+
+	/* flags isn't used for now. */
+	if (flags)
+		return false;
+
+	return true;
+}
+
+/*
+ * Check if a seal type can be added to VMA.
+ */
+static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned long newSeals)
+{
+	/* When SEAL_MSEAL is set, reject if a new type of seal is added. */
+	if ((vma->vm_seals & MM_SEAL_SEAL) &&
+	    (newSeals & ~(vma_seals(vma))))
+		return false;
+
+	return true;
+}
+
+static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct vm_area_struct **prev, unsigned long start,
+		unsigned long end, unsigned long addtypes)
+{
+	int ret = 0;
+
+	if (addtypes & ~(vma_seals(vma))) {
+		/*
+		 * Handle split at start and end.
+		 * For now sealed VMA doesn't merge with other VMAs.
+		 * This will be updated in later commit to make
+		 * sealed VMA also mergeable.
+		 */
+		if (start != vma->vm_start) {
+			ret = split_vma(vmi, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (end != vma->vm_end) {
+			ret = split_vma(vmi, vma, end, 0);
+			if (ret)
+				goto out;
+		}
+
+		vma->vm_seals |= addtypes;
+	}
+
+out:
+	*prev = vma;
+	return ret;
+}
+
+/*
+ * Check for do_mseal:
+ * 1> start is part of a valid vma.
+ * 2> end is part of a valid vma.
+ * 3> No gap (unallocated address) between start and end.
+ * 4> requested seal type can be added in given address range.
+ */
+static int check_mm_seal(unsigned long start, unsigned long end,
+			 unsigned long newtypes)
+{
+	struct vm_area_struct *vma;
+	unsigned long nstart = start;
+
+	VMA_ITERATOR(vmi, current->mm, start);
+
+	/* going through each vma to check. */
+	for_each_vma_range(vmi, vma, end) {
+		if (vma->vm_start > nstart)
+			/* unallocated memory found. */
+			return -ENOMEM;
+
+		if (!can_add_vma_seals(vma, newtypes))
+			return -EACCES;
+
+		if (vma->vm_end >= end)
+			return 0;
+
+		nstart = vma->vm_end;
+	}
+
+	return -ENOMEM;
+}
+
+/*
+ * Apply sealing.
+ */
+static int apply_mm_seal(unsigned long start, unsigned long end,
+			 unsigned long newtypes)
+{
+	unsigned long nstart, nend;
+	struct vm_area_struct *vma, *prev = NULL;
+	struct vma_iterator vmi;
+	int error = 0;
+
+	vma_iter_init(&vmi, current->mm, start);
+	vma = vma_find(&vmi, end);
+
+	prev = vma_prev(&vmi);
+	if (start > vma->vm_start)
+		prev = vma;
+
+	nstart = start;
+
+	/* going through each vma to update. */
+	for_each_vma_range(vmi, vma, end) {
+		nend = vma->vm_end;
+		if (nend > end)
+			nend = end;
+
+		error = mseal_fixup(&vmi, vma, &prev, nstart, nend, newtypes);
+		if (error)
+			break;
+
+		nstart = vma->vm_end;
+	}
+
+	return error;
+}
+
+/*
+ * mseal(2) seals the VM's meta data from
+ * selected syscalls.
+ *
+ * addr/len: VM address range.
+ *
+ *  The address range by addr/len must meet:
+ *   start (addr) must be in a valid VMA.
+ *   end (addr + len) must be in a valid VMA.
+ *   no gap (unallocated memory) between start and end.
+ *   start (addr) must be page aligned.
+ *
+ *  len: len will be page aligned implicitly.
+ *
+ *  types: bit mask for sealed syscalls.
+ *   MM_SEAL_BASE: prevent VMA from:
+ *   1> Unmapping, moving to another location, and shrinking
+ *	the size, via munmap() and mremap(), can leave an empty
+ *	space, therefore can be replaced with a VMA with a new
+ *	set of attributes.
+ *   2> Move or expand a different vma into the current location,
+ *	via mremap().
+ *   3> Modifying sealed VMA via mmap(MAP_FIXED).
+ *   4> Size expansion, via mremap(), does not appear to pose any
+ *	specific risks to sealed VMAs. It is included anyway because
+ *	the use case is unclear. In any case, users can rely on
+ *	merging to expand a sealed VMA.
+ *
+ *   The MM_SEAL_PROT_PKEY:
+ *	Seal PROT and PKEY of the address range, in other words,
+ *	mprotect() and pkey_mprotect() will be denied if the memory is
+ *	sealed with MM_SEAL_PROT_PKEY.
+ *
+ *   The MM_SEAL_SEAL
+ *	MM_SEAL_SEAL denies adding a new seal for an VMA.
+ *
+ *	The kernel will remember which seal types are applied, and the
+ *	application doesn’t need to repeat all existing seal types in
+ *	the next mseal(). Once a seal type is applied, it can’t be
+ *	unsealed. Call mseal() on an existing seal type is a no-action,
+ *	not a failure.
+ *
+ *  flags: reserved.
+ *
+ * return values:
+ *  zero: success.
+ *  -EINVAL:
+ *   invalid seal type.
+ *   invalid input flags.
+ *   addr is not page aligned.
+ *   addr + len overflow.
+ *  -ENOMEM:
+ *   addr is not a valid address (not allocated).
+ *   end (addr + len) is not a valid address.
+ *   a gap (unallocated memory) between start and end.
+ *  -EACCES:
+ *   MM_SEAL_SEAL is set, adding a new seal is rejected.
+ *
+ * Note:
+ *  user can call mseal(2) multiple times to add new seal types.
+ *  adding an already added seal type is a no-action (no error).
+ *  adding a new seal type after MM_SEAL_SEAL will be rejected.
+ *  unseal() or removing a seal type is not supported.
+ */
+static int do_mseal(unsigned long start, size_t len_in, unsigned long types,
+		    unsigned long flags)
+{
+	int ret = 0;
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	size_t len;
+
+	/* MM_SEAL_BASE is set when other seal types are set. */
+	if (types & MM_SEAL_PROT_PKEY)
+		types |= MM_SEAL_BASE;
+
+	if (!can_do_mseal(types, flags))
+		return -EINVAL;
+
+	start = untagged_addr(start);
+	if (!PAGE_ALIGNED(start))
+		return -EINVAL;
+
+	len = PAGE_ALIGN(len_in);
+	/* Check to see whether len was rounded up from small -ve to zero. */
+	if (len_in && !len)
+		return -EINVAL;
+
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	if (mmap_write_lock_killable(mm))
+		return -EINTR;
+
+	ret = check_mm_seal(start, end, types);
+	if (ret)
+		goto out;
+
+	ret = apply_mm_seal(start, end, types);
+
+out:
+	mmap_write_unlock(current->mm);
+	return ret;
+}
+
+SYSCALL_DEFINE4(mseal, unsigned long, start, size_t, len, unsigned long, types, unsigned long,
+		flags)
+{
+	return do_mseal(start, len, types, flags);
+}
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 02/11] mseal: Wire up mseal syscall
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 01/11] mseal: Add mseal syscall jeffxu
@ 2023-12-12 23:16 ` jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 03/11] mseal: add can_modify_mm and can_modify_vma jeffxu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:16 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Wire up mseal syscall for all architectures.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 5 ++++-
 19 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index b68f1f56b836..4de33b969009 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@
 564	common	futex_wake			sys_futex_wake
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
+567	common  mseal				sys_mseal
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 93d0d46cbb15..dacea023bb88 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -469,3 +469,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal				sys_mseal
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 531effca5f1f..298313d2e0af 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		457
+#define __NR_compat_syscalls		458
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c453291154fd..015c80b14206 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -917,6 +917,8 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_mseal 457
+__SYSCALL(__NR_mseal, sys_mseal)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 81375ea78288..e8b40451693d 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -376,3 +376,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal 				sys_mseal
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f7f997a88bab..0da4a4dc1737 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -455,3 +455,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal				sys_mseal
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 2967ec26b978..ca8572222783 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 383abb1713f4..4fd33623b7e8 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -394,3 +394,4 @@
 454	n32	futex_wake			sys_futex_wake
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
+457	n32	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index c9bd09ba905f..aaa6382781e0 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -370,3 +370,4 @@
 454	n64	futex_wake			sys_futex_wake
 455	n64	futex_wait			sys_futex_wait
 456	n64	futex_requeue			sys_futex_requeue
+457	n64	mseal				sys_mseal
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index ba5ef6cea97a..bbdd6f151224 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -443,3 +443,4 @@
 454	o32	futex_wake			sys_futex_wake
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
+457	o32	mseal				sys_mseal
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 9f0f6df55361..8dda80555c7c 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -454,3 +454,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal				sys_mseal
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 26fc41904266..d0aa97a669bc 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -542,3 +542,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal				sys_mseal
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 31be90b241f7..228f100f8565 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -458,3 +458,4 @@
 454  common	futex_wake		sys_futex_wake			sys_futex_wake
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue			sys_futex_requeue
+457  common	mseal			sys_mseal			sys_mseal
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 4bc5d488ab17..cf08ea4a7539 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -458,3 +458,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal				sys_mseal
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8404c8e50394..30796f78bdc2 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -501,3 +501,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal 				sys_mseal
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 31c48bc2c3d8..c4163b904714 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -460,3 +460,4 @@
 454	i386	futex_wake		sys_futex_wake
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
+457	i386	mseal 			sys_mseal
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index a577bb27c16d..47fbc6ac0267 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457 	common  mseal			sys_mseal
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index dd71ecce8b86..fe5f562f6493 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -426,3 +426,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	mseal 				sys_mseal
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d9e9cd13e577..1678245d8a2b 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -829,8 +829,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 
+#define __NR_mseal 457
+__SYSCALL(__NR_mseal, sys_mseal)
+
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 03/11] mseal: add can_modify_mm and can_modify_vma
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 01/11] mseal: Add mseal syscall jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 02/11] mseal: Wire up " jeffxu
@ 2023-12-12 23:16 ` jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 04/11] mseal: add MM_SEAL_BASE jeffxu
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:16 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Two utilities to be used later.

can_modify_mm:
 checks sealing flags for given memory range.

can_modify_vma:
  checks sealing flags for given vma.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/mm.h | 18 ++++++++++++++++++
 mm/mseal.c         | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3d1120570de5..2435acc1f44f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3339,6 +3339,12 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma)
 	return (vma->vm_seals & MM_SEAL_ALL);
 }
 
+extern bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end, unsigned long checkSeals);
+
+extern bool can_modify_vma(struct vm_area_struct *vma,
+		unsigned long checkSeals);
+
 #else
 static inline bool check_vma_seals_mergeable(unsigned long vm_seals1)
 {
@@ -3349,6 +3355,18 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma)
 {
 	return 0;
 }
+
+static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start,
+		unsigned long end, unsigned long checkSeals)
+{
+	return true;
+}
+
+static inline bool can_modify_vma(struct vm_area_struct *vma,
+		unsigned long checkSeals)
+{
+	return true;
+}
 #endif
 
 /* These take the mm semaphore themselves */
diff --git a/mm/mseal.c b/mm/mseal.c
index 13bbe9ef5883..d12aa628ebdc 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -26,6 +26,44 @@ static bool can_do_mseal(unsigned long types, unsigned long flags)
 	return true;
 }
 
+/*
+ * check if a vma is sealed for modification.
+ * return true, if modification is allowed.
+ */
+bool can_modify_vma(struct vm_area_struct *vma,
+		    unsigned long checkSeals)
+{
+	if (checkSeals & vma_seals(vma))
+		return false;
+
+	return true;
+}
+
+/*
+ * Check if the vmas of a memory range are allowed to be modified.
+ * the memory ranger can have a gap (unallocated memory).
+ * return true, if it is allowed.
+ */
+bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end,
+		   unsigned long checkSeals)
+{
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, mm, start);
+
+	if (!checkSeals)
+		return true;
+
+	/* going through each vma to check. */
+	for_each_vma_range(vmi, vma, end) {
+		if (!can_modify_vma(vma, checkSeals))
+			return false;
+	}
+
+	/* Allow by default. */
+	return true;
+}
+
 /*
  * Check if a seal type can be added to VMA.
  */
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 04/11] mseal: add MM_SEAL_BASE
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (2 preceding siblings ...)
  2023-12-12 23:16 ` [RFC PATCH v3 03/11] mseal: add can_modify_mm and can_modify_vma jeffxu
@ 2023-12-12 23:16 ` jeffxu
  2023-12-12 23:16 ` [RFC PATCH v3 05/11] mseal: add MM_SEAL_PROT_PKEY jeffxu
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:16 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

The base package includes the features common to all VMA sealing
types. It prevents sealed VMAs from:
1> Unmapping, moving to another location, and shrinking the size, via
munmap() and mremap(), can leave an empty space, therefore can be
replaced with a VMA with a new set of attributes.
2> Move or expand a different vma into the current location, via mremap().
3> Modifying sealed VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed
VMA.

We consider the MM_SEAL_BASE feature, on which other sealing features
will depend. For instance, it probably does not make sense to seal
PROT_PKEY without sealing the BASE, and the kernel will implicitly add
SEAL_BASE for SEAL_PROT_PKEY. (If the application wants to relax this
in future, we could use the flags field in mseal() to overwrite
this the behavior of implicitly adding SEAL_BASE.)

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 mm/mmap.c   | 23 +++++++++++++++++++++++
 mm/mremap.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 42462c2a0c35..dbc557bd460c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1259,6 +1259,13 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			return -EEXIST;
 	}
 
+	/*
+	 * Check if the address range is sealed for do_mmap().
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, addr, addr + len, MM_SEAL_BASE))
+		return -EACCES;
+
 	if (prot == PROT_EXEC) {
 		pkey = execute_only_pkey(mm);
 		if (pkey < 0)
@@ -2632,6 +2639,14 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
+	/*
+	 * Check if memory is sealed before arch_unmap.
+	 * Prevent unmapping a sealed VMA.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, start, end, MM_SEAL_BASE))
+		return -EACCES;
+
 	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
@@ -3053,6 +3068,14 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
+	/*
+	 * Check if memory is sealed before arch_unmap.
+	 * Prevent unmapping a sealed VMA.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, start, end, MM_SEAL_BASE))
+		return -EACCES;
+
 	arch_unmap(mm, start, end);
 	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
 }
diff --git a/mm/mremap.c b/mm/mremap.c
index 382e81c33fc4..ff7429bfbbe1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -835,7 +835,35 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
 		return -ENOMEM;
 
+	/*
+	 * In mremap_to() which moves a VMA to another address.
+	 * Check if src address is sealed, if so, reject.
+	 * In other words, prevent a sealed VMA being moved to
+	 * another address.
+	 *
+	 * Place can_modify_mm here because mremap_to()
+	 * does its own checking for address range, and we only
+	 * check the sealing after passing those checks.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(mm, addr, addr + old_len, MM_SEAL_BASE))
+		return -EACCES;
+
 	if (flags & MREMAP_FIXED) {
+		/*
+		 * In mremap_to() which moves a VMA to another address.
+		 * Check if dst address is sealed, if so, reject.
+		 * In other words, prevent moving a vma to a sealed VMA.
+		 *
+		 * Place can_modify_mm here because mremap_to() does its
+		 * own checking for address, and we only check the sealing
+		 * after passing those checks.
+		 * can_modify_mm assumes we have acquired the lock on MM.
+		 */
+		if (!can_modify_mm(mm, new_addr, new_addr + new_len,
+				   MM_SEAL_BASE))
+			return -EACCES;
+
 		ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
 		if (ret)
 			goto out;
@@ -994,6 +1022,20 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		goto out;
 	}
 
+	/*
+	 * This is shrink/expand case (not mremap_to())
+	 * Check if src address is sealed, if so, reject.
+	 * In other words, prevent shrinking or expanding a sealed VMA.
+	 *
+	 * Place can_modify_mm here so we can keep the logic related to
+	 * shrink/expand together. Perhaps we can extract below to be its
+	 * own function in future.
+	 */
+	if (!can_modify_mm(mm, addr, addr + old_len, MM_SEAL_BASE)) {
+		ret = -EACCES;
+		goto out;
+	}
+
 	/*
 	 * Always allow a shrinking remap: that just unmaps
 	 * the unnecessary pages..
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 05/11] mseal: add MM_SEAL_PROT_PKEY
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (3 preceding siblings ...)
  2023-12-12 23:16 ` [RFC PATCH v3 04/11] mseal: add MM_SEAL_BASE jeffxu
@ 2023-12-12 23:16 ` jeffxu
  2023-12-12 23:17 ` [RFC PATCH v3 06/11] mseal: add sealing support for mmap jeffxu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:16 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Seal PROT and PKEY of the address range, in other words, mprotect()
and pkey_mprotect() will be denied if the memory is sealed with
MM_SEAL_PROT_PKEY.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 mm/mprotect.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index b94fbb45d5c7..1527188b1e92 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,6 +32,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/memory-tiers.h>
+#include <uapi/linux/mman.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -753,6 +754,15 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		}
 	}
 
+	/*
+	 * checking if PROT and PKEY is sealed.
+	 * can_modify_mm assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm(current->mm, start, end, MM_SEAL_PROT_PKEY)) {
+		error = -EACCES;
+		goto out;
+	}
+
 	prev = vma_prev(&vmi);
 	if (start > vma->vm_start)
 		prev = vma;
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 06/11] mseal: add sealing support for mmap
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (4 preceding siblings ...)
  2023-12-12 23:16 ` [RFC PATCH v3 05/11] mseal: add MM_SEAL_PROT_PKEY jeffxu
@ 2023-12-12 23:17 ` jeffxu
  2023-12-12 23:17 ` [RFC PATCH v3 07/11] mseal: make sealed VMA mergeable jeffxu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:17 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Allow mmap() to set the sealing type when creating a mapping. This is
useful for optimization because it avoids having to make two system
calls: one for mmap() and one for mseal(). With this change, mmap()
can take an input that specifies the sealing type, so only one system
call is needed.

This patch uses the "prot" field of mmap() to set the sealing.
Three sealing types are added to match with MM_SEAL_xyz in mseal().
PROT_SEAL_SEAL
PROT_SEAL_BASE
PROT_SEAL_PROT_PKEY

We also thought about using MAP_SEAL_xyz, which is a field in the
mmap() function called "flags". However, this field is more about the
type of mapping, such as MAP_FIXED_NOREPLACE. The "prot" field seems
more appropriate for our case.

It's worth noting that even though the sealing type is set via the
"prot" field in mmap(), we don't require it to be set in the "prot"
field in later mprotect() call. This is unlike the PROT_READ,
PROT_WRITE, PROT_EXEC bits, e.g. if PROT_WRITE is not set in
mprotect(), it means that the region is not writable. In other words,
if you set PROT_SEAL_PROT_PKEY in mmap(), you don't need to set it in
mprotect(). In fact, with the current approach, mseal() is used to set
sealing on existing VMA.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Suggested-by: Pedro Falcato <pedro.falcato@gmail.com>
---
 arch/mips/kernel/vdso.c                | 10 +++-
 include/linux/mm.h                     | 63 +++++++++++++++++++++++++-
 include/uapi/asm-generic/mman-common.h | 13 ++++++
 mm/mmap.c                              | 25 ++++++++--
 4 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index f6d40e43f108..6d1103d36af1 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -98,11 +98,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		return -EINTR;
 
 	if (IS_ENABLED(CONFIG_MIPS_FP_SUPPORT)) {
-		/* Map delay slot emulation page */
+		/*
+		 * Map delay slot emulation page.
+		 *
+		 * Note: passing vm_seals = 0
+		 * Don't support sealing for vdso for now.
+		 * This might change when we add sealing support for vdso.
+		 */
 		base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
 				VM_READ | VM_EXEC |
 				VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
-				0, NULL);
+				0, NULL, 0);
 		if (IS_ERR_VALUE(base)) {
 			ret = base;
 			goto out;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2435acc1f44f..5d3ee79f1438 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -266,6 +266,15 @@ extern unsigned int kobjsize(const void *objp);
 	MM_SEAL_BASE | \
 	MM_SEAL_PROT_PKEY)
 
+/*
+ * PROT_SEAL_ALL is all supported flags in mmap().
+ * See include/uapi/asm-generic/mman-common.h.
+ */
+#define PROT_SEAL_ALL ( \
+	PROT_SEAL_SEAL | \
+	PROT_SEAL_BASE | \
+	PROT_SEAL_PROT_PKEY)
+
 /*
  * vm_flags in vm_area_struct, see mm_types.h.
  * When changing, update also include/trace/events/mmflags.h.
@@ -3290,7 +3299,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-	struct list_head *uf);
+	struct list_head *uf, unsigned long vm_seals);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
@@ -3339,12 +3348,47 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma)
 	return (vma->vm_seals & MM_SEAL_ALL);
 }
 
+static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm_seals)
+{
+	vma->vm_seals |= vm_seals;
+}
+
 extern bool can_modify_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, unsigned long checkSeals);
 
 extern bool can_modify_vma(struct vm_area_struct *vma,
 		unsigned long checkSeals);
 
+/*
+ * Convert prot field of mmap to vm_seals type.
+ */
+static inline unsigned long convert_mmap_seals(unsigned long prot)
+{
+	unsigned long seals = 0;
+
+	/*
+	 * set SEAL_PROT_PKEY implies SEAL_BASE.
+	 */
+	if (prot & PROT_SEAL_PROT_PKEY)
+		prot |= PROT_SEAL_BASE;
+
+	/*
+	 * The seal bits start from bit 26 of the "prot" field of mmap.
+	 * see comments in include/uapi/asm-generic/mman-common.h.
+	 */
+	seals = (prot & PROT_SEAL_ALL) >> PROT_SEAL_BIT_BEGIN;
+	return seals;
+}
+
+/*
+ * check input sealing type from the "prot" field of mmap().
+ * for CONFIG_MSEAL case, this always return 0 (successful).
+ */
+static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals)
+{
+	*vm_seals = convert_mmap_seals(prot);
+	return 0;
+}
 #else
 static inline bool check_vma_seals_mergeable(unsigned long vm_seals1)
 {
@@ -3367,6 +3411,23 @@ static inline bool can_modify_vma(struct vm_area_struct *vma,
 {
 	return true;
 }
+
+static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm_seals)
+{
+}
+
+/*
+ * check input sealing type from the "prot" field of mmap().
+ * For not CONFIG_MSEAL, if SEAL flag is set, it will return failure.
+ */
+static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals)
+{
+	if (prot & PROT_SEAL_ALL)
+		return -EINVAL;
+
+	return 0;
+}
+
 #endif
 
 /* These take the mm semaphore themselves */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..f07ad9e70b3a 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -17,6 +17,19 @@
 #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
 #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
 
+/*
+ * The PROT_SEAL_XX defines memory sealings flags in the prot argument
+ * of mmap(). The bits currently take consecutive bits and match
+ * the same sequence as MM_SEAL_XX type, this allows convert_mmap_seals()
+ * to convert prot to MM_SEAL_XX type using bit operations.
+ * The include/uapi/linux/mman.h header file defines the MM_SEAL_XX type,
+ * which is used by the mseal() system call.
+ */
+#define PROT_SEAL_BIT_BEGIN	26
+#define PROT_SEAL_SEAL	_BITUL(PROT_SEAL_BIT_BEGIN)	/* 0x04000000 seal seal */
+#define PROT_SEAL_BASE	_BITUL(PROT_SEAL_BIT_BEGIN + 1)	/* 0x08000000 base for all sealing types */
+#define PROT_SEAL_PROT_PKEY	_BITUL(PROT_SEAL_BIT_BEGIN + 2)	/* 0x10000000 seal prot and pkey */
+
 /* 0x01 - 0x03 are defined in linux/mman.h */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
 #define MAP_FIXED	0x10		/* Interpret addr exactly */
diff --git a/mm/mmap.c b/mm/mmap.c
index dbc557bd460c..3e1bf5a131b0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1211,6 +1211,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
+	unsigned long vm_seals = 0;
 
 	*populate = 0;
 
@@ -1231,6 +1232,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (flags & MAP_FIXED_NOREPLACE)
 		flags |= MAP_FIXED;
 
+	if (check_mmap_seals(prot, &vm_seals) < 0)
+		return -EINVAL;
+
 	if (!(flags & MAP_FIXED))
 		addr = round_hint_to_min(addr);
 
@@ -1381,7 +1385,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
-	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, vm_seals);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -2679,7 +2683,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+		struct list_head *uf, unsigned long vm_seals)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
@@ -2723,7 +2727,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	next = vma_next(&vmi);
 	prev = vma_prev(&vmi);
-	if (vm_flags & VM_SPECIAL) {
+	/*
+	 * For now, sealed VMA doesn't merge with other VMA,
+	 * Will change this in later commit when we make sealed VMA
+	 * also mergeable.
+	 */
+	if ((vm_flags & VM_SPECIAL) ||
+		(vm_seals & MM_SEAL_ALL)) {
 		if (prev)
 			vma_iter_next_range(&vmi);
 		goto cannot_expand;
@@ -2781,6 +2791,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
 	vma->vm_pgoff = pgoff;
 
+	update_vma_seals(vma, vm_seals);
+
 	if (file) {
 		if (vm_flags & VM_SHARED) {
 			error = mapping_map_writable(file->f_mapping);
@@ -2992,6 +3004,13 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
 		return ret;
 
+	/*
+	 * Do not support sealing in remap_file_page.
+	 * sealing is set via mmap() and mseal().
+	 */
+	if (prot & PROT_SEAL_ALL)
+		return ret;
+
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 07/11] mseal: make sealed VMA mergeable.
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (5 preceding siblings ...)
  2023-12-12 23:17 ` [RFC PATCH v3 06/11] mseal: add sealing support for mmap jeffxu
@ 2023-12-12 23:17 ` jeffxu
  2023-12-12 23:17 ` [RFC PATCH v3 08/11] mseal: add MM_SEAL_DISCARD_RO_ANON jeffxu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:17 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Add merge/split handling for mlock/madvice/mprotect/mmap case.
Make sealed VMA mergeable with adjacent VMAs.

This is so that we don't run out of VMAs, i.e. there is a max
number of VMA per process.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Suggested-by: Jann Horn <jannh@google.com>
---
 fs/userfaultfd.c   |  8 +++++---
 include/linux/mm.h | 31 +++++++++++++------------------
 mm/madvise.c       |  2 +-
 mm/mempolicy.c     |  2 +-
 mm/mlock.c         |  2 +-
 mm/mmap.c          | 44 +++++++++++++++++++++-----------------------
 mm/mprotect.c      |  2 +-
 mm/mremap.c        |  2 +-
 mm/mseal.c         | 23 ++++++++++++++++++-----
 9 files changed, 62 insertions(+), 54 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 56eaae9dac1a..8ebee7c1c6cf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -926,7 +926,8 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 				 new_flags, vma->anon_vma,
 				 vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX, anon_vma_name(vma));
+				 NULL_VM_UFFD_CTX, anon_vma_name(vma),
+				vma_seals(vma));
 		if (prev) {
 			vma = prev;
 		} else {
@@ -1483,7 +1484,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 				 vma->anon_vma, vma->vm_file, pgoff,
 				 vma_policy(vma),
 				 ((struct vm_userfaultfd_ctx){ ctx }),
-				 anon_vma_name(vma));
+				 anon_vma_name(vma), vma_seals(vma));
 		if (prev) {
 			/* vma_merge() invalidated the mas */
 			vma = prev;
@@ -1668,7 +1669,8 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(&vmi, mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX, anon_vma_name(vma));
+				 NULL_VM_UFFD_CTX, anon_vma_name(vma),
+				vma_seals(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d3ee79f1438..1f162bb5b38d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3243,7 +3243,7 @@ extern struct vm_area_struct *vma_merge(struct vma_iterator *vmi,
 	struct mm_struct *, struct vm_area_struct *prev, unsigned long addr,
 	unsigned long end, unsigned long vm_flags, struct anon_vma *,
 	struct file *, pgoff_t, struct mempolicy *, struct vm_userfaultfd_ctx,
-	struct anon_vma_name *);
+	struct anon_vma_name *, unsigned long vm_seals);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *,
 		       unsigned long addr, int new_below);
@@ -3327,19 +3327,6 @@ static inline void mm_populate(unsigned long addr, unsigned long len) {}
 #endif
 
 #ifdef CONFIG_MSEAL
-static inline bool check_vma_seals_mergeable(unsigned long vm_seals)
-{
-	/*
-	 * Set sealed VMA not mergeable with another VMA for now.
-	 * This will be changed in later commit to make sealed
-	 * VMA also mergeable.
-	 */
-	if (vm_seals & MM_SEAL_ALL)
-		return false;
-
-	return true;
-}
-
 /*
  * return the valid sealing (after mask).
  */
@@ -3353,6 +3340,14 @@ static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm
 	vma->vm_seals |= vm_seals;
 }
 
+static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2)
+{
+	if ((vm_seals1 & MM_SEAL_ALL) != (vm_seals2 & MM_SEAL_ALL))
+		return false;
+
+	return true;
+}
+
 extern bool can_modify_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, unsigned long checkSeals);
 
@@ -3390,14 +3385,14 @@ static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals)
 	return 0;
 }
 #else
-static inline bool check_vma_seals_mergeable(unsigned long vm_seals1)
+static inline unsigned long vma_seals(struct vm_area_struct *vma)
 {
-	return true;
+	return 0;
 }
 
-static inline unsigned long vma_seals(struct vm_area_struct *vma)
+static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2)
 {
-	return 0;
+	return true;
 }
 
 static inline bool can_modify_mm(struct mm_struct *mm, unsigned long start,
diff --git a/mm/madvise.c b/mm/madvise.c
index 4dded5d27e7e..e2d219a4b6ef 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -152,7 +152,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(&vmi, mm, *prev, start, end, new_flags,
 			  vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx, anon_name);
+			  vma->vm_userfaultfd_ctx, anon_name, vma_seals(vma));
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e52e3a0b8f2e..e70b69c64564 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -836,7 +836,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
 	merged = vma_merge(vmi, vma->vm_mm, *prev, vmstart, vmend, vma->vm_flags,
 			 vma->anon_vma, vma->vm_file, pgoff, new_pol,
-			 vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+			 vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma));
 	if (merged) {
 		*prev = merged;
 		return vma_replace_policy(merged, new_pol);
diff --git a/mm/mlock.c b/mm/mlock.c
index 06bdfab83b58..b537a2cbd337 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -428,7 +428,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(vmi, mm, *prev, start, end, newflags,
 			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+			vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma));
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3e1bf5a131b0..6da8d83f2e66 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -720,7 +720,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 static inline bool is_mergeable_vma(struct vm_area_struct *vma,
 		struct file *file, unsigned long vm_flags,
 		struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
-		struct anon_vma_name *anon_name, bool may_remove_vma)
+		struct anon_vma_name *anon_name, bool may_remove_vma,
+		unsigned long vm_seals)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -740,7 +741,7 @@ static inline bool is_mergeable_vma(struct vm_area_struct *vma,
 		return false;
 	if (!anon_vma_name_eq(anon_vma_name(vma), anon_name))
 		return false;
-	if (!check_vma_seals_mergeable(vma_seals(vma)))
+	if (!check_vma_seals_mergeable(vma_seals(vma), vm_seals))
 		return false;
 
 	return true;
@@ -776,9 +777,10 @@ static bool
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
 		struct anon_vma *anon_vma, struct file *file,
 		pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
-		struct anon_vma_name *anon_name)
+		struct anon_vma_name *anon_name, unsigned long vm_seals)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, true) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx,
+		anon_name, true, vm_seals) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return true;
@@ -799,9 +801,10 @@ static bool
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 		struct anon_vma *anon_vma, struct file *file,
 		pgoff_t vm_pgoff, struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
-		struct anon_vma_name *anon_name)
+		struct anon_vma_name *anon_name, unsigned long vm_seals)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name, false) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx,
+		anon_name, false, vm_seals) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -869,7 +872,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 			struct anon_vma *anon_vma, struct file *file,
 			pgoff_t pgoff, struct mempolicy *policy,
 			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
-			struct anon_vma_name *anon_name)
+			struct anon_vma_name *anon_name, unsigned long vm_seals)
 {
 	struct vm_area_struct *curr, *next, *res;
 	struct vm_area_struct *vma, *adjust, *remove, *remove2;
@@ -908,7 +911,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 		/* Can we merge the predecessor? */
 		if (addr == prev->vm_end && mpol_equal(vma_policy(prev), policy)
 		    && can_vma_merge_after(prev, vm_flags, anon_vma, file,
-					   pgoff, vm_userfaultfd_ctx, anon_name)) {
+			pgoff, vm_userfaultfd_ctx, anon_name, vm_seals)) {
 			merge_prev = true;
 			vma_prev(vmi);
 		}
@@ -917,7 +920,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 	/* Can we merge the successor? */
 	if (next && mpol_equal(policy, vma_policy(next)) &&
 	    can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen,
-				 vm_userfaultfd_ctx, anon_name)) {
+			vm_userfaultfd_ctx, anon_name, vm_seals)) {
 		merge_next = true;
 	}
 
@@ -2727,13 +2730,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	next = vma_next(&vmi);
 	prev = vma_prev(&vmi);
-	/*
-	 * For now, sealed VMA doesn't merge with other VMA,
-	 * Will change this in later commit when we make sealed VMA
-	 * also mergeable.
-	 */
-	if ((vm_flags & VM_SPECIAL) ||
-		(vm_seals & MM_SEAL_ALL)) {
+
+	if (vm_flags & VM_SPECIAL) {
 		if (prev)
 			vma_iter_next_range(&vmi);
 		goto cannot_expand;
@@ -2743,7 +2741,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Check next */
 	if (next && next->vm_start == end && !vma_policy(next) &&
 	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
-				 NULL_VM_UFFD_CTX, NULL)) {
+			NULL_VM_UFFD_CTX, NULL, vm_seals)) {
 		merge_end = next->vm_end;
 		vma = next;
 		vm_pgoff = next->vm_pgoff - pglen;
@@ -2752,9 +2750,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Check prev */
 	if (prev && prev->vm_end == addr && !vma_policy(prev) &&
 	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
-				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
+			pgoff, vma->vm_userfaultfd_ctx, NULL, vm_seals) :
 		   can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
-				       NULL_VM_UFFD_CTX, NULL))) {
+			NULL_VM_UFFD_CTX, NULL, vm_seals))) {
 		merge_start = prev->vm_start;
 		vma = prev;
 		vm_pgoff = prev->vm_pgoff;
@@ -2822,7 +2820,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			merge = vma_merge(&vmi, mm, prev, vma->vm_start,
 				    vma->vm_end, vma->vm_flags, NULL,
 				    vma->vm_file, vma->vm_pgoff, NULL,
-				    NULL_VM_UFFD_CTX, NULL);
+				    NULL_VM_UFFD_CTX, NULL, vma_seals(vma));
 			if (merge) {
 				/*
 				 * ->mmap() can change vma->vm_file and fput
@@ -3130,14 +3128,14 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
-
 	/*
 	 * Expand the existing vma if possible; Note that singular lists do not
 	 * occur after forking, so the expand will only happen on new VMAs.
 	 */
 	if (vma && vma->vm_end == addr && !vma_policy(vma) &&
 	    can_vma_merge_after(vma, flags, NULL, NULL,
-				addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL)) {
+			addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL,
+			vma_seals(vma))) {
 		vma_iter_config(vmi, vma->vm_start, addr + len);
 		if (vma_iter_prealloc(vmi, vma))
 			goto unacct_fail;
@@ -3380,7 +3378,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 
 	new_vma = vma_merge(&vmi, mm, prev, addr, addr + len, vma->vm_flags,
 			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+			    vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1527188b1e92..a4c90e71607b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -632,7 +632,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(vmi, mm, *pprev, start, end, newflags,
 			   vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			   vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+			   vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma));
 	if (*pprev) {
 		vma = *pprev;
 		VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
diff --git a/mm/mremap.c b/mm/mremap.c
index ff7429bfbbe1..357efd6b48b9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1098,7 +1098,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 			vma = vma_merge(&vmi, mm, vma, extension_start,
 				extension_end, vma->vm_flags, vma->anon_vma,
 				vma->vm_file, extension_pgoff, vma_policy(vma),
-				vma->vm_userfaultfd_ctx, anon_vma_name(vma));
+				vma->vm_userfaultfd_ctx, anon_vma_name(vma), vma_seals(vma));
 			if (!vma) {
 				vm_unacct_memory(pages);
 				ret = -ENOMEM;
diff --git a/mm/mseal.c b/mm/mseal.c
index d12aa628ebdc..3b90dce7d20e 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -7,8 +7,10 @@
  *  Author: Jeff Xu <jeffxu@chromium.org>
  */
 
+#include <linux/mempolicy.h>
 #include <linux/mman.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include "internal.h"
@@ -81,14 +83,25 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct **prev, unsigned long start,
 		unsigned long end, unsigned long addtypes)
 {
+	pgoff_t pgoff;
 	int ret = 0;
+	unsigned long newtypes =  vma_seals(vma) | addtypes;
+
+	if (newtypes != vma_seals(vma)) {
+		/*
+		 * Attempt to merge with prev and next vma.
+		 */
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		*prev = vma_merge(vmi, vma->vm_mm, *prev, start, end, vma->vm_flags,
+				vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
+				vma->vm_userfaultfd_ctx, anon_vma_name(vma), newtypes);
+		if (*prev) {
+			vma = *prev;
+			goto out;
+		}
 
-	if (addtypes & ~(vma_seals(vma))) {
 		/*
 		 * Handle split at start and end.
-		 * For now sealed VMA doesn't merge with other VMAs.
-		 * This will be updated in later commit to make
-		 * sealed VMA also mergeable.
 		 */
 		if (start != vma->vm_start) {
 			ret = split_vma(vmi, vma, start, 1);
@@ -102,7 +115,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 				goto out;
 		}
 
-		vma->vm_seals |= addtypes;
+		vma->vm_seals = newtypes;
 	}
 
 out:
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 08/11] mseal: add MM_SEAL_DISCARD_RO_ANON
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (6 preceding siblings ...)
  2023-12-12 23:17 ` [RFC PATCH v3 07/11] mseal: make sealed VMA mergeable jeffxu
@ 2023-12-12 23:17 ` jeffxu
  2023-12-12 23:17 ` [RFC PATCH v3 09/11] mseal: add MAP_SEALABLE to mmap() jeffxu
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:17 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Certain types of madvise() operations are destructive, such as
MADV_DONTNEED, which can effectively alter region contents by
discarding pages, especially when memory is anonymous. This blocks
such operations for anonymous memory which is not writable to the
user.

The MM_SEAL_DISCARD_RO_ANON blocks such operations if users don't have
access to the memory, and the memory is anonymous memory.

We do not think such sealing is useful for file-backed mapping because
it should repopulate the memory contents from the underlying mapped
file. We also do not think it is useful if the user can write to the
memory because then the attacker can also write.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: Stephen Röttger <sroettger@google.com>
---
 include/linux/mm.h                     | 19 +++++--
 include/uapi/asm-generic/mman-common.h |  2 +
 include/uapi/linux/mman.h              |  1 +
 mm/madvise.c                           | 12 +++++
 mm/mseal.c                             | 73 ++++++++++++++++++++++++--
 5 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f162bb5b38d..50dda474acc2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -264,7 +264,8 @@ extern unsigned int kobjsize(const void *objp);
 #define MM_SEAL_ALL ( \
 	MM_SEAL_SEAL | \
 	MM_SEAL_BASE | \
-	MM_SEAL_PROT_PKEY)
+	MM_SEAL_PROT_PKEY | \
+	MM_SEAL_DISCARD_RO_ANON)
 
 /*
  * PROT_SEAL_ALL is all supported flags in mmap().
@@ -273,7 +274,8 @@ extern unsigned int kobjsize(const void *objp);
 #define PROT_SEAL_ALL ( \
 	PROT_SEAL_SEAL | \
 	PROT_SEAL_BASE | \
-	PROT_SEAL_PROT_PKEY)
+	PROT_SEAL_PROT_PKEY | \
+	PROT_SEAL_DISCARD_RO_ANON)
 
 /*
  * vm_flags in vm_area_struct, see mm_types.h.
@@ -3354,6 +3356,9 @@ extern bool can_modify_mm(struct mm_struct *mm, unsigned long start,
 extern bool can_modify_vma(struct vm_area_struct *vma,
 		unsigned long checkSeals);
 
+extern bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior);
+
 /*
  * Convert prot field of mmap to vm_seals type.
  */
@@ -3362,9 +3367,9 @@ static inline unsigned long convert_mmap_seals(unsigned long prot)
 	unsigned long seals = 0;
 
 	/*
-	 * set SEAL_PROT_PKEY implies SEAL_BASE.
+	 * set SEAL_PROT_PKEY or SEAL_DISCARD_RO_ANON implies SEAL_BASE.
 	 */
-	if (prot & PROT_SEAL_PROT_PKEY)
+	if (prot & (PROT_SEAL_PROT_PKEY | PROT_SEAL_DISCARD_RO_ANON))
 		prot |= PROT_SEAL_BASE;
 
 	/*
@@ -3407,6 +3412,12 @@ static inline bool can_modify_vma(struct vm_area_struct *vma,
 	return true;
 }
 
+static inline bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
+		unsigned long end, int behavior)
+{
+	return true;
+}
+
 static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm_seals)
 {
 }
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f07ad9e70b3a..bf503962409a 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,8 @@
 #define PROT_SEAL_SEAL	_BITUL(PROT_SEAL_BIT_BEGIN)	/* 0x04000000 seal seal */
 #define PROT_SEAL_BASE	_BITUL(PROT_SEAL_BIT_BEGIN + 1)	/* 0x08000000 base for all sealing types */
 #define PROT_SEAL_PROT_PKEY	_BITUL(PROT_SEAL_BIT_BEGIN + 2)	/* 0x10000000 seal prot and pkey */
+/* seal destructive madvise for non-writeable anonymous memory. */
+#define PROT_SEAL_DISCARD_RO_ANON	_BITUL(PROT_SEAL_BIT_BEGIN + 3)	/* 0x20000000 */
 
 /* 0x01 - 0x03 are defined in linux/mman.h */
 #define MAP_TYPE	0x0f		/* Mask for type of mapping */
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f561652886c4..3872cc118c8a 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -58,5 +58,6 @@ struct cachestat {
 #define MM_SEAL_SEAL		_BITUL(0)
 #define MM_SEAL_BASE		_BITUL(1)
 #define MM_SEAL_PROT_PKEY	_BITUL(2)
+#define MM_SEAL_DISCARD_RO_ANON	_BITUL(3)
 
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index e2d219a4b6ef..ff038e323779 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1403,6 +1403,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  -EIO    - an I/O error occurred while paging in data.
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
+ *  -EACCES - memory is sealed.
  */
 int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
 {
@@ -1446,10 +1447,21 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	start = untagged_addr_remote(mm, start);
 	end = start + len;
 
+	/*
+	 * Check if the address range is sealed for do_madvise().
+	 * can_modify_mm_madv assumes we have acquired the lock on MM.
+	 */
+	if (!can_modify_mm_madv(mm, start, end, behavior)) {
+		error = -EACCES;
+		goto out;
+	}
+
 	blk_start_plug(&plug);
 	error = madvise_walk_vmas(mm, start, end, behavior,
 			madvise_vma_behavior);
 	blk_finish_plug(&plug);
+
+out:
 	if (write)
 		mmap_write_unlock(mm);
 	else
diff --git a/mm/mseal.c b/mm/mseal.c
index 3b90dce7d20e..294f48d33db6 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -11,6 +11,7 @@
 #include <linux/mman.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_context.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 #include "internal.h"
@@ -66,6 +67,55 @@ bool can_modify_mm(struct mm_struct *mm, unsigned long start, unsigned long end,
 	return true;
 }
 
+static bool is_madv_discard(int behavior)
+{
+	return	behavior &
+		(MADV_FREE | MADV_DONTNEED | MADV_DONTNEED_LOCKED |
+		 MADV_REMOVE | MADV_DONTFORK | MADV_WIPEONFORK);
+}
+
+static bool is_ro_anon(struct vm_area_struct *vma)
+{
+	/* check anonymous mapping. */
+	if (vma->vm_file || vma->vm_flags & VM_SHARED)
+		return false;
+
+	/*
+	 * check for non-writable:
+	 * PROT=RO or PKRU is not writeable.
+	 */
+	if (!(vma->vm_flags & VM_WRITE) ||
+		!arch_vma_access_permitted(vma, true, false, false))
+		return true;
+
+	return false;
+}
+
+/*
+ * Check if the vmas of a memory range are allowed to be modified by madvise.
+ * the memory ranger can have a gap (unallocated memory).
+ * return true, if it is allowed.
+ */
+bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long end,
+		int behavior)
+{
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, mm, start);
+
+	if (!is_madv_discard(behavior))
+		return true;
+
+	/* going through each vma to check. */
+	for_each_vma_range(vmi, vma, end)
+		if (is_ro_anon(vma) && !can_modify_vma(
+			vma, MM_SEAL_DISCARD_RO_ANON))
+			return false;
+
+	/* Allow by default. */
+	return true;
+}
+
 /*
  * Check if a seal type can be added to VMA.
  */
@@ -76,6 +126,12 @@ static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned long newSeals
 	    (newSeals & ~(vma_seals(vma))))
 		return false;
 
+	/*
+	 * For simplicity, we allow adding all sealing types during mmap or mseal.
+	 * The actual sealing check will happen later during particular action.
+	 * E.g. For MM_SEAL_DISCARD_RO_ANON, we always allow adding it, at the
+	 * time madvice() call, we will check if the sealing condition isn't met.
+	 */
 	return true;
 }
 
@@ -225,15 +281,22 @@ static int apply_mm_seal(unsigned long start, unsigned long end,
  *	mprotect() and pkey_mprotect() will be denied if the memory is
  *	sealed with MM_SEAL_PROT_PKEY.
  *
- *   The MM_SEAL_SEAL
- *	MM_SEAL_SEAL denies adding a new seal for an VMA.
- *
  *	The kernel will remember which seal types are applied, and the
  *	application doesn’t need to repeat all existing seal types in
  *	the next mseal(). Once a seal type is applied, it can’t be
  *	unsealed. Call mseal() on an existing seal type is a no-action,
  *	not a failure.
  *
+ *   MM_SEAL_DISCARD_RO_ANON: block some destructive madvice()
+ *	behavior, such as MADV_DONTNEED, which can effectively
+ *	alter gegion contents by discarding pages, block such
+ *	operation if users don't have write access to the memory, and
+ *	the memory is anonymous memory.
+ *	Setting this implies MM_SEAL_BASE is also set.
+ *
+ *   The MM_SEAL_SEAL
+ *	MM_SEAL_SEAL denies adding a new seal for an VMA.
+ *
  *  flags: reserved.
  *
  * return values:
@@ -264,8 +327,8 @@ static int do_mseal(unsigned long start, size_t len_in, unsigned long types,
 	struct mm_struct *mm = current->mm;
 	size_t len;
 
-	/* MM_SEAL_BASE is set when other seal types are set. */
-	if (types & MM_SEAL_PROT_PKEY)
+	/* MM_SEAL_BASE is set when other seal types are set */
+	if (types & (MM_SEAL_PROT_PKEY | MM_SEAL_DISCARD_RO_ANON))
 		types |= MM_SEAL_BASE;
 
 	if (!can_do_mseal(types, flags))
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 09/11] mseal: add MAP_SEALABLE to mmap()
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (7 preceding siblings ...)
  2023-12-12 23:17 ` [RFC PATCH v3 08/11] mseal: add MM_SEAL_DISCARD_RO_ANON jeffxu
@ 2023-12-12 23:17 ` jeffxu
  2023-12-12 23:17 ` [RFC PATCH v3 10/11] selftest mm/mseal memory sealing jeffxu
  2023-12-12 23:17 ` [RFC PATCH v3 11/11] mseal:add documentation jeffxu
  10 siblings, 0 replies; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:17 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

The MAP_SEALABLE flag is added to the flags field of mmap().
When present, it marks the map as sealable. A map created
without MAP_SEALABLE will not support sealing; In other words,
mseal() will fail for such a map.

Applications that don't care about sealing will expect their
behavior unchanged. For those that need sealing support, opt-in
by adding MAP_SEALABLE when creating the map.

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/mm.h                     | 52 ++++++++++++++++++++++++--
 include/linux/mm_types.h               |  1 +
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              |  2 +-
 mm/mseal.c                             |  7 +++-
 5 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50dda474acc2..6f5dba9fbe21 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -267,6 +267,17 @@ extern unsigned int kobjsize(const void *objp);
 	MM_SEAL_PROT_PKEY | \
 	MM_SEAL_DISCARD_RO_ANON)
 
+/* define VM_SEALABLE in vm_seals of vm_area_struct. */
+#define VM_SEALABLE	_BITUL(31)
+
+/*
+ * VM_SEALS_BITS_ALL marks the bits used for
+ * sealing in vm_seals of vm_area_structure.
+ */
+#define VM_SEALS_BITS_ALL ( \
+	MM_SEAL_ALL | \
+	VM_SEALABLE)
+
 /*
  * PROT_SEAL_ALL is all supported flags in mmap().
  * See include/uapi/asm-generic/mman-common.h.
@@ -3330,9 +3341,17 @@ static inline void mm_populate(unsigned long addr, unsigned long len) {}
 
 #ifdef CONFIG_MSEAL
 /*
- * return the valid sealing (after mask).
+ * return the valid sealing (after mask), this includes sealable bit.
  */
 static inline unsigned long vma_seals(struct vm_area_struct *vma)
+{
+	return (vma->vm_seals & VM_SEALS_BITS_ALL);
+}
+
+/*
+ * return the enabled sealing type (after mask), without sealable bit.
+ */
+static inline unsigned long vma_enabled_seals(struct vm_area_struct *vma)
 {
 	return (vma->vm_seals & MM_SEAL_ALL);
 }
@@ -3342,9 +3361,14 @@ static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm
 	vma->vm_seals |= vm_seals;
 }
 
+static inline bool is_vma_sealable(struct vm_area_struct *vma)
+{
+	return vma->vm_seals & VM_SEALABLE;
+}
+
 static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2)
 {
-	if ((vm_seals1 & MM_SEAL_ALL) != (vm_seals2 & MM_SEAL_ALL))
+	if ((vm_seals1 & VM_SEALS_BITS_ALL) != (vm_seals2 & VM_SEALS_BITS_ALL))
 		return false;
 
 	return true;
@@ -3384,9 +3408,15 @@ static inline unsigned long convert_mmap_seals(unsigned long prot)
  * check input sealing type from the "prot" field of mmap().
  * for CONFIG_MSEAL case, this always return 0 (successful).
  */
-static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals)
+static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals,
+	unsigned long flags)
 {
 	*vm_seals = convert_mmap_seals(prot);
+	if (*vm_seals)
+	/* setting one of MM_SEAL_XX means the map is sealable. */
+		*vm_seals |= VM_SEALABLE;
+	else
+		*vm_seals |= (flags & MAP_SEALABLE) ? VM_SEALABLE:0;
 	return 0;
 }
 #else
@@ -3395,6 +3425,16 @@ static inline unsigned long vma_seals(struct vm_area_struct *vma)
 	return 0;
 }
 
+static inline unsigned long vma_enabled_seals(struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+static inline bool is_vma_sealable(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 static inline bool check_vma_seals_mergeable(unsigned long vm_seals1, unsigned long vm_seals2)
 {
 	return true;
@@ -3426,11 +3466,15 @@ static inline void update_vma_seals(struct vm_area_struct *vma, unsigned long vm
  * check input sealing type from the "prot" field of mmap().
  * For not CONFIG_MSEAL, if SEAL flag is set, it will return failure.
  */
-static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals)
+static inline int check_mmap_seals(unsigned long prot, unsigned long *vm_seals,
+	unsigned long flags)
 {
 	if (prot & PROT_SEAL_ALL)
 		return -EINVAL;
 
+	if (flags & MAP_SEALABLE)
+		return -EINVAL;
+
 	return 0;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 052799173c86..c9b04c545f39 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -691,6 +691,7 @@ struct vm_area_struct {
 	/*
 	 * bit masks for seal.
 	 * need this since vm_flags is full.
+	 * We could merge this into vm_flags if vm_flags ever get expanded.
 	 */
 	unsigned long vm_seals;		/* seal flags, see mm.h. */
 #endif
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bf503962409a..57ef4507c00b 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -47,6 +47,7 @@
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
+#define MAP_SEALABLE	0x8000000	/* map is sealable. */
 
 /*
  * Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 6da8d83f2e66..6e35e2070060 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1235,7 +1235,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (flags & MAP_FIXED_NOREPLACE)
 		flags |= MAP_FIXED;
 
-	if (check_mmap_seals(prot, &vm_seals) < 0)
+	if (check_mmap_seals(prot, &vm_seals, flags) < 0)
 		return -EINVAL;
 
 	if (!(flags & MAP_FIXED))
diff --git a/mm/mseal.c b/mm/mseal.c
index 294f48d33db6..5d4cf71b497e 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -121,9 +121,13 @@ bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start, unsigned long
  */
 static bool can_add_vma_seals(struct vm_area_struct *vma, unsigned long newSeals)
 {
+	/* if map is not sealable, reject. */
+	if (!is_vma_sealable(vma))
+		return false;
+
 	/* When SEAL_MSEAL is set, reject if a new type of seal is added. */
 	if ((vma->vm_seals & MM_SEAL_SEAL) &&
-	    (newSeals & ~(vma_seals(vma))))
+	    (newSeals & ~(vma_enabled_seals(vma))))
 		return false;
 
 	/*
@@ -185,6 +189,7 @@ static int mseal_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
  * 2> end is part of a valid vma.
  * 3> No gap (unallocated address) between start and end.
  * 4> requested seal type can be added in given address range.
+ * 5> map is sealable.
  */
 static int check_mm_seal(unsigned long start, unsigned long end,
 			 unsigned long newtypes)
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 10/11] selftest mm/mseal memory sealing
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (8 preceding siblings ...)
  2023-12-12 23:17 ` [RFC PATCH v3 09/11] mseal: add MAP_SEALABLE to mmap() jeffxu
@ 2023-12-12 23:17 ` jeffxu
  2023-12-31  6:39   ` Muhammad Usama Anjum
  2023-12-12 23:17 ` [RFC PATCH v3 11/11] mseal:add documentation jeffxu
  10 siblings, 1 reply; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:17 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

selftest for memory sealing change in mmap() and mseal().

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 tools/testing/selftests/mm/.gitignore   |    1 +
 tools/testing/selftests/mm/Makefile     |    1 +
 tools/testing/selftests/mm/config       |    1 +
 tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++++++
 4 files changed, 2144 insertions(+)
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index cdc9ce4426b9..f0f22a649985 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -43,3 +43,4 @@ mdwe_test
 gup_longterm
 mkdirty
 va_high_addr_switch
+mseal_test
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 6a9fc5693145..0c086cecc093 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mrelease_test
 TEST_GEN_FILES += mremap_dontunmap
 TEST_GEN_FILES += mremap_test
+TEST_GEN_FILES += mseal_test
 TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
diff --git a/tools/testing/selftests/mm/config b/tools/testing/selftests/mm/config
index be087c4bc396..cf2b8780e9b1 100644
--- a/tools/testing/selftests/mm/config
+++ b/tools/testing/selftests/mm/config
@@ -6,3 +6,4 @@ CONFIG_TEST_HMM=m
 CONFIG_GUP_TEST=y
 CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_MEM_SOFT_DIRTY=y
+CONFIG_MSEAL=y
diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
new file mode 100644
index 000000000000..0692485d8b3c
--- /dev/null
+++ b/tools/testing/selftests/mm/mseal_test.c
@@ -0,0 +1,2141 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/mman.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <stdbool.h>
+#include "../kselftest.h"
+#include <syscall.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <fcntl.h>
+#include <assert.h>
+#include <sys/ioctl.h>
+#include <sys/vfs.h>
+#include <sys/stat.h>
+
+/*
+ * need those definition for manually build using gcc.
+ * gcc -I ../../../../usr/include   -DDEBUG -O3  -DDEBUG -O3 mseal_test.c -o mseal_test
+ */
+#ifndef MM_SEAL_SEAL
+#define MM_SEAL_SEAL 0x1
+#endif
+
+#ifndef MM_SEAL_BASE
+#define MM_SEAL_BASE 0x2
+#endif
+
+#ifndef MM_SEAL_PROT_PKEY
+#define MM_SEAL_PROT_PKEY 0x4
+#endif
+
+#ifndef MM_SEAL_DISCARD_RO_ANON
+#define MM_SEAL_DISCARD_RO_ANON 0x8
+#endif
+
+#ifndef MAP_SEALABLE
+#define MAP_SEALABLE 0x8000000
+#endif
+
+#ifndef PROT_SEAL_SEAL
+#define PROT_SEAL_SEAL 0x04000000
+#endif
+
+#ifndef PROT_SEAL_BASE
+#define PROT_SEAL_BASE 0x08000000
+#endif
+
+#ifndef PROT_SEAL_PROT_PKEY
+#define PROT_SEAL_PROT_PKEY 0x10000000
+#endif
+
+#ifndef PROT_SEAL_DISCARD_RO_ANON
+#define PROT_SEAL_DISCARD_RO_ANON 0x20000000
+#endif
+
+#ifndef PKEY_DISABLE_ACCESS
+# define PKEY_DISABLE_ACCESS    0x1
+#endif
+
+#ifndef PKEY_DISABLE_WRITE
+# define PKEY_DISABLE_WRITE     0x2
+#endif
+
+#ifndef PKEY_BITS_PER_KEY
+#define PKEY_BITS_PER_PKEY      2
+#endif
+
+#ifndef PKEY_MASK
+#define PKEY_MASK       (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
+#endif
+
+#ifndef DEBUG
+#define LOG_TEST_ENTER()	{}
+#else
+#define LOG_TEST_ENTER()	{ksft_print_msg("%s\n", __func__); }
+#endif
+
+#ifndef u64
+#define u64 unsigned long long
+#endif
+
+/*
+ * define sys_xyx to call syscall directly.
+ */
+static int sys_mseal(void *start, size_t len, int types)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_mseal, start, len, types, 0);
+	return sret;
+}
+
+int sys_mprotect(void *ptr, size_t size, unsigned long prot)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(SYS_mprotect, ptr, size, prot);
+	return sret;
+}
+
+int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+		unsigned long pkey)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey);
+	return sret;
+}
+
+int sys_munmap(void *ptr, size_t size)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(SYS_munmap, ptr, size);
+	return sret;
+}
+
+static int sys_madvise(void *start, size_t len, int types)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(__NR_madvise, start, len, types);
+	return sret;
+}
+
+int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
+{
+	int ret = syscall(SYS_pkey_alloc, flags, init_val);
+	return ret;
+}
+
+static inline unsigned int __read_pkey_reg(void)
+{
+	unsigned int eax, edx;
+	unsigned int ecx = 0;
+	unsigned int pkey_reg;
+
+	asm volatile(".byte 0x0f,0x01,0xee\n\t"
+			: "=a" (eax), "=d" (edx)
+			: "c" (ecx));
+	pkey_reg = eax;
+	return pkey_reg;
+}
+
+static inline void __write_pkey_reg(u64 pkey_reg)
+{
+	unsigned int eax = pkey_reg;
+	unsigned int ecx = 0;
+	unsigned int edx = 0;
+
+	asm volatile(".byte 0x0f,0x01,0xef\n\t"
+			: : "a" (eax), "c" (ecx), "d" (edx));
+	assert(pkey_reg == __read_pkey_reg());
+}
+
+static inline unsigned long pkey_bit_position(int pkey)
+{
+	return pkey * PKEY_BITS_PER_PKEY;
+}
+
+static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags)
+{
+	unsigned long shift = pkey_bit_position(pkey);
+	/* mask out bits from pkey in old value */
+	reg &= ~((u64)PKEY_MASK << shift);
+	/* OR in new bits for pkey */
+	reg |= (flags & PKEY_MASK) << shift;
+	return reg;
+}
+
+static inline void set_pkey(int pkey, unsigned long pkey_value)
+{
+	unsigned long mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
+	u64 new_pkey_reg;
+
+	assert(!(pkey_value & ~mask));
+	new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value);
+	__write_pkey_reg(new_pkey_reg);
+}
+
+void setup_single_address(int size, void **ptrOut)
+{
+	void *ptr;
+
+	ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	assert(ptr != (void *)-1);
+	*ptrOut = ptr;
+}
+
+void setup_single_address_sealable(int size, void **ptrOut, bool sealable)
+{
+	void *ptr;
+	unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
+
+	if (sealable)
+		mapflags |= MAP_SEALABLE;
+
+	ptr = mmap(NULL, size, PROT_READ, mapflags, -1, 0);
+	assert(ptr != (void *)-1);
+	*ptrOut = ptr;
+}
+
+void setup_single_address_rw_sealable(int size, void **ptrOut, bool sealable)
+{
+	void *ptr;
+	unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
+
+	if (sealable)
+		mapflags |= MAP_SEALABLE;
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, mapflags, -1, 0);
+	assert(ptr != (void *)-1);
+	*ptrOut = ptr;
+}
+
+void clean_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+}
+
+void seal_mprotect_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+}
+
+void seal_discard_ro_anon_single_address(void *ptr, int size)
+{
+	int ret;
+
+	ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
+	assert(!ret);
+}
+
+static void test_seal_addseals(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* adding seal one by one */
+
+	ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+	assert(!ret);
+	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(!ret);
+}
+
+static void test_seal_addseals_combined(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+
+	/* adding multiple seals */
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_PROT_PKEY | MM_SEAL_BASE|
+			MM_SEAL_SEAL);
+	assert(!ret);
+
+	/* not adding more seal type, so ok. */
+	ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+	assert(!ret);
+
+	/* not adding more seal type, so ok. */
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(!ret);
+}
+
+static void test_seal_addseals_reject(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_BASE | MM_SEAL_SEAL);
+	assert(!ret);
+
+	/* MM_SEAL_SEAL is set, so not allow new seal type . */
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_PROT_PKEY | MM_SEAL_BASE | MM_SEAL_SEAL);
+	assert(ret < 0);
+}
+
+static void test_seal_unmapped_start(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// munmap 2 pages from ptr.
+	ret = sys_munmap(ptr, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail because 2 pages from ptr are unmapped.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	// mseal will fail because 2 pages from ptr are unmapped.
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_SEAL);
+	assert(!ret);
+}
+
+static void test_seal_unmapped_middle(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// munmap 2 pages from ptr + page.
+	ret = sys_munmap(ptr + page_size, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail, since size is 4 pages.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	// mseal will fail as well.
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(ret < 0);
+
+	/* we still can add seal to the first page and last page*/
+	ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(!ret);
+
+	ret = sys_mseal(ptr + 3 * page_size, page_size,
+			MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(!ret);
+}
+
+static void test_seal_unmapped_end(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// unmap last 2 pages.
+	ret = sys_munmap(ptr + 2 * page_size, 2 * page_size);
+	assert(!ret);
+
+	//mprotect will fail since last 2 pages are unmapped.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	//mseal will fail as well.
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(ret < 0);
+
+	/* The first 2 pages is not sealed, and can add seals */
+	ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(!ret);
+}
+
+static void test_seal_multiple_vmas(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split the vma into 3.
+	ret = sys_mprotect(ptr + page_size, 2 * page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// mprotect will get applied to all 4 pages - 3 VMAs.
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	assert(!ret);
+
+	// use mprotect to split the vma into 3.
+	ret = sys_mprotect(ptr + page_size, 2 * page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// mseal get applied to all 4 pages - 3 VMAs.
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(!ret);
+
+	// verify additional seal type will fail after MM_SEAL_SEAL set.
+	ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr + page_size, 2 * page_size,
+			MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr + 3 * page_size, page_size,
+			MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(ret < 0);
+}
+
+static void test_seal_split_start(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split at middle */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* seal the first page, this will split the VMA */
+	ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL);
+	assert(!ret);
+
+	/* can't add seal to the first page */
+	ret = sys_mseal(ptr, page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(ret < 0);
+
+	/* add seal to the remain 3 pages */
+	ret = sys_mseal(ptr + page_size, 3 * page_size,
+			MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(!ret);
+}
+
+static void test_seal_split_end(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split at middle */
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* seal the last page */
+	ret = sys_mseal(ptr + 3 * page_size, page_size, MM_SEAL_SEAL);
+	assert(!ret);
+
+	/* adding seal to the last page is rejected. */
+	ret = sys_mseal(ptr + 3 * page_size, page_size,
+			MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(ret < 0);
+
+	/* Adding seals to the first 3 pages */
+	ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_SEAL | MM_SEAL_PROT_PKEY);
+	assert(!ret);
+}
+
+static void test_seal_invalid_input(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(8 * page_size, &ptr);
+	clean_single_address(ptr + 4 * page_size, 4 * page_size);
+
+	/* invalid flag */
+	ret = sys_mseal(ptr, size, 0x20);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr, size, 0x31);
+	assert(ret < 0);
+
+	ret = sys_mseal(ptr, size, 0x3F);
+	assert(ret < 0);
+
+	/* unaligned address */
+	ret = sys_mseal(ptr + 1, 2 * page_size, MM_SEAL_SEAL);
+	assert(ret < 0);
+
+	/* length too big */
+	ret = sys_mseal(ptr, 5 * page_size, MM_SEAL_SEAL);
+	assert(ret < 0);
+
+	/* start is not in a valid VMA */
+	ret = sys_mseal(ptr - page_size, 5 * page_size, MM_SEAL_SEAL);
+	assert(ret < 0);
+}
+
+static void test_seal_zero_length(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mprotect(ptr, 0, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* seal 0 length will be OK, same as mprotect */
+	ret = sys_mseal(ptr, 0, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+
+	// verify the 4 pages are not sealed by previous call.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+}
+
+static void test_seal_twice(void)
+{
+	LOG_TEST_ENTER();
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+
+	// apply the same seal will be OK. idempotent.
+	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_PROT_PKEY | MM_SEAL_BASE |
+			MM_SEAL_SEAL);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size,
+			MM_SEAL_PROT_PKEY | MM_SEAL_BASE |
+			MM_SEAL_SEAL);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(!ret);
+}
+
+static void test_seal_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, size);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_start_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size);
+
+	// the first page is sealed.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// pages after the first page is not sealed.
+	ret = sys_mprotect(ptr + page_size, page_size * 3,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+}
+
+static void test_seal_end_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr + page_size, 3 * page_size);
+
+	/* first page is not sealed */
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	/* last 3 page are sealed */
+	ret = sys_mprotect(ptr + page_size, page_size * 3,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_mprotect_unalign_len(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size * 2 - 1);
+
+	// 2 pages are sealed.
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 2, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+}
+
+static void test_seal_mprotect_unalign_len_variant_2(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size * 2 + 1);
+
+	// 3 pages are sealed.
+	ret = sys_mprotect(ptr, page_size * 3, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 3, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+}
+
+static void test_seal_mprotect_two_vma(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size * 4);
+
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr + page_size * 2, page_size * 2,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_mprotect_two_vma_with_split(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split as two vma.
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// mseal can apply across 2 vma, also split them.
+	if (seal)
+		seal_mprotect_single_address(ptr + page_size, page_size * 2);
+
+	// the first page is not sealed.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// the second page is sealed.
+	ret = sys_mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// the third page is sealed.
+	ret = sys_mprotect(ptr + 2 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// the fouth page is not sealed.
+	ret = sys_mprotect(ptr + 3 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+}
+
+static void test_seal_mprotect_partial_mprotect(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// seal one page.
+	if (seal)
+		seal_mprotect_single_address(ptr, page_size);
+
+	// mprotect first 2 page will fail, since the first page are sealed.
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ | PROT_WRITE);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_mprotect_two_vma_with_gap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// use mprotect to split.
+	ret = sys_mprotect(ptr + 3 * page_size, page_size,
+			PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// use munmap to free two pages in the middle
+	ret = sys_munmap(ptr + page_size, 2 * page_size);
+	assert(!ret);
+
+	// mprotect will fail, because there is a gap in the address.
+	// notes, internally mprotect still updated the first page.
+	ret = sys_mprotect(ptr, 4 * page_size, PROT_READ);
+	assert(ret < 0);
+
+	// mseal will fail as well.
+	ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_PROT_PKEY);
+	assert(ret < 0);
+
+	// unlike mprotect, the first page is not sealed.
+	ret = sys_mprotect(ptr, page_size, PROT_READ);
+	assert(ret == 0);
+
+	// the last page is not sealed.
+	ret = sys_mprotect(ptr + 3 * page_size, page_size, PROT_READ);
+	assert(ret == 0);
+}
+
+static void test_seal_mprotect_split(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	//use mprotect to split.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	//seal all 4 pages.
+	if (seal) {
+		ret = sys_mseal(ptr, 4 * page_size, MM_SEAL_PROT_PKEY);
+		assert(!ret);
+	}
+
+	//madvice is OK.
+	ret = sys_madvise(ptr, page_size * 2, MADV_WILLNEED);
+	assert(!ret);
+
+	//mprotect is sealed.
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+
+	ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_mprotect_merge(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// use mprotect to split one page.
+	ret = sys_mprotect(ptr, page_size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	// seal first two pages.
+	if (seal) {
+		ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_PROT_PKEY);
+		assert(!ret);
+	}
+
+	ret = sys_madvise(ptr, page_size, MADV_WILLNEED);
+	assert(!ret);
+
+	// 2 pages are sealed.
+	ret = sys_mprotect(ptr, 2 * page_size, PROT_READ);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// last 2 pages are not sealed.
+	ret = sys_mprotect(ptr + 2 * page_size, 2 * page_size, PROT_READ);
+	assert(ret == 0);
+}
+
+static void test_seal_munmap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// 4 pages are sealed.
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+/*
+ * allocate 4 pages,
+ * use mprotect to split it as two VMAs
+ * seal the whole range
+ * munmap will fail on both
+ */
+static void test_seal_munmap_two_vma(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	/* use mprotect to split */
+	ret = sys_mprotect(ptr, page_size * 2, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	ret = sys_munmap(ptr, page_size * 2);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+/*
+ * allocate a VMA with 4 pages.
+ * munmap the middle 2 pages.
+ * seal the whole 4 pages, will fail.
+ * note: one of the pages are sealed
+ * munmap the first page will be OK.
+ * munmap the last page will be OK.
+ */
+static void test_seal_munmap_vma_with_gap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	assert(!ret);
+
+	if (seal) {
+		// can't have gap in the middle.
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(ret < 0);
+	}
+
+	ret = sys_munmap(ptr, page_size);
+	assert(!ret);
+
+	ret = sys_munmap(ptr + page_size * 2, page_size);
+	assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	assert(!ret);
+}
+
+static void test_munmap_start_freed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	// unmap the first page.
+	ret = sys_munmap(ptr, page_size);
+	assert(!ret);
+
+	// seal the last 3 pages.
+	if (seal) {
+		ret = sys_mseal(ptr + page_size, 3 * page_size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// unmap from the first page.
+	ret = sys_munmap(ptr, size);
+	if (seal) {
+		assert(ret < 0);
+
+		// use mprotect to verify page is not unmapped.
+		ret = sys_mprotect(ptr + page_size, 3 * page_size, PROT_READ);
+		assert(!ret);
+	} else
+		// note: this will be OK, even the first page is
+		// already unmapped.
+		assert(!ret);
+}
+
+static void test_munmap_end_freed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	// unmap last page.
+	ret = sys_munmap(ptr + page_size * 3, page_size);
+	assert(!ret);
+
+	// seal the first 3 pages.
+	if (seal) {
+		ret = sys_mseal(ptr, 3 * page_size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// unmap all pages.
+	ret = sys_munmap(ptr, size);
+	if (seal) {
+		assert(ret < 0);
+
+		// use mprotect to verify page is not unmapped.
+		ret = sys_mprotect(ptr, 3 * page_size, PROT_READ);
+		assert(!ret);
+	} else
+		assert(!ret);
+}
+
+static void test_munmap_middle_freed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+	// unmap 2 pages in the middle.
+	ret = sys_munmap(ptr + page_size, page_size * 2);
+	assert(!ret);
+
+	// seal the first page.
+	if (seal) {
+		ret = sys_mseal(ptr, page_size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// munmap all 4 pages.
+	ret = sys_munmap(ptr, size);
+	if (seal) {
+		assert(ret < 0);
+
+		// use mprotect to verify page is not unmapped.
+		ret = sys_mprotect(ptr, page_size, PROT_READ);
+		assert(!ret);
+	} else
+		assert(!ret);
+}
+
+void test_seal_mremap_shrink(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// shrink from 4 pages to 2 pages.
+	ret2 = mremap(ptr, size, 2 * page_size, 0, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+
+	}
+}
+
+void test_seal_mremap_expand(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	// ummap last 2 pages.
+	ret = sys_munmap(ptr + 2 * page_size, 2 * page_size);
+	assert(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, 2 * page_size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// expand from 2 page to 4 pages.
+	ret2 = mremap(ptr, 2 * page_size, 4 * page_size, 0, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 == ptr);
+
+	}
+}
+
+void test_seal_mremap_move(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr, *newPtr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_address(size, &newPtr);
+	clean_single_address(newPtr, size);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// move from ptr to fixed address.
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newPtr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+
+	}
+}
+
+void test_seal_mmap_overwrite_prot(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// use mmap to change protection.
+	ret2 = mmap(ptr, size, PROT_NONE,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == ptr);
+}
+
+void test_seal_mmap_expand(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 12 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	// ummap last 4 pages.
+	ret = sys_munmap(ptr + 8 * page_size, 4 * page_size);
+	assert(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, 8 * page_size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// use mmap to expand.
+	ret2 = mmap(ptr, size, PROT_READ,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == ptr);
+}
+
+void test_seal_mmap_shrink(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 12 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// use mmap to shrink.
+	ret2 = mmap(ptr, 8 * page_size, PROT_READ,
+			MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == ptr);
+}
+
+void test_seal_mremap_shrink_fixed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// mremap to move and shrink to fixed address
+	ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			newAddr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == newAddr);
+}
+
+void test_seal_mremap_expand_fixed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(page_size, &ptr);
+	setup_single_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(newAddr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// mremap to move and expand to fixed address
+	ret2 = mremap(ptr, page_size, size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			newAddr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == newAddr);
+}
+
+void test_seal_mremap_move_fixed(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+	setup_single_address(size, &newAddr);
+
+	if (seal) {
+		ret = sys_mseal(newAddr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// mremap to move to fixed address
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, newAddr);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else
+		assert(ret2 == newAddr);
+}
+
+void test_seal_mremap_move_fixed_zero(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	/*
+	 * MREMAP_FIXED can move the mapping to zero address
+	 */
+	ret2 = mremap(ptr, size, 2 * page_size, MREMAP_MAYMOVE | MREMAP_FIXED,
+			0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 == 0);
+
+	}
+}
+
+void test_seal_mremap_move_dontunmap(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	// mremap to move, and don't unmap src addr.
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, 0);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+
+	}
+}
+
+void test_seal_mremap_move_dontunmap_anyaddr(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	void *newAddr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	setup_single_address(size, &ptr);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+		assert(!ret);
+	}
+
+	/*
+	 * The 0xdeaddead should not have effect on dest addr
+	 * when MREMAP_DONTUNMAP is set.
+	 */
+	ret2 = mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+			0xdeaddead);
+	if (seal) {
+		assert(ret2 == MAP_FAILED);
+		assert(errno == EACCES);
+	} else {
+		assert(ret2 != MAP_FAILED);
+		assert((long)ret2 != 0xdeaddead);
+
+	}
+}
+
+unsigned long get_vma_size(void *addr)
+{
+	FILE *maps;
+	char line[256];
+	int size = 0;
+	uintptr_t  addr_start, addr_end;
+
+	maps = fopen("/proc/self/maps", "r");
+	if (!maps)
+		return 0;
+
+	while (fgets(line, sizeof(line), maps)) {
+		if (sscanf(line, "%lx-%lx", &addr_start, &addr_end) == 2) {
+			if (addr_start == (uintptr_t) addr) {
+				size = addr_end - addr_start;
+				break;
+			}
+		}
+	}
+	fclose(maps);
+	return size;
+}
+
+void test_seal_mmap_seal_base(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_SEAL_BASE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+
+	ret = sys_munmap(ptr, size);
+	assert(ret < 0);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	assert(ret < 0);
+}
+
+void test_seal_mmap_seal_mprotect(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_SEAL_PROT_PKEY, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+
+	ret = sys_munmap(ptr, size);
+	assert(ret < 0);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+}
+
+void test_seal_mmap_seal_mseal(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	void *ret2;
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_SEAL_SEAL, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_BASE);
+	assert(ret < 0);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	assert(!ret);
+}
+
+void test_seal_merge_and_split(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size;
+	int ret;
+	void *ret2;
+
+	// (24 RO)
+	setup_single_address(24 * page_size, &ptr);
+
+	// use mprotect(NONE) to set out boundary
+	// (1 NONE) (22 RO) (1 NONE)
+	ret = sys_mprotect(ptr, page_size, PROT_NONE);
+	assert(!ret);
+	ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE);
+	assert(!ret);
+	size = get_vma_size(ptr + page_size);
+	assert(size == 22 * page_size);
+
+	// use mseal to split from beginning
+	// (1 NONE) (1 RO_SBASE) (21 RO) (1 NONE)
+	ret = sys_mseal(ptr + page_size, page_size, MM_SEAL_BASE);
+	assert(!ret);
+	size = get_vma_size(ptr + page_size);
+	assert(size == page_size);
+	size = get_vma_size(ptr + 2 * page_size);
+	assert(size == 21 * page_size);
+
+	// use mseal to split from the end.
+	// (1 NONE) (1 RO_SBASE) (20 RO) (1 RO_SBASE) (1 NONE)
+	ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_BASE);
+	assert(!ret);
+	size = get_vma_size(ptr + 22 * page_size);
+	assert(size == page_size);
+	size = get_vma_size(ptr + 2 * page_size);
+	assert(size == 20 * page_size);
+
+	// merge with prev.
+	// (1 NONE) (2 RO_SBASE) (19 RO) (1 RO_SBASE) (1 NONE)
+	ret = sys_mseal(ptr + 2 * page_size, page_size, MM_SEAL_BASE);
+	assert(!ret);
+	size = get_vma_size(ptr +  page_size);
+	assert(size ==  2 * page_size);
+
+	// merge with after.
+	// (1 NONE) (2 RO_SBASE) (18 RO) (2 RO_SBASES) (1 NONE)
+	ret = sys_mseal(ptr + 21 * page_size, page_size, MM_SEAL_BASE);
+	assert(!ret);
+	size = get_vma_size(ptr +  21 * page_size);
+	assert(size ==  2 * page_size);
+
+	// split from prev
+	// (1 NONE) (1 RO_SBASE) (2RO_SPROT) (17 RO) (2 RO_SBASES) (1 NONE)
+	ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+	size = get_vma_size(ptr +  2 * page_size);
+	assert(size ==  2 * page_size);
+	ret = sys_munmap(ptr + page_size,  page_size);
+	assert(ret < 0);
+	ret = sys_mprotect(ptr + 2 * page_size, page_size,  PROT_NONE);
+	assert(ret < 0);
+
+	// split from next
+	// (1 NONE) (1 RO_SBASE) (2 RO_SPROT) (16 RO) (2 RO_SPROT) (1 RO_SBASES) (1 NONE)
+	ret = sys_mseal(ptr + 20 * page_size, 2 * page_size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+	size = get_vma_size(ptr +  20 * page_size);
+	assert(size ==  2 * page_size);
+
+	// merge from middle of prev and middle of next.
+	// (1 NONE) (1 RW_SBASE) (20 RO_SPROT) (1 RW_SBASES) (1 NONE)
+	ret = sys_mseal(ptr + 3 * page_size, 18 * page_size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+	size = get_vma_size(ptr +  2 * page_size);
+	assert(size ==  20 * page_size);
+
+	size = get_vma_size(ptr +  22 * page_size);
+	assert(size == page_size);
+
+	size = get_vma_size(ptr +  23 * page_size);
+	assert(size == page_size);
+
+	// Add split using SEAL_ALL
+	// (1 NONE) (1 RW_SBASE) (1 RO_SALL) (18 RO_SPROT) (1 RO_SALL) (1 RW_SBASES) (1 NONE)
+	ret = sys_mseal(ptr + 2 * page_size, page_size,
+				MM_SEAL_PROT_PKEY | MM_SEAL_DISCARD_RO_ANON);
+	assert(!ret);
+	size = get_vma_size(ptr +  2 * page_size);
+	assert(size ==  1 * page_size);
+
+	ret = sys_mseal(ptr + 21 * page_size, page_size,
+				MM_SEAL_PROT_PKEY | MM_SEAL_DISCARD_RO_ANON);
+	assert(!ret);
+	size = get_vma_size(ptr +  21 * page_size);
+	assert(size ==  1 * page_size);
+
+	// add a new seal type, and merge with next
+	// (1 NONE) (2 RO_SALL) (18 RO_SPROT) (2 RO_SALL) (1 NONE)
+	ret = sys_mprotect(ptr + page_size,  page_size, PROT_READ);
+	assert(!ret);
+	ret = sys_mseal(ptr + page_size, page_size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+	ret = sys_mseal(ptr + page_size, page_size, MM_SEAL_DISCARD_RO_ANON);
+	assert(!ret);
+	size = get_vma_size(ptr +  page_size);
+	assert(size ==  2 * page_size);
+
+	ret = sys_mprotect(ptr + 22 * page_size, page_size, PROT_READ);
+	assert(!ret);
+	ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_PROT_PKEY);
+	assert(!ret);
+	ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_DISCARD_RO_ANON);
+	assert(!ret);
+	size = get_vma_size(ptr +  page_size);
+	assert(size ==  2 * page_size);
+}
+
+void test_seal_mmap_merge(void)
+{
+	LOG_TEST_ENTER();
+
+	void *ptr, *ptr2;
+	unsigned long page_size = getpagesize();
+	unsigned long size;
+	int ret;
+	void *ret2;
+
+	// (24 RO)
+	setup_single_address(24 * page_size, &ptr);
+
+	// use mprotect(NONE) to set out boundary
+	// (1 NONE) (22 RO) (1 NONE)
+	ret = sys_mprotect(ptr, page_size, PROT_NONE);
+	assert(!ret);
+	ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE);
+	assert(!ret);
+	size = get_vma_size(ptr + page_size);
+	assert(size == 22 * page_size);
+
+	// use munmap to free 2 segment of memory.
+	// (1 NONE) (1 free) (20 RO) (1 free) (1 NONE)
+	ret = sys_munmap(ptr + page_size, page_size);
+	assert(!ret);
+
+	ret = sys_munmap(ptr + 22 * page_size, page_size);
+	assert(!ret);
+
+	// apply seal to the middle
+	// (1 NONE) (1 free) (20 RO_SBASE) (1 free) (1 NONE)
+	ret = sys_mseal(ptr + 2 * page_size, 20 * page_size, MM_SEAL_BASE);
+	assert(!ret);
+	size = get_vma_size(ptr + 2 * page_size);
+	assert(size == 20 * page_size);
+
+	// allocate a mapping at beginning, and make sure it merges.
+	// (1 NONE) (21 RO_SBASE) (1 free) (1 NONE)
+	ptr2 = mmap(ptr + page_size, page_size, PROT_READ | PROT_SEAL_BASE,
+		MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	assert(ptr != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	assert(size == 21 * page_size);
+
+	// allocate a mapping at end, and make sure it merges.
+	// (1 NONE) (22 RO_SBASE) (1 NONE)
+	ptr2 = mmap(ptr + 22 * page_size, page_size, PROT_READ | PROT_SEAL_BASE,
+		MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	assert(ptr != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	assert(size == 22 * page_size);
+}
+
+static void test_not_sealable(void)
+{
+	int ret;
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+
+	ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+
+	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
+	assert(ret < 0);
+}
+
+static void test_merge_sealable(void)
+{
+	int ret;
+	void *ptr, *ptr2;
+	unsigned long page_size = getpagesize();
+	unsigned long size;
+
+	// (24 RO)
+	setup_single_address(24 * page_size, &ptr);
+
+	// use mprotect(NONE) to set out boundary
+	// (1 NONE) (22 RO) (1 NONE)
+	ret = sys_mprotect(ptr, page_size, PROT_NONE);
+	assert(!ret);
+	ret = sys_mprotect(ptr + 23 * page_size, page_size, PROT_NONE);
+	assert(!ret);
+	size = get_vma_size(ptr + page_size);
+	assert(size == 22 * page_size);
+
+	// (1 NONE) (RO) (4 free) (17 RO) (1 NONE)
+	ret = sys_munmap(ptr + 2 * page_size,  4 * page_size);
+	assert(!ret);
+	size = get_vma_size(ptr + page_size);
+	assert(size == 1 * page_size);
+	size = get_vma_size(ptr +  6 * page_size);
+	assert(size == 17 * page_size);
+
+	// (1 NONE) (RO) (1 free) (2 RO) (1 free) (17 RO) (1 NONE)
+	ptr2 = mmap(ptr + 3 * page_size, 2 * page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	size = get_vma_size(ptr + 3 * page_size);
+	assert(size == 2 * page_size);
+
+	// (1 NONE) (RO) (1 free) (20 RO) (1 NONE)
+	ptr2 = mmap(ptr + 5 * page_size, 1 * page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
+	assert(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + 3 * page_size);
+	assert(size == 20 * page_size);
+
+	// (1 NONE) (RO) (1 free) (19 RO) (1 RO_SB) (1 NONE)
+	ret = sys_mseal(ptr + 22 * page_size, page_size, MM_SEAL_BASE);
+	assert(!ret);
+
+	// (1 NONE) (RO) (not sealable) (19 RO) (1 RO_SB) (1 NONE)
+	ptr2 = mmap(ptr + 2 * page_size, page_size, PROT_READ,
+		MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr2 != (void *)-1);
+	size = get_vma_size(ptr + page_size);
+	assert(size == page_size);
+	size = get_vma_size(ptr + 2 * page_size);
+	assert(size == page_size);
+}
+
+static void test_seal_discard_ro_anon_on_rw(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address_rw_sealable(size, &ptr, seal);
+	assert(ptr != (void *)-1);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
+		assert(!ret);
+	}
+
+	// sealing doesn't take effect on RW memory.
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	assert(!ret);
+
+	// base seal still apply.
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_discard_ro_anon_on_pkey(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	int pkey;
+
+	setup_single_address_rw_sealable(size, &ptr, seal);
+	assert(ptr != (void *)-1);
+
+	pkey = sys_pkey_alloc(0, 0);
+	assert(pkey > 0);
+
+	ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ | PROT_WRITE, pkey);
+	assert(!ret);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
+		assert(!ret);
+	}
+
+	// sealing doesn't take effect if PKRU allow write.
+	set_pkey(pkey, 0);
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	assert(!ret);
+
+	// sealing will take effect if PKRU deny write.
+	set_pkey(pkey, PKEY_DISABLE_WRITE);
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// base seal still apply.
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_discard_ro_anon_on_filebacked(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	int fd;
+	unsigned long mapflags = MAP_PRIVATE;
+
+	if (seal)
+		mapflags |= MAP_SEALABLE;
+
+	fd = memfd_create("test", 0);
+	assert(fd > 0);
+
+	ret = fallocate(fd, 0, 0, size);
+	assert(!ret);
+
+	ptr = mmap(NULL, size, PROT_READ, mapflags, fd, 0);
+	assert(ptr != MAP_FAILED);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
+		assert(!ret);
+	}
+
+	// sealing doesn't apply for file backed mapping.
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+	close(fd);
+}
+
+static void test_seal_discard_ro_anon_on_shared(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	unsigned long mapflags = MAP_ANONYMOUS | MAP_SHARED;
+
+	if (seal)
+		mapflags |= MAP_SEALABLE;
+
+	ptr = mmap(NULL, size, PROT_READ, mapflags, -1, 0);
+	assert(ptr != (void *)-1);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
+		assert(!ret);
+	}
+
+	// sealing doesn't apply for shared mapping.
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_seal_discard_ro_anon_invalid_shared(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+	int fd;
+
+	fd = open("/proc/self/maps", O_RDONLY);
+	ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, fd, 0);
+	assert(ptr != (void *)-1);
+
+	if (seal) {
+		ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
+		assert(!ret);
+	}
+
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	assert(ret < 0);
+	close(fd);
+}
+
+static void test_seal_discard_ro_anon(bool seal)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	setup_single_address(size, &ptr);
+
+	if (seal)
+		seal_discard_ro_anon_single_address(ptr, size);
+
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_munmap(ptr, size);
+	if (seal)
+		assert(ret < 0);
+	else
+		assert(!ret);
+}
+
+static void test_mmap_seal_discard_ro_anon(void)
+{
+	LOG_TEST_ENTER();
+	void *ptr;
+	unsigned long page_size = getpagesize();
+	unsigned long size = 4 * page_size;
+	int ret;
+
+	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_SEAL_DISCARD_RO_ANON,
+			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	assert(!ret);
+
+	ret = sys_madvise(ptr, size, MADV_DONTNEED);
+	assert(ret < 0);
+
+	ret = sys_munmap(ptr, size);
+	assert(ret < 0);
+}
+
+bool seal_support(void)
+{
+	void *ptr;
+	unsigned long page_size = getpagesize();
+
+	ptr = mmap(NULL, page_size, PROT_READ | PROT_SEAL_BASE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (ptr == (void *) -1)
+		return false;
+	return true;
+}
+
+bool pkey_supported(void)
+{
+	int pkey = sys_pkey_alloc(0, 0);
+
+	if (pkey > 0)
+		return true;
+	return false;
+}
+
+int main(int argc, char **argv)
+{
+	bool test_seal = seal_support();
+
+	if (!test_seal) {
+		ksft_print_msg("%s CONFIG_MSEAL might be disabled, skip test\n", __func__);
+		return 0;
+	}
+
+	test_seal_invalid_input();
+	test_seal_addseals();
+	test_seal_addseals_combined();
+	test_seal_addseals_reject();
+	test_seal_unmapped_start();
+	test_seal_unmapped_middle();
+	test_seal_unmapped_end();
+	test_seal_multiple_vmas();
+	test_seal_split_start();
+	test_seal_split_end();
+
+	test_seal_zero_length();
+	test_seal_twice();
+
+	test_seal_mprotect(false);
+	test_seal_mprotect(true);
+
+	test_seal_start_mprotect(false);
+	test_seal_start_mprotect(true);
+
+	test_seal_end_mprotect(false);
+	test_seal_end_mprotect(true);
+
+	test_seal_mprotect_unalign_len(false);
+	test_seal_mprotect_unalign_len(true);
+
+	test_seal_mprotect_unalign_len_variant_2(false);
+	test_seal_mprotect_unalign_len_variant_2(true);
+
+	test_seal_mprotect_two_vma(false);
+	test_seal_mprotect_two_vma(true);
+
+	test_seal_mprotect_two_vma_with_split(false);
+	test_seal_mprotect_two_vma_with_split(true);
+
+	test_seal_mprotect_partial_mprotect(false);
+	test_seal_mprotect_partial_mprotect(true);
+
+	test_seal_mprotect_two_vma_with_gap(false);
+	test_seal_mprotect_two_vma_with_gap(true);
+
+	test_seal_mprotect_merge(false);
+	test_seal_mprotect_merge(true);
+
+	test_seal_mprotect_split(false);
+	test_seal_mprotect_split(true);
+
+	test_seal_munmap(false);
+	test_seal_munmap(true);
+	test_seal_munmap_two_vma(false);
+	test_seal_munmap_two_vma(true);
+	test_seal_munmap_vma_with_gap(false);
+	test_seal_munmap_vma_with_gap(true);
+
+	test_munmap_start_freed(false);
+	test_munmap_start_freed(true);
+	test_munmap_middle_freed(false);
+	test_munmap_middle_freed(true);
+	test_munmap_end_freed(false);
+	test_munmap_end_freed(true);
+
+	test_seal_mremap_shrink(false);
+	test_seal_mremap_shrink(true);
+	test_seal_mremap_expand(false);
+	test_seal_mremap_expand(true);
+	test_seal_mremap_move(false);
+	test_seal_mremap_move(true);
+
+	test_seal_mremap_shrink_fixed(false);
+	test_seal_mremap_shrink_fixed(true);
+	test_seal_mremap_expand_fixed(false);
+	test_seal_mremap_expand_fixed(true);
+	test_seal_mremap_move_fixed(false);
+	test_seal_mremap_move_fixed(true);
+	test_seal_mremap_move_dontunmap(false);
+	test_seal_mremap_move_dontunmap(true);
+	test_seal_mremap_move_fixed_zero(false);
+	test_seal_mremap_move_fixed_zero(true);
+	test_seal_mremap_move_dontunmap_anyaddr(false);
+	test_seal_mremap_move_dontunmap_anyaddr(true);
+	test_seal_discard_ro_anon(false);
+	test_seal_discard_ro_anon(true);
+	test_seal_discard_ro_anon_on_rw(false);
+	test_seal_discard_ro_anon_on_rw(true);
+	test_seal_discard_ro_anon_on_shared(false);
+	test_seal_discard_ro_anon_on_shared(true);
+	test_seal_discard_ro_anon_on_filebacked(false);
+	test_seal_discard_ro_anon_on_filebacked(true);
+	test_seal_mmap_overwrite_prot(false);
+	test_seal_mmap_overwrite_prot(true);
+	test_seal_mmap_expand(false);
+	test_seal_mmap_expand(true);
+	test_seal_mmap_shrink(false);
+	test_seal_mmap_shrink(true);
+
+	test_seal_mmap_seal_base();
+	test_seal_mmap_seal_mprotect();
+	test_seal_mmap_seal_mseal();
+	test_mmap_seal_discard_ro_anon();
+	test_seal_merge_and_split();
+	test_seal_mmap_merge();
+
+	test_not_sealable();
+	test_merge_sealable();
+
+	if (pkey_supported()) {
+		test_seal_discard_ro_anon_on_pkey(false);
+		test_seal_discard_ro_anon_on_pkey(true);
+	}
+
+	ksft_print_msg("Done\n");
+	return 0;
+}
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
                   ` (9 preceding siblings ...)
  2023-12-12 23:17 ` [RFC PATCH v3 10/11] selftest mm/mseal memory sealing jeffxu
@ 2023-12-12 23:17 ` jeffxu
  2023-12-13  0:39   ` Linus Torvalds
  10 siblings, 1 reply; 28+ messages in thread
From: jeffxu @ 2023-12-12 23:17 UTC (permalink / raw)
  To: akpm, keescook, jannh, sroettger, willy, gregkh, torvalds
  Cc: jeffxu, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt, Jeff Xu

From: Jeff Xu <jeffxu@chromium.org>

Add documentation for mseal().

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 Documentation/userspace-api/mseal.rst | 189 ++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 Documentation/userspace-api/mseal.rst

diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst
new file mode 100644
index 000000000000..651c618d0664
--- /dev/null
+++ b/Documentation/userspace-api/mseal.rst
@@ -0,0 +1,189 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Introduction of mseal
+=====================
+
+:Author: Jeff Xu <jeffxu@chromium.org>
+
+Modern CPUs support memory permissions such as RW and NX bits. The memory
+permission feature improves security stance on memory corruption bugs, i.e.
+the attacker can’t just write to arbitrary memory and point the code to it,
+the memory has to be marked with X bit, or else an exception will happen.
+
+Memory sealing additionally protects the mapping itself against
+modifications. This is useful to mitigate memory corruption issues where a
+corrupted pointer is passed to a memory management system. For example,
+such an attacker primitive can break control-flow integrity guarantees
+since read-only memory that is supposed to be trusted can become writable
+or .text pages can get remapped. Memory sealing can automatically be
+applied by the runtime loader to seal .text and .rodata pages and
+applications can additionally seal security critical data at runtime.
+
+A similar feature already exists in the XNU kernel with the
+VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
+
+User API
+========
+Two system calls are involved in virtual memory sealing, ``mseal()`` and ``mmap()``.
+
+``mseal()``
+-----------
+
+The ``mseal()`` is an architecture independent syscall, and with following
+signature:
+
+``int mseal(void addr, size_t len, unsigned long types, unsigned long flags)``
+
+**addr/len**: virtual memory address range.
+
+The address range set by ``addr``/``len`` must meet:
+   - start (addr) must be in a valid VMA.
+   - end (addr + len) must be in a valid VMA.
+   - no gap (unallocated memory) between start and end.
+   - start (addr) must be page aligned.
+
+The ``len`` will be paged aligned implicitly by kernel.
+
+**types**: bit mask to specify the sealing types, they are:
+
+- The ``MM_SEAL_BASE``: Prevent VMA from:
+
+    Unmapping, moving to another location, and shrinking the size,
+    via munmap() and mremap(), can leave an empty space, therefore
+    can be replaced with a VMA with a new set of attributes.
+
+    Move or expand a different vma into the current location,
+    via mremap().
+
+    Modifying sealed VMA via mmap(MAP_FIXED).
+
+    Size expansion, via mremap(), does not appear to pose any
+    specific risks to sealed VMAs. It is included anyway because
+    the use case is unclear. In any case, users can rely on
+    merging to expand a sealed VMA.
+
+    We consider the MM_SEAL_BASE feature, on which other sealing
+    features will depend. For instance, it probably does not make sense
+    to seal PROT_PKEY without sealing the BASE, and the kernel will
+    implicitly add SEAL_BASE for SEAL_PROT_PKEY. (If the application
+    wants to relax this in future, we could use the “flags” field in
+    mseal() to overwrite this the behavior.)
+
+- The ``MM_SEAL_PROT_PKEY``:
+
+    Seal PROT and PKEY of the address range, in other words,
+    mprotect() and pkey_mprotect() will be denied if the memory is
+    sealed with MM_SEAL_PROT_PKEY.
+
+- The ``MM_SEAL_DISCARD_RO_ANON``:
+
+    Certain types of madvise() operations are destructive [3], such
+    as MADV_DONTNEED, which can effectively alter region contents by
+    discarding pages, especially when memory is anonymous. This blocks
+    such operations for anonymous memory which is not writable to the
+    user.
+
+- The ``MM_SEAL_SEAL``
+    Denies adding a new seal.
+
+**flags**: reserved for future use.
+
+**return values**:
+
+- ``0``:
+    - Success.
+
+- ``-EINVAL``:
+    - Invalid seal type.
+    - Invalid input flags.
+    - Start address is not page aligned.
+    - Address range (``addr`` + ``len``) overflow.
+
+- ``-ENOMEM``:
+    - ``addr`` is not a valid address (not allocated).
+    - End address (``addr`` + ``len``) is not a valid address.
+    - A gap (unallocated memory) between start and end.
+
+- ``-EACCES``:
+    - ``MM_SEAL_SEAL`` is set, adding a new seal is not allowed.
+    - Address range is not sealable, e.g. ``MAP_SEALABLE`` is not
+      set during ``mmap()``.
+
+**Note**:
+
+- User can call mseal(2) multiple times to add new seal types.
+- Adding an already added seal type is a no-action (no error).
+- unseal() or removing a seal type is not supported.
+- In case of error return, one can expect the memory range is unchanged.
+
+``mmap()``
+----------
+``void *mmap(void* addr, size_t length, int prot, int flags, int fd,
+off_t offset);``
+
+We made two changes (``prot`` and ``flags``) to ``mmap()`` related to
+memory sealing.
+
+**prot**:
+
+- ``PROT_SEAL_SEAL``
+- ``PROT_SEAL_BASE``
+- ``PROT_SEAL_PROT_PKEY``
+- ``PROT_SEAL_DISCARD_RO_ANON``
+
+Allow ``mmap()`` to set the sealing type when creating a mapping. This is
+useful for optimization because it avoids having to make two system
+calls: one for ``mmap()`` and one for ``mseal()``.
+
+It's worth noting that even though the sealing type is set via the
+``prot`` field in ``mmap()``, we don't require it to be set in the ``prot``
+field in later ``mprotect()`` call. This is unlike the ``PROT_READ``,
+``PROT_WRITE``, ``PROT_EXEC`` bits, e.g. if ``PROT_WRITE`` is not set in
+``mprotect()``, it means that the region is not writable.
+
+**flags**
+The ``MAP_SEALABLE`` flag is added to the ``flags`` field of ``mmap()``.
+When present, it marks the map as sealable. A map created
+without ``MAP_SEALABLE`` will not support sealing; In other words,
+``mseal()`` will fail for such a map.
+
+Applications that don't care about sealing will expect their
+behavior unchanged. For those that need sealing support, opt-in
+by adding ``MAP_SEALABLE`` when creating the map.
+
+Use Case:
+=========
+- glibc:
+  The dymamic linker, during loading ELF executables, can apply sealing to
+  to non-writeable memory segments.
+
+- Chrome browser: protect some security sensitive data-structures.
+
+Additional notes:
+=================
+As Jann Horn pointed out in [3], there are still a few ways to write
+to RO memory, which is, in a way, by design. Those are not covered by
+``mseal()``. If applications want to block such cases, sandboxer
+(such as seccomp, LSM, etc) might be considered.
+
+Those cases are:
+
+- Write to read-only memory through ``/proc/self/mem`` interface.
+
+- Write to read-only memory through ``ptrace`` (such as ``PTRACE_POKETEXT``).
+
+- ``userfaultfd()``.
+
+The idea that inspired this patch comes from Stephen Röttger’s work in V8
+CFI [4].Chrome browser in ChromeOS will be the first user of this API.
+
+Reference:
+==========
+[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
+
+[2] https://man.openbsd.org/mimmutable.2
+
+[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
+
+[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
-- 
2.43.0.472.g3155946c3a-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-12 23:17 ` [RFC PATCH v3 11/11] mseal:add documentation jeffxu
@ 2023-12-13  0:39   ` Linus Torvalds
  2023-12-14  0:35     ` Jeff Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-12-13  0:39 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, sroettger, willy, gregkh, jeffxu, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote:
> +
> +**types**: bit mask to specify the sealing types, they are:

I really want a real-life use-case for more than one bit of "don't modify".

IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?

Or when would you *ever* say "seal this area, but mprotect()" is ok.

IOW, I want to know why we don't just do the BSD immutable thing, and
why we need this multi-level sealing thing.

               Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 01/11] mseal: Add mseal syscall.
  2023-12-12 23:16 ` [RFC PATCH v3 01/11] mseal: Add mseal syscall jeffxu
@ 2023-12-13  7:24   ` Greg KH
  0 siblings, 0 replies; 28+ messages in thread
From: Greg KH @ 2023-12-13  7:24 UTC (permalink / raw)
  To: jeffxu
  Cc: akpm, keescook, jannh, sroettger, willy, torvalds, jeffxu,
	jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt

On Tue, Dec 12, 2023 at 11:16:55PM +0000, jeffxu@chromium.org wrote:
> +config MSEAL
> +	default n

Minor nit, "n" is always the default, no need to call it out here.

> +	bool "Enable mseal() system call"
> +	depends on MMU
> +	help
> +	  Enable the virtual memory sealing.
> +	  This feature allows sealing each virtual memory area separately with
> +	  multiple sealing types.

You might want to include more documentation as to what this is for,
otherwise distros / users will not know if they need to enable this
or not.

thanks,

greg k-h


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-13  0:39   ` Linus Torvalds
@ 2023-12-14  0:35     ` Jeff Xu
  2023-12-14  1:09       ` Theo de Raadt
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jeff Xu @ 2023-12-14  0:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: jeffxu, akpm, keescook, jannh, sroettger, willy, gregkh, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

On Tue, Dec 12, 2023 at 4:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote:
> > +
> > +**types**: bit mask to specify the sealing types, they are:
>
> I really want a real-life use-case for more than one bit of "don't modify".
>
For the real-life use case question, Stephen Röttger and I put
description in the cover letter as well as the open discussion section
(mseal() vs immutable()) of patch 0/11.  Perhaps you are looking for more
details in chrome usage of the API, e.g. code-wise ?

> IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
>
The MADV_DONTNEED is OK for file-backed mapping.
As state in man page of madvise: [1]

"subsequent accesses of pages in the range will succeed,  but will
result in either repopulating the memory contents from the up-to-date
contents of the underlying mapped file"

> Or when would you *ever* say "seal this area, but mprotect()" is ok.
>
The fact  that openBSD allows RW=>RO transaction, as in its man page [2]

 "  At present, mprotect(2) may reduce permissions on immutable pages
  marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."

suggests application might desire multiple ways to seal the "PROT" bits.

E.g.
Applications that wants a full lockdown of PROT and PKEY might use
SEAL_PROT_PKEY (Chrome case and implemented in this patch)

Application that desires RW=>RO transaction, might implement
SEAL_PROT_DOWNGRADEABLE, or specifically allow RW=>RO.
(not implemented but can be added in future as extension if  needed.)

> IOW, I want to know why we don't just do the BSD immutable thing, and
> why we need this multi-level sealing thing.
>
The details are discussed in mseal() vs immutable()) of the cover letter
(patch 0/11)

In short, BSD's immutable is designed specific for libc case, and Chrome
case is just different (e.g. the lifetime of those mappings and requirement of
free/discard unused memory).

Single bit vs multi-bits are still up for discussion.
If there are strong opinions on the multiple-bits approach, (and
no objection on applying MM_SEAL_DISCARD_RO_ANON to the .text part
during libc dynamic loading, which has no effect anyway because it is
file backed.), we could combine all three bits into one. A side note is that we
could not add something such as SEAL_PROT_DOWNGRADEABLE later,
since pkey_mprotect is sealed.

I'm open to one bit approach. If we took that approach,
We might consider the following:

mseal() or
mseal(flags), flags are reserved for future use.

I appreciate a direction on this.

 [1] https://man7.org/linux/man-pages/man2/madvise.2.html
 [2] https://man.openbsd.org/mimmutable.2

-Jeff



>                Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14  0:35     ` Jeff Xu
@ 2023-12-14  1:09       ` Theo de Raadt
  2023-12-14  1:31       ` Linus Torvalds
  2023-12-14 15:04       ` Theo de Raadt
  2 siblings, 0 replies; 28+ messages in thread
From: Theo de Raadt @ 2023-12-14  1:09 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, jeffxu, akpm, keescook, jannh, sroettger, willy,
	gregkh, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

Jeff Xu <jeffxu@google.com> wrote:

> > Or when would you *ever* say "seal this area, but mprotect()" is ok.
> >
> The fact  that openBSD allows RW=>RO transaction, as in its man page [2]
> 
>  "  At present, mprotect(2) may reduce permissions on immutable pages
>   marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."

Let me explain this.

We encountered two places that needed this less-permission-transition.

Both of these problems were found in either .data or bss, which the
kernel makes immutable by default.  The OpenBSD kernel makes those
regions immutable BY DEFAULT, and there is no way to turn that off.

One was in our libc malloc, which after initialization, wants to protect
a control data structure from being written in the future.

The other was in chrome v8, for the v8_flags variable, this is
similarily mprotected to less permission after initialization to avoid
tampering (because it's an amazing relative-address located control
gadget).

We introduced a different mechanism to solve these problem.

So we added a new ELF section which annotates objects you need to be
MUTABLE.  If these are .data or .bss, they are placed in the MUTABLE
region annotated with the following Program Header:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  OPENBSD_MUTABLE 0x0e9000 0x00000000000ec000 0x00000000000ec000 0x001000 0x001000 RW  0x1000

associated with this Section Header

  [20] .openbsd.mutable  PROGBITS        00000000000ec000 0e9000 001000 00  WA  0   0 4096

(It is vaguely similar to RELRO).

You can place objects there using the a compiler __attribute__((section
declaration, like this example from our libc/malloc.c code

static union {
        struct malloc_readonly mopts;
        u_char _pad[MALLOC_PAGESIZE];
} malloc_readonly __attribute__((aligned(MALLOC_PAGESIZE)))
                __attribute__((section(".openbsd.mutable")));

During startup the code can set the protection and then the immutability
of the object correctly.

Since we have no purpose left for this permission reduction semantic
upon immutable mappings, we may be deleting that behaviour in the
future.  I wrote that code, because I needed it to make progress with some
difficult pieces of code.  But we found a better way.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14  0:35     ` Jeff Xu
  2023-12-14  1:09       ` Theo de Raadt
@ 2023-12-14  1:31       ` Linus Torvalds
  2023-12-14 18:06         ` Stephen Röttger
  2023-12-14 15:04       ` Theo de Raadt
  2 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-12-14  1:31 UTC (permalink / raw)
  To: Jeff Xu
  Cc: jeffxu, akpm, keescook, jannh, sroettger, willy, gregkh, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
>
>
> > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> >
> The MADV_DONTNEED is OK for file-backed mapping.

Right. It makes no semantic difference. So there's no point to it.

My point was that you added this magic flag for "not ok for RO anon mapping".

It's such a *completely* random flag, that I go "that's just crazy
random - make sealing _always_ disallow that case".

So what I object to in this series is basically random small details
that should just eb part of the basic act of sealing.

I think sealing should just mean "you can't do any operations that
have semantic meaning for the mapping, because it is SEALED".

So I think sealing should automatically mean "can't do MADV_DONTNEED
on anon memory", because that's basically equivalent to a munmap/remap
operation.

I also think that sealing should just automatically mean "can't do
mprotect any more".

And yes, the OpenBSD semantics of "immutable" apparently allowed
reducing permissions, but even the openbsd man-page seems to think
that was a bug, so we should just not allow it. And the openbsd case
seems to be because of how they made certain things immutable by
default, which is different from what this mseal() thing is.

End result: I'd really like to make the thing conceptually simpler,
rather than add all those random (*very* random in case of
MADV_DONTNEED) special cases.

Is there any actual practical example of why you'd want a half-sealed thing?

And no, I didn't read the pdf that was attached. If it can't just be
explained in plain language, it's not an explanation.

I'd love for "sealed" to be just a single bit in the vm_flags things
that we already have. Not a config option. Not some complicated thing
that is hard to explain. A simple "I have set up this mapping, you
can't change it any more".

And if it cannot be that kind of thing, I want to have clear and
obvious examples of why it can't be that simple thing.

Not a pdf file that describes some google-chrome design. Something
down-to-earth and practical (and not a "we might want this in the
future" thing either).

IOW, what is wrong with "THIS VMA SETUP CANNOT BE CHANGED ANY MORE"?

Nothing less, but also nothing more. No random odd bits that need explaining.

              Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14  0:35     ` Jeff Xu
  2023-12-14  1:09       ` Theo de Raadt
  2023-12-14  1:31       ` Linus Torvalds
@ 2023-12-14 15:04       ` Theo de Raadt
  2 siblings, 0 replies; 28+ messages in thread
From: Theo de Raadt @ 2023-12-14 15:04 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, jeffxu, akpm, keescook, jannh, sroettger, willy,
	gregkh, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening

Jeff Xu <jeffxu@google.com> wrote:

> In short, BSD's immutable is designed specific for libc case, and Chrome
> case is just different (e.g. the lifetime of those mappings and requirement of
> free/discard unused memory).

That is not true.  During the mimmutable design I took the entire
software ecosystem into consideration.  Not just libc.  That is either
uncharitable or uninformed.

In OpenBSD, pretty much the only thing which calls mimmutable() is the
shared library linker, which does so on all possible regions of all DSO
objects, not just libc.

For example, chrome loads 96 libraries, and all their text/data/bss/etc
are immutable. All the static address space is immutable.  It's the same
for all other programs running in OpenBSD -- only transient heap and
mmap spaces remain permission mutable.

It is not just libc.

What you are trying to do here with chrome is bring some sort of
soft-immutable management to regions of memory, so that trusted parts of
chrome can still change the permissions, but untrusted / gadgetry parts
of chrome cannot change the permissions.  That's a very different thing
than what I set out to do with mimmutable().  I'm not aware of any other
piece of software that needs this.  I still can't wrap my head around
the assurance model of the design. 

Maybe it is time to stop comparing mseal() to mimmutable().

Also, maybe this proposal should be using the name chromesyscall()
instead -- then it could be extended indefinitely in the future...


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14  1:31       ` Linus Torvalds
@ 2023-12-14 18:06         ` Stephen Röttger
  2023-12-14 20:11           ` Pedro Falcato
  2023-12-14 20:14           ` Linus Torvalds
  0 siblings, 2 replies; 28+ messages in thread
From: Stephen Röttger @ 2023-12-14 18:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, jeffxu, akpm, keescook, jannh, willy, gregkh, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

[-- Attachment #1: Type: text/plain, Size: 1952 bytes --]

On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
> >
> >
> > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> > >
> > The MADV_DONTNEED is OK for file-backed mapping.
>
> Right. It makes no semantic difference. So there's no point to it.
>
> My point was that you added this magic flag for "not ok for RO anon mapping".
>
> It's such a *completely* random flag, that I go "that's just crazy
> random - make sealing _always_ disallow that case".
>
> So what I object to in this series is basically random small details
> that should just eb part of the basic act of sealing.
>
> I think sealing should just mean "you can't do any operations that
> have semantic meaning for the mapping, because it is SEALED".
>
> So I think sealing should automatically mean "can't do MADV_DONTNEED
> on anon memory", because that's basically equivalent to a munmap/remap
> operation.

In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.
We have a pkey-tagged heap and code region for JIT code. The regions are
writable by page permissions, but we use the pkey to control write access.
These regions are mmapped at process startup and we want to seal them to ensure
that the pkey and page permissions can't change.
Since these regions are used for dynamic allocations, we still need a way to
release unneeded resources, i.e. madvise(DONTNEED) unused pages on free().

AIUI, the madvise(DONTNEED) should effectively only change the content of
anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
added this special case: if you want to madvise(DONTNEED) an anonymous page,
you should have write permissions to the page.

In our allocator, on free we can then release resources via:
* allow pkey writes
* madvise(DONTNEED)
* disallow pkey writes

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14 18:06         ` Stephen Röttger
@ 2023-12-14 20:11           ` Pedro Falcato
  2023-12-14 20:14           ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: Pedro Falcato @ 2023-12-14 20:11 UTC (permalink / raw)
  To: Stephen Röttger
  Cc: Linus Torvalds, Jeff Xu, jeffxu, akpm, keescook, jannh, willy,
	gregkh, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	dave.hansen, linux-hardening, deraadt

On Thu, Dec 14, 2023 at 6:07 PM Stephen Röttger <sroettger@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
> > >
> > >
> > > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> > > >
> > > The MADV_DONTNEED is OK for file-backed mapping.
> >
> > Right. It makes no semantic difference. So there's no point to it.
> >
> > My point was that you added this magic flag for "not ok for RO anon mapping".
> >
> > It's such a *completely* random flag, that I go "that's just crazy
> > random - make sealing _always_ disallow that case".
> >
> > So what I object to in this series is basically random small details
> > that should just eb part of the basic act of sealing.
> >
> > I think sealing should just mean "you can't do any operations that
> > have semantic meaning for the mapping, because it is SEALED".
> >
> > So I think sealing should automatically mean "can't do MADV_DONTNEED
> > on anon memory", because that's basically equivalent to a munmap/remap
> > operation.
>
> In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.

I don't want to be that guy (*believe me*), but what if there was a
way to attach BPF programs to mm's? Such that you could handle 'seal
failures' in BPF, and thus allow for this sort of weird semantics?
e.g: madvise(MADV_DONTNEED) on a sealed region fails, kernel invokes
the BPF program (that chrome loaded), BPF program sees it was a
MADV_DONTNEED and allows it to proceed.

It requires BPF but sounds like a good compromise in order to not get
an ugly API?

-- 
Pedro


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14 18:06         ` Stephen Röttger
  2023-12-14 20:11           ` Pedro Falcato
@ 2023-12-14 20:14           ` Linus Torvalds
  2023-12-14 22:52             ` Jeff Xu
  1 sibling, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2023-12-14 20:14 UTC (permalink / raw)
  To: Stephen Röttger
  Cc: Jeff Xu, jeffxu, akpm, keescook, jannh, willy, gregkh, jorgelo,
	groeck, linux-kernel, linux-kselftest, linux-mm, pedro.falcato,
	dave.hansen, linux-hardening, deraadt

On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote:
>
> AIUI, the madvise(DONTNEED) should effectively only change the content of
> anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
> added this special case: if you want to madvise(DONTNEED) an anonymous page,
> you should have write permissions to the page.

Hmm. I actually would be happier if we just made that change in
general. Maybe even without sealing, but I agree that it *definitely*
makes sense in general as a sealing thing.

IOW, just saying

 "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"

makes 100% sense to me. Having a separate _flag_ to give sensible
semantics is just odd.

IOW, what I really want is exactly that "sensible semantics, not random flags".

Particularly for new system calls with fairly specialized use, I think
it's very important that the semantics are sensible on a conceptual
level, and that we do not add system calls that are based on "random
implementation issue of the day".

Yes, yes, then as we have to maintain things long-term, and we hit
some compatibility issue, at *THAT* point we'll end up facing nasty
"we had an implementation that had these semantics in practice, so now
we're stuck with it", but when introducing a new system call, let's
try really hard to start off from those kinds of random things.

Wouldn't it be lovely if we can just come up with a sane set of "this
is what it means to seal a vma", and enumerate those, and make those
sane conceptual rules be the initial definition. By all means have a
"flags" argument for future cases when we figure out there was
something wrong or the notion needed to be extended, but if we already
*start* with random extensions, I feel there's something wrong with
the whole concept.

So I would really wish for the first version of

     mseal(start, len, flags);

to have "flags=0" be the one and only case we actually handle
initially, and only add a single PROT_SEAL flag to mmap() that says
"create this mapping already pre-sealed".

Strive very hard to make sealing be a single VM_SEALED flag in the
vma->vm_flags that we already have, just admit that none of this
matters on 32-bit architectures, so that VM_SEALED can just use one of
the high flags that we have several free of (and that pkeys already
depends on), and make this a standard feature with no #ifdef's.

Can chrome live with that? And what would the required semantics be?
I'll start the list:

 - you can't unmap or remap in any way (including over-mapping)

 - you can't change protections (but with architecture support like
pkey, you can obviously change the protections indirectly with PKRU
etc)

 - you can't do VM operations that change data without the area being
writable (so the DONTNEED case - maybe there are others)

 - anything else?

Wouldn't it be lovely to have just a single notion of sealing that is
well-documented and makes sense, and doesn't require people to worry
about odd special cases?

And yes, we'd have the 'flags' argument for future special cases, and
hope really hard that it's never needed.

           Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14 20:14           ` Linus Torvalds
@ 2023-12-14 22:52             ` Jeff Xu
  2024-01-20 15:23               ` Theo de Raadt
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Xu @ 2023-12-14 22:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Stephen Röttger, Jeff Xu, akpm, keescook, jannh, willy,
	gregkh, jorgelo, groeck, linux-kernel, linux-kselftest, linux-mm,
	pedro.falcato, dave.hansen, linux-hardening, deraadt

On Thu, Dec 14, 2023 at 12:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote:
> >
> > AIUI, the madvise(DONTNEED) should effectively only change the content of
> > anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
> > added this special case: if you want to madvise(DONTNEED) an anonymous page,
> > you should have write permissions to the page.
>
> Hmm. I actually would be happier if we just made that change in
> general. Maybe even without sealing, but I agree that it *definitely*
> makes sense in general as a sealing thing.
>
> IOW, just saying
>
>  "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"
>
> makes 100% sense to me. Having a separate _flag_ to give sensible
> semantics is just odd.
>
> IOW, what I really want is exactly that "sensible semantics, not random flags".
>
> Particularly for new system calls with fairly specialized use, I think
> it's very important that the semantics are sensible on a conceptual
> level, and that we do not add system calls that are based on "random
> implementation issue of the day".
>
> Yes, yes, then as we have to maintain things long-term, and we hit
> some compatibility issue, at *THAT* point we'll end up facing nasty
> "we had an implementation that had these semantics in practice, so now
> we're stuck with it", but when introducing a new system call, let's
> try really hard to start off from those kinds of random things.
>
> Wouldn't it be lovely if we can just come up with a sane set of "this
> is what it means to seal a vma", and enumerate those, and make those
> sane conceptual rules be the initial definition. By all means have a
> "flags" argument for future cases when we figure out there was
> something wrong or the notion needed to be extended, but if we already
> *start* with random extensions, I feel there's something wrong with
> the whole concept.
>
> So I would really wish for the first version of
>
>      mseal(start, len, flags);
>
> to have "flags=0" be the one and only case we actually handle
> initially, and only add a single PROT_SEAL flag to mmap() that says
> "create this mapping already pre-sealed".
>
> Strive very hard to make sealing be a single VM_SEALED flag in the
> vma->vm_flags that we already have, just admit that none of this
> matters on 32-bit architectures, so that VM_SEALED can just use one of
> the high flags that we have several free of (and that pkeys already
> depends on), and make this a standard feature with no #ifdef's.
>
> Can chrome live with that? And what would the required semantics be?
> I'll start the list:
>
>  - you can't unmap or remap in any way (including over-mapping)
>
>  - you can't change protections (but with architecture support like
> pkey, you can obviously change the protections indirectly with PKRU
> etc)
>
>  - you can't do VM operations that change data without the area being
> writable (so the DONTNEED case - maybe there are others)
>
>  - anything else?
>
> Wouldn't it be lovely to have just a single notion of sealing that is
> well-documented and makes sense, and doesn't require people to worry
> about odd special cases?
>
> And yes, we'd have the 'flags' argument for future special cases, and
> hope really hard that it's never needed.
>

Yes, those inputs make a lot of sense !
I will start the next version. In the meantime, if there are more
comments, please continue to post, I appreciate those to make the API
better and simple to use.

-Jeff

>            Linus
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 10/11] selftest mm/mseal memory sealing
  2023-12-12 23:17 ` [RFC PATCH v3 10/11] selftest mm/mseal memory sealing jeffxu
@ 2023-12-31  6:39   ` Muhammad Usama Anjum
  0 siblings, 0 replies; 28+ messages in thread
From: Muhammad Usama Anjum @ 2023-12-31  6:39 UTC (permalink / raw)
  To: jeffxu
  Cc: Muhammad Usama Anjum, jeffxu, jorgelo, groeck, jannh,
	linux-kernel, linux-kselftest, linux-mm, willy, pedro.falcato,
	keescook, dave.hansen, gregkh, linux-hardening, deraadt, akpm,
	torvalds, sroettger

I wasn't CC-ed on the patch even though I'd reviewed the earlier revision.


On 12/13/23 4:17 AM, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> selftest for memory sealing change in mmap() and mseal().
> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  tools/testing/selftests/mm/.gitignore   |    1 +
>  tools/testing/selftests/mm/Makefile     |    1 +
>  tools/testing/selftests/mm/config       |    1 +
>  tools/testing/selftests/mm/mseal_test.c | 2141 +++++++++++++++++++++++
>  4 files changed, 2144 insertions(+)
>  create mode 100644 tools/testing/selftests/mm/mseal_test.c
> 
> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index cdc9ce4426b9..f0f22a649985 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -43,3 +43,4 @@ mdwe_test
>  gup_longterm
>  mkdirty
>  va_high_addr_switch
> +mseal_test
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index 6a9fc5693145..0c086cecc093 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -59,6 +59,7 @@ TEST_GEN_FILES += mlock2-tests
>  TEST_GEN_FILES += mrelease_test
>  TEST_GEN_FILES += mremap_dontunmap
>  TEST_GEN_FILES += mremap_test
> +TEST_GEN_FILES += mseal_test
>  TEST_GEN_FILES += on-fault-limit
>  TEST_GEN_FILES += thuge-gen
>  TEST_GEN_FILES += transhuge-stress
> diff --git a/tools/testing/selftests/mm/config b/tools/testing/selftests/mm/config
> index be087c4bc396..cf2b8780e9b1 100644
> --- a/tools/testing/selftests/mm/config
> +++ b/tools/testing/selftests/mm/config
> @@ -6,3 +6,4 @@ CONFIG_TEST_HMM=m
>  CONFIG_GUP_TEST=y
>  CONFIG_TRANSPARENT_HUGEPAGE=y
>  CONFIG_MEM_SOFT_DIRTY=y
> +CONFIG_MSEAL=y
> diff --git a/tools/testing/selftests/mm/mseal_test.c b/tools/testing/selftests/mm/mseal_test.c
> new file mode 100644
> index 000000000000..0692485d8b3c
> --- /dev/null
> +++ b/tools/testing/selftests/mm/mseal_test.c
> @@ -0,0 +1,2141 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include <sys/mman.h>
> +#include <stdint.h>
> +#include <unistd.h>
> +#include <string.h>
> +#include <sys/time.h>
> +#include <sys/resource.h>
> +#include <stdbool.h>
> +#include "../kselftest.h"
> +#include <syscall.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <assert.h>
> +#include <fcntl.h>
> +#include <assert.h>
> +#include <sys/ioctl.h>
> +#include <sys/vfs.h>
> +#include <sys/stat.h>
> +
> +/*
> + * need those definition for manually build using gcc.
> + * gcc -I ../../../../usr/include   -DDEBUG -O3  -DDEBUG -O3 mseal_test.c -o mseal_test
> + */
> +#ifndef MM_SEAL_SEAL
> +#define MM_SEAL_SEAL 0x1
> +#endif
> +
> +#ifndef MM_SEAL_BASE
> +#define MM_SEAL_BASE 0x2
> +#endif
> +
> +#ifndef MM_SEAL_PROT_PKEY
> +#define MM_SEAL_PROT_PKEY 0x4
> +#endif
> +
> +#ifndef MM_SEAL_DISCARD_RO_ANON
> +#define MM_SEAL_DISCARD_RO_ANON 0x8
> +#endif
> +
> +#ifndef MAP_SEALABLE
> +#define MAP_SEALABLE 0x8000000
> +#endif
> +
> +#ifndef PROT_SEAL_SEAL
> +#define PROT_SEAL_SEAL 0x04000000
> +#endif
> +
> +#ifndef PROT_SEAL_BASE
> +#define PROT_SEAL_BASE 0x08000000
> +#endif
> +
> +#ifndef PROT_SEAL_PROT_PKEY
> +#define PROT_SEAL_PROT_PKEY 0x10000000
> +#endif
> +
> +#ifndef PROT_SEAL_DISCARD_RO_ANON
> +#define PROT_SEAL_DISCARD_RO_ANON 0x20000000
> +#endif
> +
> +#ifndef PKEY_DISABLE_ACCESS
> +# define PKEY_DISABLE_ACCESS    0x1
> +#endif
> +
> +#ifndef PKEY_DISABLE_WRITE
> +# define PKEY_DISABLE_WRITE     0x2
> +#endif
> +
> +#ifndef PKEY_BITS_PER_KEY
> +#define PKEY_BITS_PER_PKEY      2
> +#endif
> +
> +#ifndef PKEY_MASK
> +#define PKEY_MASK       (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
> +#endif
> +
> +#ifndef DEBUG
> +#define LOG_TEST_ENTER()	{}
> +#else
> +#define LOG_TEST_ENTER()	{ksft_print_msg("%s\n", __func__); }
> +#endif
> +
> +#ifndef u64
> +#define u64 unsigned long long
> +#endif
> +
> +/*
> + * define sys_xyx to call syscall directly.
> + */
> +static int sys_mseal(void *start, size_t len, int types)
> +{
> +	int sret;
> +
> +	errno = 0;
> +	sret = syscall(__NR_mseal, start, len, types, 0);
> +	return sret;
> +}
> +
> +int sys_mprotect(void *ptr, size_t size, unsigned long prot)
> +{
> +	int sret;
> +
> +	errno = 0;
> +	sret = syscall(SYS_mprotect, ptr, size, prot);
> +	return sret;
> +}
> +
> +int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
> +		unsigned long pkey)
> +{
> +	int sret;
> +
> +	errno = 0;
> +	sret = syscall(__NR_pkey_mprotect, ptr, size, orig_prot, pkey);
> +	return sret;
> +}
> +
> +int sys_munmap(void *ptr, size_t size)
> +{
> +	int sret;
> +
> +	errno = 0;
> +	sret = syscall(SYS_munmap, ptr, size);
> +	return sret;
> +}
> +
> +static int sys_madvise(void *start, size_t len, int types)
> +{
> +	int sret;
> +
> +	errno = 0;
> +	sret = syscall(__NR_madvise, start, len, types);
> +	return sret;
> +}
> +
> +int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
> +{
> +	int ret = syscall(SYS_pkey_alloc, flags, init_val);
Add empty line here.

> +	return ret;
> +}
> +
> +static inline unsigned int __read_pkey_reg(void)
> +{
> +	unsigned int eax, edx;
> +	unsigned int ecx = 0;
> +	unsigned int pkey_reg;
> +
> +	asm volatile(".byte 0x0f,0x01,0xee\n\t"
> +			: "=a" (eax), "=d" (edx)
> +			: "c" (ecx));
> +	pkey_reg = eax;
> +	return pkey_reg;
> +}
> +
> +static inline void __write_pkey_reg(u64 pkey_reg)
> +{
> +	unsigned int eax = pkey_reg;
> +	unsigned int ecx = 0;
> +	unsigned int edx = 0;
> +
> +	asm volatile(".byte 0x0f,0x01,0xef\n\t"
> +			: : "a" (eax), "c" (ecx), "d" (edx));
> +	assert(pkey_reg == __read_pkey_reg());
> +}
> +
> +static inline unsigned long pkey_bit_position(int pkey)
> +{
> +	return pkey * PKEY_BITS_PER_PKEY;
> +}
> +
> +static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags)
> +{
> +	unsigned long shift = pkey_bit_position(pkey);
> +	/* mask out bits from pkey in old value */
> +	reg &= ~((u64)PKEY_MASK << shift);
> +	/* OR in new bits for pkey */
> +	reg |= (flags & PKEY_MASK) << shift;
> +	return reg;
> +}
> +
> +static inline void set_pkey(int pkey, unsigned long pkey_value)
> +{
> +	unsigned long mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
> +	u64 new_pkey_reg;
> +
> +	assert(!(pkey_value & ~mask));
> +	new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value);
> +	__write_pkey_reg(new_pkey_reg);
> +}
> +
> +void setup_single_address(int size, void **ptrOut)
> +{
> +	void *ptr;
> +
> +	ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_SEALABLE, -1, 0);
> +	assert(ptr != (void *)-1);
> +	*ptrOut = ptr;
> +}
> +
> +void setup_single_address_sealable(int size, void **ptrOut, bool sealable)
> +{
> +	void *ptr;
> +	unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
> +
> +	if (sealable)
> +		mapflags |= MAP_SEALABLE;
> +
> +	ptr = mmap(NULL, size, PROT_READ, mapflags, -1, 0);
> +	assert(ptr != (void *)-1);
> +	*ptrOut = ptr;
> +}
> +
> +void setup_single_address_rw_sealable(int size, void **ptrOut, bool sealable)
> +{
> +	void *ptr;
> +	unsigned long mapflags = MAP_ANONYMOUS | MAP_PRIVATE;
> +
> +	if (sealable)
> +		mapflags |= MAP_SEALABLE;
> +
> +	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, mapflags, -1, 0);
> +	assert(ptr != (void *)-1);
> +	*ptrOut = ptr;
> +}
> +
> +void clean_single_address(void *ptr, int size)
> +{
> +	int ret;
> +
> +	ret = munmap(ptr, size);
> +	assert(!ret);
> +}
> +
> +void seal_mprotect_single_address(void *ptr, int size)
> +{
> +	int ret;
> +
> +	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
> +	assert(!ret);
> +}
> +
> +void seal_discard_ro_anon_single_address(void *ptr, int size)
> +{
> +	int ret;
> +
> +	ret = sys_mseal(ptr, size, MM_SEAL_DISCARD_RO_ANON);
> +	assert(!ret);
> +}
> +
> +static void test_seal_addseals(void)
> +{
> +	LOG_TEST_ENTER();
> +	int ret;
> +	void *ptr;
> +	unsigned long page_size = getpagesize();
> +	unsigned long size = 4 * page_size;
> +
> +	setup_single_address(size, &ptr);
> +
> +	/* adding seal one by one */
> +
> +	ret = sys_mseal(ptr, size, MM_SEAL_BASE);
> +	assert(!ret);
> +	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
> +	assert(!ret);
> +	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
> +	assert(!ret);
> +}
> +
> +static void test_seal_addseals_combined(void)
> +{
> +	LOG_TEST_ENTER();
> +	int ret;
> +	void *ptr;
> +	unsigned long page_size = getpagesize();
> +	unsigned long size = 4 * page_size;
> +
> +	setup_single_address(size, &ptr);
> +
> +	ret = sys_mseal(ptr, size, MM_SEAL_PROT_PKEY);
> +	assert(!ret);
> +
> +	/* adding multiple seals */
> +	ret = sys_mseal(ptr, size,
> +			MM_SEAL_PROT_PKEY | MM_SEAL_BASE|
> +			MM_SEAL_SEAL);
> +	assert(!ret);
> +
> +	/* not adding more seal type, so ok. */
> +	ret = sys_mseal(ptr, size, MM_SEAL_BASE);
> +	assert(!ret);
> +
> +	/* not adding more seal type, so ok. */
> +	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
> +	assert(!ret);
> +}
> +
> +static void test_seal_addseals_reject(void)
> +{
> +	LOG_TEST_ENTER();
> +	int ret;
> +	void *ptr;
> +	unsigned long page_size = getpagesize();
> +	unsigned long size = 4 * page_size;
> +
> +	setup_single_address(size, &ptr);
> +
> +	ret = sys_mseal(ptr, size, MM_SEAL_BASE | MM_SEAL_SEAL);
> +	assert(!ret);
> +
> +	/* MM_SEAL_SEAL is set, so not allow new seal type . */
> +	ret = sys_mseal(ptr, size,
> +			MM_SEAL_PROT_PKEY | MM_SEAL_BASE | MM_SEAL_SEAL);
> +	assert(ret < 0);
> +}
> +
> +static void test_seal_unmapped_start(void)
> +{
> +	LOG_TEST_ENTER();
> +	int ret;
> +	void *ptr;
> +	unsigned long page_size = getpagesize();
> +	unsigned long size = 4 * page_size;
> +
> +	setup_single_address(size, &ptr);
> +
> +	// munmap 2 pages from ptr.
Don't use different commenting styles in one file. Use /* */ for comments.

> +	ret = sys_munmap(ptr, 2 * page_size);
> +	assert(!ret);
> +
> +	// mprotect will fail because 2 pages from ptr are unmapped.
> +	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
> +	assert(ret < 0);
> +
> +	// mseal will fail because 2 pages from ptr are unmapped.
> +	ret = sys_mseal(ptr, size, MM_SEAL_SEAL);
> +	assert(ret < 0);
> +
> +	ret = sys_mseal(ptr + 2 * page_size, 2 * page_size, MM_SEAL_SEAL);
> +	assert(!ret);
> +}
> +


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2023-12-14 22:52             ` Jeff Xu
@ 2024-01-20 15:23               ` Theo de Raadt
  2024-01-20 16:40                 ` Linus Torvalds
  0 siblings, 1 reply; 28+ messages in thread
From: Theo de Raadt @ 2024-01-20 15:23 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, Stephen Röttger, Jeff Xu, akpm, keescook,
	jannh, willy, gregkh, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening

Some notes about compatibility between mimmutable(2) and mseal().

This morning, the "RW -> R demotion" code in mimmutable(2) was removed.
As described previously, that was a development crutch to solved a problem
but we found a better way with a new ELF section which is available at
compile time with __attribute__((section(".openbsd.mutable"))).   Which
works great.

I am syncronizing the madvise / msync behaviour further, we will be compatible.
I have worried about madvise / msync for a long time, and audited vast amounts
of the software ecosystem to come to a conclusion we can be more strict, but
I never acted upon it.

BTW, on OpenBSD and probably other related BSD operating systems,
MADV_DONTNEED is non-destructive.  However we have a destructive
operation called MADV_FREE.  msync() MS_INVALIDATE is also destructive.
But all of these operations will now be prohibited, to syncronize the
error return value situation.

There is an one large difference remainig between mimmutable() and mseal(),
which is how other system calls behave.

We return EPERM for failures in all the system calls that fail upon
immutable memory (since Oct 2022).

You are returning EACESS.

Before it is too late, do you want to reconsider that return value, or
do you have a justification for the choice?

I think this remains the blocker which would prevent software from doing

#define mimmutable(addr, len)  mseal(addr, len, 0)


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2024-01-20 15:23               ` Theo de Raadt
@ 2024-01-20 16:40                 ` Linus Torvalds
  2024-01-20 16:59                   ` Theo de Raadt
  2024-01-21  0:16                   ` Jeff Xu
  0 siblings, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2024-01-20 16:40 UTC (permalink / raw)
  To: Theo de Raadt
  Cc: Jeff Xu, Stephen Röttger, Jeff Xu, akpm, keescook, jannh,
	willy, gregkh, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, pedro.falcato, dave.hansen, linux-hardening

On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> There is an one large difference remainig between mimmutable() and mseal(),
> which is how other system calls behave.
>
> We return EPERM for failures in all the system calls that fail upon
> immutable memory (since Oct 2022).
>
> You are returning EACESS.
>
> Before it is too late, do you want to reconsider that return value, or
> do you have a justification for the choice?

I don't think there's any real reason for the difference.

Jeff - mind changing the EACESS to EPERM, and we'll have something
that is more-or-less compatible between Linux and OpenBSD?

             Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2024-01-20 16:40                 ` Linus Torvalds
@ 2024-01-20 16:59                   ` Theo de Raadt
  2024-01-21  0:16                   ` Jeff Xu
  1 sibling, 0 replies; 28+ messages in thread
From: Theo de Raadt @ 2024-01-20 16:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Xu, Stephen Röttger, Jeff Xu, akpm, keescook, jannh,
	willy, gregkh, jorgelo, groeck, linux-kernel, linux-kselftest,
	linux-mm, pedro.falcato, dave.hansen, linux-hardening

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > There is an one large difference remainig between mimmutable() and mseal(),
> > which is how other system calls behave.
> >
> > We return EPERM for failures in all the system calls that fail upon
> > immutable memory (since Oct 2022).
> >
> > You are returning EACESS.
> >
> > Before it is too late, do you want to reconsider that return value, or
> > do you have a justification for the choice?
> 
> I don't think there's any real reason for the difference.
> 
> Jeff - mind changing the EACESS to EPERM, and we'll have something
> that is more-or-less compatible between Linux and OpenBSD?

(I tried to remember why I chose EPERM, replaying the view from the
German castle during kernel compiles...)

In mmap, EACCESS already means something.

     [EACCES]           The flag PROT_READ was specified as part of the prot
                        parameter and fd was not open for reading.  The flags
                        MAP_SHARED and PROT_WRITE were specified as part of
                        the flags and prot parameters and fd was not open for
                        writing.

In mprotect, the situation is similar

     [EACCES]           The process does not have sufficient access to the
                        underlying memory object to provide the requested
                        protection.

immutable isn't an aspect of the underlying object, but an aspect of the
mapping.

Anyways, it is common for one errno value to have multiple causes.

But this error-aliasing can make it harder to figure things out when
studying a "system call trace" a program, and I strongly believe in
keeping systems are simple as possible.

For all the memory mapping control operations, EPERM was available and
unambiguous.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2024-01-20 16:40                 ` Linus Torvalds
  2024-01-20 16:59                   ` Theo de Raadt
@ 2024-01-21  0:16                   ` Jeff Xu
  2024-01-21  0:43                     ` Theo de Raadt
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff Xu @ 2024-01-21  0:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theo de Raadt, Stephen Röttger, Jeff Xu, akpm, keescook,
	jannh, willy, gregkh, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening

On Sat, Jan 20, 2024 at 8:40 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > There is an one large difference remainig between mimmutable() and mseal(),
> > which is how other system calls behave.
> >
> > We return EPERM for failures in all the system calls that fail upon
> > immutable memory (since Oct 2022).
> >
> > You are returning EACESS.
> >
> > Before it is too late, do you want to reconsider that return value, or
> > do you have a justification for the choice?
>
> I don't think there's any real reason for the difference.
>
> Jeff - mind changing the EACESS to EPERM, and we'll have something
> that is more-or-less compatible between Linux and OpenBSD?
>
Sounds Good. I will make the necessary changes in the next version.

-Jeff

>              Linus


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH v3 11/11] mseal:add documentation
  2024-01-21  0:16                   ` Jeff Xu
@ 2024-01-21  0:43                     ` Theo de Raadt
  0 siblings, 0 replies; 28+ messages in thread
From: Theo de Raadt @ 2024-01-21  0:43 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Linus Torvalds, Stephen Röttger, Jeff Xu, akpm, keescook,
	jannh, willy, gregkh, jorgelo, groeck, linux-kernel,
	linux-kselftest, linux-mm, pedro.falcato, dave.hansen,
	linux-hardening

Jeff Xu <jeffxu@chromium.org> wrote:

> > Jeff - mind changing the EACESS to EPERM, and we'll have something
> > that is more-or-less compatible between Linux and OpenBSD?
> >
> Sounds Good. I will make the necessary changes in the next version.

Thanks!  That is so awesome!

On the OpenBSD side, I am close to landing our madvise / msync changes.

Then we are mostly in sync.

It was on my radar for a year, but delayed because I was ponderingn
blocking the destructive madvise / msync ops on regular non-writeable
pages.  These ops remain a page-zero gadget against regular (mutable)
readonly pages, and it bothers me.  I've heard rumour this has been used
in a nasty way, and I think the sloppily defined semantics could use
a strict modernization.


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-01-21  0:43 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-12 23:16 [RFC PATCH v3 00/11] Introduce mseal() jeffxu
2023-12-12 23:16 ` [RFC PATCH v3 01/11] mseal: Add mseal syscall jeffxu
2023-12-13  7:24   ` Greg KH
2023-12-12 23:16 ` [RFC PATCH v3 02/11] mseal: Wire up " jeffxu
2023-12-12 23:16 ` [RFC PATCH v3 03/11] mseal: add can_modify_mm and can_modify_vma jeffxu
2023-12-12 23:16 ` [RFC PATCH v3 04/11] mseal: add MM_SEAL_BASE jeffxu
2023-12-12 23:16 ` [RFC PATCH v3 05/11] mseal: add MM_SEAL_PROT_PKEY jeffxu
2023-12-12 23:17 ` [RFC PATCH v3 06/11] mseal: add sealing support for mmap jeffxu
2023-12-12 23:17 ` [RFC PATCH v3 07/11] mseal: make sealed VMA mergeable jeffxu
2023-12-12 23:17 ` [RFC PATCH v3 08/11] mseal: add MM_SEAL_DISCARD_RO_ANON jeffxu
2023-12-12 23:17 ` [RFC PATCH v3 09/11] mseal: add MAP_SEALABLE to mmap() jeffxu
2023-12-12 23:17 ` [RFC PATCH v3 10/11] selftest mm/mseal memory sealing jeffxu
2023-12-31  6:39   ` Muhammad Usama Anjum
2023-12-12 23:17 ` [RFC PATCH v3 11/11] mseal:add documentation jeffxu
2023-12-13  0:39   ` Linus Torvalds
2023-12-14  0:35     ` Jeff Xu
2023-12-14  1:09       ` Theo de Raadt
2023-12-14  1:31       ` Linus Torvalds
2023-12-14 18:06         ` Stephen Röttger
2023-12-14 20:11           ` Pedro Falcato
2023-12-14 20:14           ` Linus Torvalds
2023-12-14 22:52             ` Jeff Xu
2024-01-20 15:23               ` Theo de Raadt
2024-01-20 16:40                 ` Linus Torvalds
2024-01-20 16:59                   ` Theo de Raadt
2024-01-21  0:16                   ` Jeff Xu
2024-01-21  0:43                     ` Theo de Raadt
2023-12-14 15:04       ` Theo de Raadt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).