linux-hardening.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
@ 2023-05-15 13:05 jeffxu
  2023-05-15 13:05 ` [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag jeffxu
                   ` (7 more replies)
  0 siblings, 8 replies; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

This is the first set of Memory mapping (VMA) protection patches using PKU.

* * * 

Background:

As discussed previously in the kernel mailing list [1], V8 CFI [2] uses 
PKU to protect memory, and Stephen Röttger proposes to extend the PKU to 
memory mapping [3].

We're using PKU for in-process isolation to enforce control-flow integrity
for a JIT compiler. In our threat model, an attacker exploits a 
vulnerability and has arbitrary read/write access to the whole process
space concurrently to other threads being executed. This attacker can
manipulate some arguments to syscalls from some threads.

Under such a powerful attack, we want to create a “safe/isolated”
thread environment. We assign dedicated PKUs to this thread, 
and use those PKUs to protect the threads’ runtime environment.
The thread has exclusive access to its run-time memory. This
includes modifying the protection of the memory mapping, or
munmap the memory mapping after use. And the other threads
won’t be able to access the memory or modify the memory mapping
(VMA) belonging to the thread.

* * * 

Proposed changes:

This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc()
function. When a PKEY is created with this flag, it is enforced that any
thread that wants to make changes to the memory mapping (such as mprotect)
of the memory must have write access to the PKEY. PKEYs created without
this flag will continue to work as they do now, for backwards 
compatibility.

Only PKEY created from user space can have the new flag set, the PKEY
allocated by the kernel internally will not have it. In other words,
ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set,
and continue work as today.

This flag is checked only at syscall entry, such as mprotect/munmap in
this set of patches. It will not apply to other call paths. In other
words, if the kernel want to change attributes of VMA for some reasons,
the kernel is free to do that and not affected by this new flag.

This set of patch covers mprotect/munmap, I plan to work on other 
syscalls after this. 

* * * 

Testing:

I have tested this patch on a Linux kernel 5.15, 6,1, and 6.4-rc1,
new selftest is added in: pkey_enforce_api.c 

* * * 

Discussion:

We believe that this patch provides a valuable security feature. 
It allows us to create “safe/isolated” thread environments that are 
protected from attackers with arbitrary read/write access to 
the process space.

We believe that the interface change and the patch don't 
introduce backwards compatibility risk.

We would like to disucss this patch in Linux kernel community
for feedback and support. 

* * * 

Reference:

[1]https://lore.kernel.org/all/202208221331.71C50A6F@keescook/
[2]https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit?usp=sharing
[3]https://docs.google.com/document/d/1qqVoVfRiF2nRylL3yjZyCQvzQaej1HRPh3f5wj1AS9I/edit


Best Regards,
-Jeff Xu

Jeff Xu (6):
  PKEY: Introduce PKEY_ENFORCE_API flag
  PKEY: Add arch_check_pkey_enforce_api()
  PKEY: Apply PKEY_ENFORCE_API to mprotect
  PKEY:selftest pkey_enforce_api for mprotect
  KEY: Apply PKEY_ENFORCE_API to munmap
  PKEY:selftest pkey_enforce_api for munmap

 arch/powerpc/include/asm/pkeys.h              |   19 +-
 arch/x86/include/asm/mmu.h                    |    7 +
 arch/x86/include/asm/pkeys.h                  |   92 +-
 arch/x86/mm/pkeys.c                           |    2 +-
 include/linux/mm.h                            |    2 +-
 include/linux/pkeys.h                         |   18 +-
 include/uapi/linux/mman.h                     |    5 +
 mm/mmap.c                                     |   34 +-
 mm/mprotect.c                                 |   31 +-
 mm/mremap.c                                   |    6 +-
 tools/testing/selftests/mm/Makefile           |    1 +
 tools/testing/selftests/mm/pkey_enforce_api.c | 1312 +++++++++++++++++
 12 files changed, 1507 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pkey_enforce_api.c


base-commit: ba0ad6ed89fd5dada3b7b65ef2b08e95d449d4ab
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
@ 2023-05-15 13:05 ` jeffxu
  2023-05-16 23:14   ` Dave Hansen
  2023-05-15 13:05 ` [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api() jeffxu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc()
function. When a PKEY is created with this flag, it is enforced that any
thread that wants to make changes to the memory mapping (such as
mprotect/munmap) of the memory must have write access to the PKEY.
This is to prevent unauthorized access to protected memory.

PKEYs created without this flag will continue to work as they do now,
for backwards compatibility.

Signed-off-by: Jeff Xu<jeffxu@google.com>
---
 arch/powerpc/include/asm/pkeys.h | 11 ++++++++-
 arch/x86/include/asm/mmu.h       |  7 ++++++
 arch/x86/include/asm/pkeys.h     | 42 ++++++++++++++++++++++++++++++--
 arch/x86/mm/pkeys.c              |  2 +-
 include/linux/pkeys.h            |  9 ++++++-
 include/uapi/linux/mman.h        |  5 ++++
 mm/mprotect.c                    |  6 ++---
 7 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 59a2c7dbc78f..943333ac0fee 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -82,7 +82,7 @@ static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
  * Relies on the mmap_lock to protect against concurrency in mm_pkey_alloc() and
  * mm_pkey_free().
  */
-static inline int mm_pkey_alloc(struct mm_struct *mm)
+static inline int mm_pkey_alloc(struct mm_struct *mm, unsigned long flags)
 {
 	/*
 	 * Note: this is the one and only place we make sure that the pkey is
@@ -168,5 +168,14 @@ static inline bool arch_pkeys_enabled(void)
 	return mmu_has_feature(MMU_FTR_PKEY);
 }
 
+static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
+{
+	/* No flags supported yet. */
+	if (flags)
+		return false;
+
+	return true;
+}
+
 extern void pkey_mm_init(struct mm_struct *mm);
 #endif /*_ASM_POWERPC_KEYS_H */
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 0da5c227f490..d97594b44d9a 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -66,6 +66,13 @@ typedef struct {
 	 */
 	u16 pkey_allocation_map;
 	s16 execute_only_pkey;
+	/*
+	 * One bit per protection key.
+	 * When set, thread must have write permission on corresponding
+	 * PKRU in order to call memory mapping API, such as mprotect,
+	 * munmap, etc.
+	 */
+	u16 pkey_enforce_api_map;
 #endif
 } mm_context_t;
 
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2e6c04d8a45b..ecadf04a8251 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -51,6 +51,17 @@ static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
 	mm_pkey_allocation_map(mm) &= ~(1U << pkey);	\
 } while (0)
 
+#define mm_pkey_enforce_api_map(mm) (mm->context.pkey_enforce_api_map)
+#define mm_set_pkey_enforce_api(mm, pkey)                                      \
+	{                                                                      \
+		mm_pkey_enforce_api_map(mm) |= (1U << pkey);                   \
+	}
+
+#define mm_clear_pkey_enforce_api(mm, pkey)                                    \
+	{                                                                      \
+		mm_pkey_enforce_api_map(mm) &= ~(1U << pkey);                  \
+	}
+
 static inline
 bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
@@ -74,11 +85,25 @@ bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 	return mm_pkey_allocation_map(mm) & (1U << pkey);
 }
 
+/*
+ * Return true if the pkey has ENFORCE_API flag during allocation.
+ */
+static inline bool mm_pkey_enforce_api(struct mm_struct *mm, int pkey)
+{
+	/*
+	 * Only pkey created by user space has the flag.
+	 * execute_only_pkey check is in mm_pkey_is_allocated().
+	 */
+	if (pkey != ARCH_DEFAULT_PKEY && mm_pkey_is_allocated(mm, pkey))
+		return mm_pkey_enforce_api_map(mm) & (1U << pkey);
+
+	return false;
+}
+
 /*
  * Returns a positive, 4-bit key on success, or -1 on failure.
  */
-static inline
-int mm_pkey_alloc(struct mm_struct *mm)
+static inline int mm_pkey_alloc(struct mm_struct *mm, unsigned long flags)
 {
 	/*
 	 * Note: this is the one and only place we make sure
@@ -101,6 +126,9 @@ int mm_pkey_alloc(struct mm_struct *mm)
 
 	mm_set_pkey_allocated(mm, ret);
 
+	if (flags & PKEY_ENFORCE_API)
+		mm_set_pkey_enforce_api(mm, ret);
+
 	return ret;
 }
 
@@ -110,6 +138,7 @@ int mm_pkey_free(struct mm_struct *mm, int pkey)
 	if (!mm_pkey_is_allocated(mm, pkey))
 		return -EINVAL;
 
+	mm_clear_pkey_enforce_api(mm, pkey);
 	mm_set_pkey_free(mm, pkey);
 
 	return 0;
@@ -123,4 +152,13 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
+{
+	unsigned long valid_flags = PKEY_ENFORCE_API;
+
+	if (flags & ~valid_flags)
+		return false;
+
+	return true;
+}
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7418c367e328..a76981f44acf 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -20,7 +20,7 @@ int __execute_only_pkey(struct mm_struct *mm)
 	/* Do we need to assign a pkey for mm's execute-only maps? */
 	if (execute_only_pkey == -1) {
 		/* Go allocate one to use, which might fail */
-		execute_only_pkey = mm_pkey_alloc(mm);
+		execute_only_pkey = mm_pkey_alloc(mm, 0);
 		if (execute_only_pkey < 0)
 			return -1;
 		need_to_set_mm_pkey = true;
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 86be8bf27b41..81a482c3e051 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -3,6 +3,7 @@
 #define _LINUX_PKEYS_H
 
 #include <linux/mm.h>
+#include <linux/mman.h>
 
 #define ARCH_DEFAULT_PKEY	0
 
@@ -25,7 +26,7 @@ static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 	return (pkey == 0);
 }
 
-static inline int mm_pkey_alloc(struct mm_struct *mm)
+static inline int mm_pkey_alloc(struct mm_struct *mm, unsigned long flags)
 {
 	return -1;
 }
@@ -46,6 +47,12 @@ static inline bool arch_pkeys_enabled(void)
 	return false;
 }
 
+static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
+{
+	if (flags)
+		return false;
+	return true;
+}
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..8c69b9a7ff5b 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,9 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+/*
+ * Flags for pkey_alloc
+ */
+#define PKEY_ENFORCE_API (1 << 0)
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..8a68fdca8487 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -894,15 +894,15 @@ SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
 	int pkey;
 	int ret;
 
-	/* No flags supported yet. */
-	if (flags)
+	if (!arch_check_pkey_alloc_flags(flags))
 		return -EINVAL;
+
 	/* check for unsupported init values */
 	if (init_val & ~PKEY_ACCESS_MASK)
 		return -EINVAL;
 
 	mmap_write_lock(current->mm);
-	pkey = mm_pkey_alloc(current->mm);
+	pkey = mm_pkey_alloc(current->mm, flags);
 
 	ret = -ENOSPC;
 	if (pkey == -1)
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api()
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
  2023-05-15 13:05 ` [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag jeffxu
@ 2023-05-15 13:05 ` jeffxu
  2023-05-18 21:43   ` Dave Hansen
  2023-05-15 13:05 ` [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect jeffxu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

This patch adds an architecture-independent function,
arch_check_pkey_enforce_api(), that checks whether the calling thread
has write access to the PKRU for a given range of memory. If the
memory range is protected by PKEY_ENFORCE_API, then the thread must
have write access to the PKRU in order to make changes to the memory
mapping (such as mprotect/munmap).

This function is used by the kernel to enforce the
PKEY_ENFORCE_API flag.

Signed-off-by: Jeff Xu<jeffxu@google.com>
---
 arch/powerpc/include/asm/pkeys.h |  8 +++++
 arch/x86/include/asm/pkeys.h     | 50 ++++++++++++++++++++++++++++++++
 include/linux/pkeys.h            |  9 ++++++
 3 files changed, 67 insertions(+)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 943333ac0fee..24c481e5e95b 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -177,5 +177,13 @@ static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
 	return true;
 }
 
+static inline int arch_check_pkey_enforce_api(struct mm_struct *mm,
+					      unsigned long start,
+					      unsigned long end)
+{
+	/* Allow by default */
+	return 0;
+}
+
 extern void pkey_mm_init(struct mm_struct *mm);
 #endif /*_ASM_POWERPC_KEYS_H */
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index ecadf04a8251..8b94ffc4ca32 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -161,4 +161,54 @@ static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
 
 	return true;
 }
+
+static inline int __arch_check_vma_pkey_for_write(struct vm_area_struct *vma)
+{
+	int pkey = vma_pkey(vma);
+
+	if (mm_pkey_enforce_api(vma->vm_mm, pkey)) {
+		if (!__pkru_allows_write(read_pkru(), pkey))
+			return -EACCES;
+	}
+
+	return 0;
+}
+
+/*
+ * arch_check_pkey_enforce_api is used by the kernel to enforce
+ * PKEY_ENFORCE_API flag.
+ * It checks whether the calling thread  has write access to the PKRU
+ * for a given range of memory. If the  memory range is protected by
+ * PKEY_ENFORCE_API, then the thread must  have write access to the
+ * PKRU in order to make changes to the memory  mapping, such as
+ * mprotect/munmap.
+ */
+static inline int arch_check_pkey_enforce_api(struct mm_struct *mm,
+					      unsigned long start,
+					      unsigned long end)
+{
+	int error;
+	struct vm_area_struct *vma;
+
+	if (!arch_pkeys_enabled())
+		return 0;
+
+	while (true) {
+		vma = find_vma_intersection(mm, start, end);
+		if (!vma)
+			break;
+
+		error = __arch_check_vma_pkey_for_write(vma);
+		if (error)
+			return error;
+
+		if (vma->vm_end >= end)
+			break;
+
+		start = vma->vm_end;
+	}
+
+	return 0;
+}
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 81a482c3e051..7b00689e1c24 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -53,6 +53,15 @@ static inline bool arch_check_pkey_alloc_flags(unsigned long flags)
 		return false;
 	return true;
 }
+
+static inline int arch_check_pkey_enforce_api(struct mm_struct *mm,
+					      unsigned long start,
+					      unsigned long end)
+{
+	// Allow by default.
+	return 0;
+}
+
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
  2023-05-15 13:05 ` [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag jeffxu
  2023-05-15 13:05 ` [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api() jeffxu
@ 2023-05-15 13:05 ` jeffxu
  2023-05-16 20:07   ` Kees Cook
  2023-05-16 23:18   ` Dave Hansen
  2023-05-15 13:05 ` [PATCH 4/6] PKEY:selftest pkey_enforce_api for mprotect jeffxu
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

This patch enables PKEY_ENFORCE_API for the mprotect and
mprotect_pkey syscalls.

Signed-off-by: Jeff Xu<jeffxu@google.com>
---
 mm/mprotect.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8a68fdca8487..1378be50567d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -727,9 +727,13 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 
 /*
  * pkey==-1 when doing a legacy mprotect()
+ * syscall==true if this is called by syscall from userspace.
+ * Note: this is always true for now, added as a reminder in case that
+ * do_mprotect_pkey is called directly by kernel in the future.
+ * Also it is consistent with __do_munmap().
  */
 static int do_mprotect_pkey(unsigned long start, size_t len,
-		unsigned long prot, int pkey)
+		unsigned long prot, int pkey, bool syscall)
 {
 	unsigned long nstart, end, tmp, reqprot;
 	struct vm_area_struct *vma, *prev;
@@ -794,6 +798,21 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		}
 	}
 
+	/*
+	 * When called by syscall from userspace, check if the calling
+	 * thread has the PKEY permission to modify the memory mapping.
+	 */
+	if (syscall &&
+	    arch_check_pkey_enforce_api(current->mm, start, end) < 0) {
+		char comm[TASK_COMM_LEN];
+
+		pr_warn_ratelimited(
+			"munmap was denied on PKEY_ENFORCE_API memory, pid=%d '%s'\n",
+			task_pid_nr(current), get_task_comm(comm, current));
+		error = -EACCES;
+		goto out;
+	}
+
 	prev = vma_prev(&vmi);
 	if (start > vma->vm_start)
 		prev = vma;
@@ -878,7 +897,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot)
 {
-	return do_mprotect_pkey(start, len, prot, -1);
+	return do_mprotect_pkey(start, len, prot, -1, true);
 }
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -886,7 +905,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
 SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
 		unsigned long, prot, int, pkey)
 {
-	return do_mprotect_pkey(start, len, prot, pkey);
+	return do_mprotect_pkey(start, len, prot, pkey, true);
 }
 
 SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/6] PKEY:selftest pkey_enforce_api for mprotect
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
                   ` (2 preceding siblings ...)
  2023-05-15 13:05 ` [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect jeffxu
@ 2023-05-15 13:05 ` jeffxu
  2023-05-15 13:05 ` [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap jeffxu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

Add selftest for pkey_enforce_api for mprotect.

Signed-off-by: Jeff Xu<jeffxu@google.com>
---
 tools/testing/selftests/mm/Makefile           |   1 +
 tools/testing/selftests/mm/pkey_enforce_api.c | 875 ++++++++++++++++++
 2 files changed, 876 insertions(+)
 create mode 100644 tools/testing/selftests/mm/pkey_enforce_api.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 23af4633f0f4..93437a394128 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -71,6 +71,7 @@ CAN_BUILD_X86_64 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_64bit_pr
 CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie)
 
 VMTARGETS := protection_keys
+VMTARGETS += pkey_enforce_api
 BINARIES_32 := $(VMTARGETS:%=%_32)
 BINARIES_64 := $(VMTARGETS:%=%_64)
 
diff --git a/tools/testing/selftests/mm/pkey_enforce_api.c b/tools/testing/selftests/mm/pkey_enforce_api.c
new file mode 100644
index 000000000000..23663c89bc9c
--- /dev/null
+++ b/tools/testing/selftests/mm/pkey_enforce_api.c
@@ -0,0 +1,875 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Tests pkey_enforce_api
+ *
+ * Compile like this:
+ * gcc -mxsave      -o pkey_enforce_api    -O2 -g -std=gnu99 -pthread -Wall pkey_enforce_api.c \
+ * -lrt -ldl -lm
+ * gcc -mxsave -m32 -o pkey_enforce_api_32 -O2 -g -std=gnu99 -pthread -Wall pkey_enforce_api.c \
+ * -lrt -ldl -lm
+ */
+#define _GNU_SOURCE
+#define __SANE_USERSPACE_TYPES__
+#include <errno.h>
+#include <linux/elf.h>
+#include <linux/futex.h>
+#include <pthread.h>
+#include <time.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <signal.h>
+#include <assert.h>
+#include <stdlib.h>
+#include <ucontext.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/ptrace.h>
+#include <setjmp.h>
+#include "../kselftest.h"
+#include <sys/prctl.h>
+
+#if defined(__i386__) || defined(__x86_64__) /* arch */
+
+#define dprintf0(args...)
+#define dprintf1(args...)
+#define dprintf2(args...)
+#define dprintf3(args...)
+#define dprintf4(args...)
+
+#ifndef u16
+#define u16 __u16
+#endif
+
+#ifndef u32
+#define u32 __u32
+#endif
+
+#ifndef u64
+#define u64 __u64
+#endif
+
+#ifndef PTR_ERR_ENOTSUP
+#define PTR_ERR_ENOTSUP ((void *)-ENOTSUP)
+#endif
+
+int read_ptr(int *ptr)
+{
+	return *ptr;
+}
+
+void expected_pkey_fault(int pkey)
+{
+}
+
+#include "pkey-x86.h"
+
+#ifndef PKEY_ENFORCE_API
+#define PKEY_ENFORCE_API 1
+#endif
+
+#define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
+
+#define LOG_TEST_ENTER(x)                                                      \
+	{                                                                      \
+		printf("%s, enforce=%d\n", __func__, x);                       \
+	}
+static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags)
+{
+	u32 shift = pkey_bit_position(pkey);
+	/* mask out bits from pkey in old value */
+	reg &= ~((u64)PKEY_MASK << shift);
+	/* OR in new bits for pkey */
+	reg |= (flags & PKEY_MASK) << shift;
+	return reg;
+}
+
+static inline u64 get_pkey_bits(u64 reg, int pkey)
+{
+	u32 shift = pkey_bit_position(pkey);
+	/*
+	 * shift down the relevant bits to the lowest two, then
+	 * mask off all the other higher bits
+	 */
+	return ((reg >> shift) & PKEY_MASK);
+}
+
+static u32 get_pkey(int pkey)
+{
+	return (u32)get_pkey_bits(__read_pkey_reg(), pkey);
+}
+
+static void set_pkey(int pkey, unsigned long pkey_value)
+{
+	u32 mask = (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE);
+	u64 new_pkey_reg;
+
+	assert(!(pkey_value & ~mask));
+	new_pkey_reg = set_pkey_bits(__read_pkey_reg(), pkey, pkey_value);
+	__write_pkey_reg(new_pkey_reg);
+}
+
+void pkey_disable_set(int pkey, int value)
+{
+	int pkey_new;
+
+	assert(value & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+	pkey_new = get_pkey(pkey);
+	pkey_new |= value;
+	set_pkey(pkey, pkey_new);
+}
+
+void pkey_disable_clear(int pkey, int value)
+{
+	int pkey_new;
+
+	assert(value & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+
+	pkey_new = get_pkey(pkey);
+	pkey_new &= ~value;
+
+	set_pkey(pkey, pkey_new);
+}
+
+void pkey_write_allow(int pkey)
+{
+	pkey_disable_clear(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_write_deny(int pkey)
+{
+	pkey_disable_set(pkey, PKEY_DISABLE_WRITE);
+}
+void pkey_access_allow(int pkey)
+{
+	pkey_disable_clear(pkey, PKEY_DISABLE_ACCESS);
+}
+void pkey_access_deny(int pkey)
+{
+	pkey_disable_set(pkey, PKEY_DISABLE_ACCESS);
+}
+
+int sys_mprotect(void *ptr, size_t size, unsigned long prot)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(SYS_mprotect, ptr, size, prot);
+	return sret;
+}
+
+int sys_mprotect_pkey(void *ptr, size_t size, unsigned long orig_prot,
+		      unsigned long pkey)
+{
+	int sret;
+
+	errno = 0;
+	sret = syscall(SYS_mprotect_key, ptr, size, orig_prot, pkey);
+	return sret;
+}
+
+int sys_pkey_alloc(unsigned long flags, unsigned long init_val)
+{
+	int ret = syscall(SYS_pkey_alloc, flags, init_val);
+	return ret;
+}
+
+int sys_pkey_free(unsigned long pkey)
+{
+	int ret = syscall(SYS_pkey_free, pkey);
+	return ret;
+}
+
+bool can_create_pkey(void)
+{
+	int pkey;
+
+	pkey = sys_pkey_alloc(0, 0);
+	if (pkey <= 0)
+		return false;
+
+	sys_pkey_free(pkey);
+	return true;
+}
+
+static inline int is_pkeys_supported(void)
+{
+	/* check if the cpu supports pkeys */
+	if (!cpu_has_pkeys() || !can_create_pkey())
+		return 0;
+	return 1;
+}
+
+int pkey_alloc_with_check(bool enforce)
+{
+	int pkey;
+
+	if (enforce)
+		pkey = sys_pkey_alloc(PKEY_ENFORCE_API, 0);
+	else
+		pkey = sys_pkey_alloc(0, 0);
+
+	assert(pkey > 0);
+	return pkey;
+}
+
+void *addr1 = (void *)0x5000000;
+void *addr2 = (void *)0x5001000;
+void *addr3 = (void *)0x5002000;
+void *addr4 = (void *)0x5003000;
+
+void setup_single_address_with_pkey(bool enforce, int size, int *pkeyOut,
+				    void **ptrOut)
+{
+	int pkey;
+	void *ptr;
+	int ret;
+
+	pkey = pkey_alloc_with_check(enforce);
+
+	ptr = mmap(NULL, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr != (void *)-1);
+
+	// assign pkey to the memory.
+	ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ, pkey);
+	assert(!ret);
+
+	*pkeyOut = pkey;
+	*ptrOut = ptr;
+}
+
+void setup_single_fixed_address_with_pkey(bool enforce, int size, int *pkeyOut,
+					  void **ptrOut)
+{
+	int pkey;
+	void *ptr;
+	int ret;
+
+	pkey = pkey_alloc_with_check(enforce);
+
+	ptr = mmap(addr1, size, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	assert(ptr == (void *)addr1);
+
+	// assign pkey to the memory.
+	ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ, pkey);
+	assert(!ret);
+
+	*pkeyOut = pkey;
+	*ptrOut = ptr;
+}
+
+void clean_single_address_with_pkey(int pkey, void *ptr, int size)
+{
+	int ret;
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+void setup_two_continues_fixed_address_with_pkey(bool enforce, int size,
+						 int *pkeyOut, void **ptrOut,
+						 void **ptr2Out)
+{
+	void *ptr;
+	void *ptr2;
+	int pkey;
+	int ret;
+
+	pkey = pkey_alloc_with_check(enforce);
+
+	ptr = mmap(addr1, size, PROT_READ,
+		   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr == addr1);
+
+	ptr2 = mmap(addr2, size, PROT_READ,
+		    MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr2 == addr2);
+
+	// assign pkey to both addresses in the same call (merged)
+	ret = sys_mprotect_pkey(ptr, size * 2,
+				PROT_READ | PROT_WRITE | PROT_EXEC, pkey);
+	assert(!ret);
+	*pkeyOut = pkey;
+	*ptrOut = ptr;
+	*ptr2Out = ptr2;
+}
+
+void clean_two_address_with_pkey(int size, int pkey, void *ptr, void *ptr2)
+{
+	int ret;
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+
+	ret = munmap(ptr2, size);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// pkey_alloc with flags.
+void test_pkey_alloc(bool enforce)
+{
+	int ret;
+
+	LOG_TEST_ENTER(enforce);
+
+	ret = sys_pkey_alloc(0, 0);
+	assert(ret > 0);
+	ret = sys_pkey_free(ret);
+	assert(ret == 0);
+
+	if (enforce) {
+		ret = sys_pkey_alloc(PKEY_ENFORCE_API, 0);
+		assert(ret > 0);
+		ret = sys_pkey_free(ret);
+		assert(ret == 0);
+
+		// invalid flag.
+		ret = sys_pkey_alloc(0x4, 0);
+		assert(ret != 0);
+	}
+}
+
+// mmap one address.
+// assign pkey on the address.
+// mprotect is denied when no-writeable PKRU in enforce mode.
+void test_mprotect_single_address(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_single_fixed_address_with_pkey(enforce, size, &pkey, &ptr);
+
+	// disable write access.
+	pkey_write_deny(pkey);
+
+	ret = sys_mprotect_pkey(ptr, size, PROT_READ | PROT_WRITE, pkey);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(ret == 0);
+
+	pkey_write_allow(pkey);
+
+	ret = sys_mprotect_pkey(ptr, size, PROT_READ, pkey);
+	assert(!ret);
+
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	assert(ret == 0);
+
+	clean_single_address_with_pkey(pkey, ptr, size);
+}
+
+// mmap two address (continuous two pages).
+// assign PKEY to them with one mprotect_pkey call (merged address).
+// mprotect is denied when non-writeable PKRU in enforce mode.
+void test_mprotect_two_address_merge(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	void *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_two_continues_fixed_address_with_pkey(enforce, size, &pkey, &ptr,
+						    &ptr2);
+
+	// disable write.
+	pkey_write_deny(pkey);
+
+	// modify the protection on both addresses (merged).
+	ret = sys_mprotect(ptr, size * 2, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect_pkey(ptr, size * 2,
+				PROT_READ | PROT_WRITE | PROT_EXEC, pkey);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	// modify the protection on both addresses (merged).
+	ret = sys_mprotect(ptr, size * 2, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	ret = sys_mprotect_pkey(ptr, size * 2,
+				PROT_READ | PROT_WRITE | PROT_EXEC, pkey);
+	assert(!ret);
+
+	clean_two_address_with_pkey(size, pkey, ptr, ptr2);
+}
+
+void setup_two_continues_fixed_address_protect_second_with_pkey(
+	bool enforce, int size, int *pkeyOut, void **ptrOut, void **ptr2Out)
+{
+	void *ptr;
+	void *ptr2;
+	int pkey;
+	int ret;
+
+	LOG_TEST_ENTER(enforce);
+
+	pkey = pkey_alloc_with_check(enforce);
+
+	// mmap two addresses (continuous two pages).
+	ptr = mmap(addr1, size, PROT_READ,
+		   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr == addr1);
+
+	ptr2 = mmap(addr2, size, PROT_READ,
+		    MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr2 == addr2);
+
+	// assign pkey to the second page.
+	ret = sys_mprotect_pkey(addr2, size, PROT_READ | PROT_WRITE | PROT_EXEC,
+				pkey);
+	assert(!ret);
+
+	*pkeyOut = pkey;
+	*ptrOut = ptr;
+	*ptr2Out = ptr2;
+}
+
+// mmap two address (continuous two pages).
+// assign PKEY to the second address.
+// mprotect on the second address is denied properly.
+// mprotect on both addresses (merged) is denied properly.
+void test_mprotect_two_address_deny_second(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	void *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_two_continues_fixed_address_protect_second_with_pkey(
+		enforce, size, &pkey, &ptr, &ptr2);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// modify the first addr is allowed.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	// modify the second mmap is protected by pkey.
+	ret = sys_mprotect(ptr2, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// mprotect both addresses (merged).
+	ret = sys_mprotect_pkey(ptr, size * 2,
+				PROT_READ | PROT_WRITE | PROT_EXEC, pkey);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr, size * 2, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	ret = sys_mprotect_pkey(ptr, size * 2, PROT_READ, pkey);
+	assert(!ret);
+
+	ret = sys_mprotect(ptr, size * 2, PROT_READ);
+	assert(!ret);
+
+	clean_two_address_with_pkey(size, pkey, ptr, ptr2);
+}
+
+void setup_4pages_fixed_protect_second_page(bool enforce, int size,
+					    int *pkeyOut, void **ptrOut,
+					    void **ptr2Out, void **ptr3Out)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+
+	pkey = pkey_alloc_with_check(enforce);
+
+	// allocate 4 pages.
+	ptr = mmap(addr1, size * 4, PROT_READ,
+		   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr == addr1);
+
+	// assign pkey to the second address.
+	ret = sys_mprotect_pkey(addr2, size, PROT_READ | PROT_WRITE | PROT_EXEC,
+				pkey);
+	assert(!ret);
+
+	*pkeyOut = pkey;
+	*ptrOut = ptr;
+	*ptr2Out = addr2;
+	*ptr3Out = addr3;
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// mprotect on the first page is allowed.
+// mprotect on the second page is protected in enforce mode.
+// mprotect on memory range that includes the second pages is protected.
+void test_mprotect_vma_middle_addr(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// modify to the first page is allowed.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	// modify to the third page is allowed.
+	ret = sys_mprotect(ptr3, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	// modify to the second page is protected by pkey.
+	ret = sys_mprotect(ptr2, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	// modify to memory range that includes the second page is protected.
+	ret = sys_mprotect_pkey(ptr, size * 4,
+				PROT_READ | PROT_WRITE | PROT_EXEC, pkey);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr, size * 4, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	ret = sys_mprotect(addr2, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	ret = sys_mprotect_pkey(ptr, size * 4,
+				PROT_READ | PROT_WRITE | PROT_EXEC, pkey);
+	assert(!ret);
+
+	clean_single_address_with_pkey(pkey, ptr, size * 4);
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// mprotect on the second page, but size is unaligned.
+void test_mprotect_unaligned(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// modify to the first page is allowed.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	// modify to the second page is protected by pkey.
+	ret = sys_mprotect(ptr2, size - 1, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	ret = sys_mprotect(addr2, size - 1, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	clean_single_address_with_pkey(pkey, ptr, size * 4);
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// mprotect on the second page, but size is unaligned.
+void test_mprotect_unaligned2(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// modify to the first page is allowed.
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	// modify to the second page is protected by pkey.
+	ret = sys_mprotect(ptr2, size + 1, PROT_READ | PROT_WRITE | PROT_EXEC);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	ret = sys_mprotect(addr2, size + 1, PROT_READ | PROT_WRITE | PROT_EXEC);
+	assert(!ret);
+
+	clean_single_address_with_pkey(pkey, ptr, size * 4);
+}
+
+void setup_address_with_gap_two_pkeys(bool enforce, int size, int *pkeyOut,
+				      int *pkey2Out, void **ptrOut,
+				      void **ptr2Out)
+{
+	int pkey, pkey2;
+	void *ptr, *ptr2;
+	int ret;
+
+	pkey = pkey_alloc_with_check(enforce);
+	pkey2 = pkey_alloc_with_check(enforce);
+
+	ptr = mmap(addr1, size, PROT_READ,
+		   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr == (void *)addr1);
+
+	ptr2 = mmap(addr3, size, PROT_READ,
+		    MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr2 == (void *)addr3);
+
+	// assign pkey to the memory.
+	ret = sys_mprotect_pkey((void *)ptr, size, PROT_READ, pkey);
+	assert(!ret);
+
+	// assign pkey to the memory.
+	ret = sys_mprotect_pkey((void *)ptr2, size, PROT_READ, pkey2);
+	assert(!ret);
+
+	*pkeyOut = pkey;
+	*ptrOut = ptr;
+
+	*pkey2Out = pkey2;
+	*ptr2Out = ptr2;
+}
+
+void clean_address_with_pag_two_pkeys(int pkey, void *ptr, int pkey2,
+				      void *ptr2, int size)
+{
+	int ret;
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+
+	ret = munmap(ptr2, size);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey2);
+	assert(ret == 0);
+}
+
+// mmap two addresses, with a page gap between two.
+// assign pkeys on both address.
+// disable access to the second address.
+// mprotect from start of address1 to the end of address 2,
+// because there is a gap in the memory range, mprotect will fail.
+void test_mprotect_gapped_address_with_two_pkeys(bool enforce)
+{
+	int pkey, pkey2;
+	int ret;
+	void *ptr, *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_address_with_gap_two_pkeys(enforce, size, &pkey, &pkey2, &ptr,
+					 &ptr2);
+
+	// disable write access.
+	pkey_write_deny(pkey2);
+
+	ret = sys_mprotect_pkey(ptr, size * 3, PROT_READ | PROT_WRITE, pkey);
+	assert(ret < 0);
+
+	ret = sys_mprotect(ptr, size * 3, PROT_READ | PROT_WRITE);
+	assert(ret < 0);
+
+	pkey_write_allow(pkey2);
+
+	ret = sys_mprotect_pkey(ptr, size * 3, PROT_READ, pkey);
+	assert(ret < 0);
+
+	ret = sys_mprotect(ptr, size * 3, PROT_READ);
+	assert(ret < 0);
+
+	clean_address_with_pag_two_pkeys(pkey, ptr, pkey2, ptr2, size);
+}
+
+struct thread_info {
+	int pkey;
+	void *addr;
+	int size;
+	bool enforce;
+};
+
+void *thread_mprotect(void *arg)
+{
+	struct thread_info *tinfo = arg;
+	void *ptr = tinfo->addr;
+	int size = tinfo->size;
+	bool enforce = tinfo->enforce;
+	int pkey = tinfo->pkey;
+	int ret;
+
+	// disable write access.
+	pkey_write_deny(pkey);
+	ret = sys_mprotect_pkey(ptr, size, PROT_READ | PROT_WRITE, pkey);
+
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = sys_mprotect(ptr, size, PROT_READ | PROT_WRITE);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(ret == 0);
+
+	pkey_write_allow(pkey);
+
+	ret = sys_mprotect_pkey(ptr, size, PROT_READ, pkey);
+	assert(!ret);
+
+	ret = sys_mprotect(ptr, size, PROT_READ);
+	assert(ret == 0);
+	return NULL;
+}
+
+// mmap one address.
+// assign pkey on the address.
+// in child thread, mprotect is denied when no-writeable PKRU in enforce mode.
+void test_mprotect_child_thread(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	int size = PAGE_SIZE;
+	pthread_t thread;
+	struct thread_info tinfo;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_single_fixed_address_with_pkey(enforce, size, &pkey, &ptr);
+	tinfo.size = size;
+	tinfo.addr = ptr;
+	tinfo.enforce = enforce;
+	tinfo.pkey = pkey;
+
+	ret = pthread_create(&thread, NULL, thread_mprotect, (void *)&tinfo);
+	assert(ret == 0);
+	pthread_join(thread, NULL);
+
+	clean_single_address_with_pkey(pkey, ptr, size);
+}
+
+void test_enforce_api(void)
+{
+	for (int i = 0; i < 2; i++) {
+		bool enforce = (i == 1);
+
+		test_pkey_alloc(enforce);
+
+		test_mprotect_single_address(enforce);
+		test_mprotect_two_address_merge(enforce);
+		test_mprotect_two_address_deny_second(enforce);
+		test_mprotect_vma_middle_addr(enforce);
+		test_mprotect_unaligned(enforce);
+		test_mprotect_unaligned2(enforce);
+		test_mprotect_child_thread(enforce);
+		test_mprotect_gapped_address_with_two_pkeys(enforce);
+	}
+}
+
+int main(void)
+{
+	int pkeys_supported = is_pkeys_supported();
+
+	printf("pid: %d\n", getpid());
+	printf("has pkeys: %d\n", pkeys_supported);
+	if (!pkeys_supported) {
+		printf("PKEY not supported, skip the test.\n");
+		exit(0);
+	}
+
+	test_enforce_api();
+	printf("done (all tests OK)\n");
+	return 0;
+}
+#else /* arch */
+int main(void)
+{
+	printf("SKIP: not supported arch\n");
+	return 0;
+}
+#endif /* arch */
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
                   ` (3 preceding siblings ...)
  2023-05-15 13:05 ` [PATCH 4/6] PKEY:selftest pkey_enforce_api for mprotect jeffxu
@ 2023-05-15 13:05 ` jeffxu
  2023-05-16 20:06   ` Kees Cook
  2023-05-16 23:23   ` Dave Hansen
  2023-05-15 13:05 ` [PATCH 6/6] PKEY:selftest pkey_enforce_api for munmap jeffxu
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

This patch enables PKEY_ENFORCE_API for the munmap
syscall.

Signed-off-by: Jeff Xu<jeffxu@google.com>
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 34 ++++++++++++++++++++++++++--------
 mm/mremap.c        |  6 ++++--
 3 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..48076e845d53 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3136,7 +3136,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
-			 bool downgrade);
+			 bool downgrade, bool syscall);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
diff --git a/mm/mmap.c b/mm/mmap.c
index 13678edaa22c..29329aa794a6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2498,6 +2498,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  * @uf: The userfaultfd list_head
  * @downgrade: set to true if the user wants to attempt to write_downgrade the
  * mmap_lock
+ * @syscall: set to true if this is called from syscall entry
  *
  * This function takes a @mas that is either pointing to the previous VMA or set
  * to MA_START and sets it up to remove the mapping(s).  The @len will be
@@ -2507,7 +2508,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  */
 int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
-		  bool downgrade)
+		  bool downgrade, bool syscall)
 {
 	unsigned long end;
 	struct vm_area_struct *vma;
@@ -2519,6 +2520,19 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
+	/*
+	 * When called by syscall from userspace, check if the calling
+	 * thread has the PKEY permission to modify the memory mapping.
+	 */
+	if (syscall && arch_check_pkey_enforce_api(mm, start, end) < 0) {
+		char comm[TASK_COMM_LEN];
+
+		pr_warn_ratelimited(
+			"munmap was denied on PKEY_ENFORCE_API memory, pid=%d '%s'\n",
+			task_pid_nr(current), get_task_comm(comm, current));
+		return -EACCES;
+	}
+
 	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
@@ -2541,7 +2555,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 {
 	VMA_ITERATOR(vmi, mm, start);
 
-	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
+	return do_vmi_munmap(&vmi, mm, start, len, uf, false, false);
 }
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
@@ -2575,7 +2589,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Unmap any existing mapping in the area */
-	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false))
+	if (do_vmi_munmap(&vmi, mm, addr, len, uf, false, false))
 		return -ENOMEM;
 
 	/*
@@ -2792,7 +2806,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return error;
 }
 
-static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
+/*
+ * @syscall: set to true if this is called from syscall entry
+ */
+static int __vm_munmap(unsigned long start, size_t len, bool downgrade,
+		       bool syscall)
 {
 	int ret;
 	struct mm_struct *mm = current->mm;
@@ -2802,7 +2820,7 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	ret = do_vmi_munmap(&vmi, mm, start, len, &uf, downgrade);
+	ret = do_vmi_munmap(&vmi, mm, start, len, &uf, downgrade, syscall);
 	/*
 	 * Returning 1 indicates mmap_lock is downgraded.
 	 * But 1 is not legal return value of vm_munmap() and munmap(), reset
@@ -2820,14 +2838,14 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 
 int vm_munmap(unsigned long start, size_t len)
 {
-	return __vm_munmap(start, len, false);
+	return __vm_munmap(start, len, false, false);
 }
 EXPORT_SYMBOL(vm_munmap);
 
 SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
 {
 	addr = untagged_addr(addr);
-	return __vm_munmap(addr, len, true);
+	return __vm_munmap(addr, len, true, true);
 }
 
 
@@ -3055,7 +3073,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 	if (ret)
 		goto limits_failed;
 
-	ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0);
+	ret = do_vmi_munmap(&vmi, mm, addr, len, &uf, 0, false);
 	if (ret)
 		goto munmap_failed;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..768e5bd4e8b5 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -703,7 +703,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	}
 
 	vma_iter_init(&vmi, mm, old_addr);
-	if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
+	if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false, false) <
+	    0) {
 		/* OOM: unable to split vma, just get accounts right */
 		if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
 			vm_acct_memory(old_len >> PAGE_SHIFT);
@@ -993,7 +994,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		VMA_ITERATOR(vmi, mm, addr + new_len);
 
 		retval = do_vmi_munmap(&vmi, mm, addr + new_len,
-				       old_len - new_len, &uf_unmap, true);
+				       old_len - new_len, &uf_unmap, true,
+				       false);
 		/* Returning 1 indicates mmap_lock is downgraded to read. */
 		if (retval == 1) {
 			downgraded = true;
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/6] PKEY:selftest pkey_enforce_api for munmap
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
                   ` (4 preceding siblings ...)
  2023-05-15 13:05 ` [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap jeffxu
@ 2023-05-15 13:05 ` jeffxu
  2023-05-15 14:28 ` [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 Dave Hansen
  2023-05-16 20:08 ` Kees Cook
  7 siblings, 0 replies; 44+ messages in thread
From: jeffxu @ 2023-05-15 13:05 UTC (permalink / raw)
  To: dave.hansen, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

From: Jeff Xu <jeffxu@google.com>

Add selftest for pkey_enforce_api for mprotect

Signed-off-by: Jeff Xu<jeffxu@google.com>
---
 tools/testing/selftests/mm/pkey_enforce_api.c | 437 ++++++++++++++++++
 1 file changed, 437 insertions(+)

diff --git a/tools/testing/selftests/mm/pkey_enforce_api.c b/tools/testing/selftests/mm/pkey_enforce_api.c
index 23663c89bc9c..92aa29248e1f 100644
--- a/tools/testing/selftests/mm/pkey_enforce_api.c
+++ b/tools/testing/selftests/mm/pkey_enforce_api.c
@@ -833,6 +833,429 @@ void test_mprotect_child_thread(bool enforce)
 	clean_single_address_with_pkey(pkey, ptr, size);
 }
 
+// mmap one address with one page.
+// assign PKEY to the address.
+// munmap on the address is protected.
+void test_munmap_single_address(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_single_address_with_pkey(enforce, size, &pkey, &ptr);
+
+	// disable write access.
+	pkey_write_deny(pkey);
+
+	ret = munmap(ptr, size);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr, size);
+		assert(!ret);
+	}
+
+	clean_single_address_with_pkey(pkey, ptr, size);
+}
+
+// mmap two address (continuous two pages).
+// assign PKEY to them with one mprotect_pkey call (merged address).
+// munmap two address in one call (merged address).
+void test_munmap_two_address_merge(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	void *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_two_continues_fixed_address_with_pkey(enforce, size, &pkey, &ptr,
+						    &ptr2);
+
+	// disable write.
+	pkey_write_deny(pkey);
+
+	// munmap on both addresses with one call (merged).
+	ret = munmap(ptr, size * 2);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr, size * 2);
+		assert(!ret);
+	}
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap two address (continuous two pages).
+// assign PKEY to the second address.
+// munmap on the second address is protected.
+void test_munmap_two_address_deny_second(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	void *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_two_continues_fixed_address_protect_second_with_pkey(
+		enforce, size, &pkey, &ptr, &ptr2);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	ret = munmap(ptr2, size);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr2, size);
+		assert(!ret);
+	}
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap two address (continuous two pages).
+// assign PKEY to the second address.
+// munmap on the range that includes the second address.
+void test_munmap_two_address_deny_range(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	void *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_two_continues_fixed_address_protect_second_with_pkey(
+		enforce, size, &pkey, &ptr, &ptr2);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	ret = munmap(ptr, size * 2);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr, size * 2);
+		assert(!ret);
+	}
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// munmap on memory range that includes the second pages is protected.
+void test_munmap_vma_middle_addr(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// munmap support merge, we are going to make sure we don't regress.
+	ret = munmap(addr1, size * 4);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr, size * 4);
+		assert(!ret);
+	}
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// munmap from 2nd page.
+void test_munmap_shrink(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// munmap support merge, we are going to make sure we don't regress.
+	ret = munmap(ptr2, size * 3);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr2, size * 3);
+		assert(!ret);
+	}
+
+	ret = munmap(ptr, size);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// munmap from 2nd page but size is less than one page
+void test_munmap_unaligned(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// munmap support merge, we are going to make sure we don't regress.
+	ret = munmap(ptr2, size - 1);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr2, size - 1);
+		assert(!ret);
+	}
+
+	ret = munmap(ptr, size * 4);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap one address with 4 pages.
+// assign PKEY to the second page only.
+// munmap from 2nd page but size is less than one page
+void test_munmap_unaligned2(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr, *ptr2, *ptr3;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_4pages_fixed_protect_second_page(enforce, size, &pkey, &ptr,
+					       &ptr2, &ptr3);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// munmap support merge, we are going to make sure we don't regress.
+	ret = munmap(ptr2, size + 1);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(ptr2, size + 1);
+		assert(!ret);
+	}
+
+	ret = munmap(ptr, size * 4);
+	assert(!ret);
+
+	ret = sys_pkey_free(pkey);
+	assert(ret == 0);
+}
+
+// mmap one address with one page.
+// assign PKEY to the address.
+// munmap on the address but with size of 4 pages(should OK).
+void test_munmap_outbound_addr(bool enforce)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_single_fixed_address_with_pkey(enforce, size, &pkey, &ptr);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	// Interesting enough, this is allowed, even the other 3 pages are not allocated.
+	ret = munmap(addr1, size * 4);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey);
+
+	if (enforce) {
+		ret = munmap(addr1, size * 4);
+		assert(!ret);
+	}
+
+	clean_single_address_with_pkey(pkey, ptr, size);
+}
+// mmap two addresses, with a page gap between two.
+// assign pkeys on both address.
+// disable access to the second address.
+// munmap from start of address1 to the end of address 2,
+// because there is a gap in the memory range, mprotect will fail.
+void test_munmap_gapped_address_with_two_pkeys(bool enforce)
+{
+	int pkey, pkey2;
+	int ret;
+	void *ptr, *ptr2;
+	int size = PAGE_SIZE;
+
+	LOG_TEST_ENTER(enforce);
+
+	setup_address_with_gap_two_pkeys(enforce, size, &pkey, &pkey2, &ptr,
+					 &ptr2);
+
+	// disable write access.
+	pkey_write_deny(pkey2);
+
+	// Interesting enough, this is allowed, even there is a gap beween address 1 and 2.
+	ret = munmap(addr1, size * 3);
+	if (enforce)
+		assert(ret < 0);
+	else
+		assert(!ret);
+
+	pkey_write_allow(pkey2);
+	if (enforce) {
+		ret = munmap(addr1, size * 3);
+		assert(!ret);
+	}
+}
+
+// use write-deny pkey and see if program can exit properly.
+// This is manual test, run it at end if needed.
+void test_exit_munmap_disable_write(void)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	int size = PAGE_SIZE;
+
+	pkey = sys_pkey_alloc(PKEY_ENFORCE_API, 0);
+	assert(pkey > 0);
+
+	// allocate 1 page.
+	ptr = mmap(addr1, size, PROT_READ,
+		   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr == addr1);
+
+	// assign pkey to the first address.
+	ret = sys_mprotect_pkey(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
+				pkey);
+	assert(!ret);
+
+	// disable write through pkey.
+	pkey_write_deny(pkey);
+
+	ret = munmap(ptr, size);
+	assert(ret < 0);
+}
+
+// use disable-all pkey and see if program can exit properly.
+// This is manual test, run it at end if needed.
+void test_exit_munmap_disable_all(void)
+{
+	int pkey;
+	int ret;
+	void *ptr;
+	int size = PAGE_SIZE;
+
+	pkey = sys_pkey_alloc(PKEY_ENFORCE_API, 0);
+	assert(pkey > 0);
+
+	// allocate 1 page.
+	ptr = mmap(addr2, size, PROT_READ,
+		   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
+	assert(ptr == addr2);
+
+	// assign pkey to the first address.
+	ret = sys_mprotect_pkey(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
+				pkey);
+	assert(!ret);
+
+	// disable write through pkey.
+	pkey_access_deny(pkey);
+
+	ret = munmap(addr1, size);
+	assert(ret < 0);
+}
+
 void test_enforce_api(void)
 {
 	for (int i = 0; i < 2; i++) {
@@ -848,7 +1271,21 @@ void test_enforce_api(void)
 		test_mprotect_unaligned2(enforce);
 		test_mprotect_child_thread(enforce);
 		test_mprotect_gapped_address_with_two_pkeys(enforce);
+
+		test_munmap_single_address(enforce);
+		test_munmap_two_address_merge(enforce);
+		test_munmap_two_address_deny_second(enforce);
+		test_munmap_two_address_deny_range(enforce);
+		test_munmap_vma_middle_addr(enforce);
+		test_munmap_outbound_addr(enforce);
+		test_munmap_shrink(enforce);
+		test_munmap_unaligned(enforce);
+		test_munmap_unaligned2(enforce);
+		test_munmap_gapped_address_with_two_pkeys(enforce);
 	}
+
+	test_exit_munmap_disable_write();
+	test_exit_munmap_disable_all();
 }
 
 int main(void)
-- 
2.40.1.606.ga4b1b128d6-goog


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
                   ` (5 preceding siblings ...)
  2023-05-15 13:05 ` [PATCH 6/6] PKEY:selftest pkey_enforce_api for munmap jeffxu
@ 2023-05-15 14:28 ` Dave Hansen
  2023-05-16  7:06   ` Stephen Röttger
  2023-05-16 20:08 ` Kees Cook
  7 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-15 14:28 UTC (permalink / raw)
  To: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/15/23 06:05, jeffxu@chromium.org wrote:
> We're using PKU for in-process isolation to enforce control-flow integrity
> for a JIT compiler. In our threat model, an attacker exploits a 
> vulnerability and has arbitrary read/write access to the whole process
> space concurrently to other threads being executed. This attacker can
> manipulate some arguments to syscalls from some threads.

This all sounds like it hinges on the contents of PKRU in the attacker
thread.

Could you talk a bit about how the attacker is prevented from running
WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-15 14:28 ` [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 Dave Hansen
@ 2023-05-16  7:06   ` Stephen Röttger
  2023-05-16 22:41     ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Röttger @ 2023-05-16  7:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm, jeffxu,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

[-- Attachment #1: Type: text/plain, Size: 1249 bytes --]

On Mon, May 15, 2023 at 4:28 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> > We're using PKU for in-process isolation to enforce control-flow integrity
> > for a JIT compiler. In our threat model, an attacker exploits a
> > vulnerability and has arbitrary read/write access to the whole process
> > space concurrently to other threads being executed. This attacker can
> > manipulate some arguments to syscalls from some threads.
>
> This all sounds like it hinges on the contents of PKRU in the attacker
> thread.
>
> Could you talk a bit about how the attacker is prevented from running
> WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn?

(resending without html)

Since we're using the feature for control-flow integrity, we assume
the control-flow is still intact at this point. I.e. the attacker
thread can't run arbitrary instructions.
* For JIT code, we're going to scan it for wrpkru instructions before
writing it to executable memory
* For regular code, we only use wrpkru around short critical sections
to temporarily enable write access

Sigreturn is a separate problem that we hope to solve by adding pkey
support to sigaltstack

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap
  2023-05-15 13:05 ` [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap jeffxu
@ 2023-05-16 20:06   ` Kees Cook
  2023-05-16 22:24     ` Jeff Xu
  2023-05-16 23:23   ` Dave Hansen
  1 sibling, 1 reply; 44+ messages in thread
From: Kees Cook @ 2023-05-16 20:06 UTC (permalink / raw)
  To: jeffxu
  Cc: dave.hansen, luto, jorgelo, groeck, jannh, sroettger, akpm,
	jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Mon, May 15, 2023 at 01:05:51PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@google.com>
> 
> This patch enables PKEY_ENFORCE_API for the munmap
> syscall.
> 
> Signed-off-by: Jeff Xu<jeffxu@google.com>
> ---
>  include/linux/mm.h |  2 +-
>  mm/mmap.c          | 34 ++++++++++++++++++++++++++--------
>  mm/mremap.c        |  6 ++++--
>  3 files changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..48076e845d53 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3136,7 +3136,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
>  	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
>  extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  			 unsigned long start, size_t len, struct list_head *uf,
> -			 bool downgrade);
> +			 bool downgrade, bool syscall);

For type checking and readability, I suggest using an enum instead of
"bool". Perhaps something like:

enum caller_origin {
	ON_BEHALF_OF_KERNEL = 0,
	ON_BEHALF_OF_USERSPACE,
};

 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
-			 bool downgrade);
+			 bool downgrade, enum caller_origin called);

>  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
>  		     struct list_head *uf);
>  extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 13678edaa22c..29329aa794a6 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2498,6 +2498,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>   * @uf: The userfaultfd list_head
>   * @downgrade: set to true if the user wants to attempt to write_downgrade the
>   * mmap_lock
> + * @syscall: set to true if this is called from syscall entry
>   *
>   * This function takes a @mas that is either pointing to the previous VMA or set
>   * to MA_START and sets it up to remove the mapping(s).  The @len will be
> @@ -2507,7 +2508,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>   */
>  int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  		  unsigned long start, size_t len, struct list_head *uf,
> -		  bool downgrade)
> +		  bool downgrade, bool syscall)
>  {
>  	unsigned long end;
>  	struct vm_area_struct *vma;
> @@ -2519,6 +2520,19 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  	if (end == start)
>  		return -EINVAL;
>  
> +	/*
> +	 * When called by syscall from userspace, check if the calling
> +	 * thread has the PKEY permission to modify the memory mapping.
> +	 */
> +	if (syscall && arch_check_pkey_enforce_api(mm, start, end) < 0) {

	if (called == ON_BEHALF_OF_USERSPACE &&
	    arch_check_pkey_enforce_api(mm, start, end) < 0) {

> +		char comm[TASK_COMM_LEN];
> +
> +		pr_warn_ratelimited(
> +			"munmap was denied on PKEY_ENFORCE_API memory, pid=%d '%s'\n",
> +			task_pid_nr(current), get_task_comm(comm, current));
> +		return -EACCES;
> +	}
> +
>  	 /* arch_unmap() might do unmaps itself.  */
>  	arch_unmap(mm, start, end);
>  
> @@ -2541,7 +2555,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  {
>  	VMA_ITERATOR(vmi, mm, start);
>  
> -	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
> +	return do_vmi_munmap(&vmi, mm, start, len, uf, false, false);

+	return do_vmi_munmap(&vmi, mm, start, len, uf, false, ON_BEHALF_OF_KERNEL);

> [...]
>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>  {
>  	addr = untagged_addr(addr);
> -	return __vm_munmap(addr, len, true);
> +	return __vm_munmap(addr, len, true, true);

+	return __vm_munmap(addr, len, true, ON_BEHALF_OF_USERSPACE);

etc.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect
  2023-05-15 13:05 ` [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect jeffxu
@ 2023-05-16 20:07   ` Kees Cook
  2023-05-16 22:23     ` Jeff Xu
  2023-05-16 23:18   ` Dave Hansen
  1 sibling, 1 reply; 44+ messages in thread
From: Kees Cook @ 2023-05-16 20:07 UTC (permalink / raw)
  To: jeffxu
  Cc: dave.hansen, luto, jorgelo, groeck, jannh, sroettger, akpm,
	jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Mon, May 15, 2023 at 01:05:49PM +0000, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@google.com>
> 
> This patch enables PKEY_ENFORCE_API for the mprotect and
> mprotect_pkey syscalls.

All callers are from userspace -- this change looks like a no-op?

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
                   ` (6 preceding siblings ...)
  2023-05-15 14:28 ` [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 Dave Hansen
@ 2023-05-16 20:08 ` Kees Cook
  2023-05-16 22:17   ` Jeff Xu
  2023-05-17 10:49   ` Stephen Röttger
  7 siblings, 2 replies; 44+ messages in thread
From: Kees Cook @ 2023-05-16 20:08 UTC (permalink / raw)
  To: jeffxu
  Cc: dave.hansen, luto, jorgelo, groeck, jannh, sroettger, akpm,
	jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Mon, May 15, 2023 at 01:05:46PM +0000, jeffxu@chromium.org wrote:
> This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc()
> function. When a PKEY is created with this flag, it is enforced that any
> thread that wants to make changes to the memory mapping (such as mprotect)
> of the memory must have write access to the PKEY. PKEYs created without
> this flag will continue to work as they do now, for backwards 
> compatibility.
> 
> Only PKEY created from user space can have the new flag set, the PKEY
> allocated by the kernel internally will not have it. In other words,
> ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set,
> and continue work as today.

Cool! Yeah, this looks like it could become quite useful. I assume
V8 folks are on board with this API, etc?

> This set of patch covers mprotect/munmap, I plan to work on other 
> syscalls after this. 

Which ones are on your list currently?

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-16 20:08 ` Kees Cook
@ 2023-05-16 22:17   ` Jeff Xu
  2023-05-16 22:30     ` Dave Hansen
  2023-05-17 10:49   ` Stephen Röttger
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-05-16 22:17 UTC (permalink / raw)
  To: Kees Cook
  Cc: jeffxu, dave.hansen, luto, jorgelo, groeck, jannh, sroettger,
	akpm, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 1:08 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, May 15, 2023 at 01:05:46PM +0000, jeffxu@chromium.org wrote:
> > This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc()
> > function. When a PKEY is created with this flag, it is enforced that any
> > thread that wants to make changes to the memory mapping (such as mprotect)
> > of the memory must have write access to the PKEY. PKEYs created without
> > this flag will continue to work as they do now, for backwards
> > compatibility.
> >
> > Only PKEY created from user space can have the new flag set, the PKEY
> > allocated by the kernel internally will not have it. In other words,
> > ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set,
> > and continue work as today.
>
> Cool! Yeah, this looks like it could become quite useful. I assume
> V8 folks are on board with this API, etc?
>
> > This set of patch covers mprotect/munmap, I plan to work on other
> > syscalls after this.
>
> Which ones are on your list currently?
>
mprotect/mprotect_pkey/munmap
mmap/mremap
madvice,brk,sbrk

Thanks!
-Jeff Xu

> --
> Kees Cook

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect
  2023-05-16 20:07   ` Kees Cook
@ 2023-05-16 22:23     ` Jeff Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-16 22:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: jeffxu, dave.hansen, luto, jorgelo, groeck, jannh, sroettger,
	akpm, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 1:07 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, May 15, 2023 at 01:05:49PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@google.com>
> >
> > This patch enables PKEY_ENFORCE_API for the mprotect and
> > mprotect_pkey syscalls.
>
> All callers are from userspace -- this change looks like a no-op?
>
Yes. All callers are from user space now.
I am thinking about the future when someone adds a caller in kernel
code and may miss the check.
This is also consistent with munmap and other syscalls I plan to change.
There are comments on do_mprotect_pkey() to describe how this flag is used.


> -Kees
>
> --
> Kees Cook

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap
  2023-05-16 20:06   ` Kees Cook
@ 2023-05-16 22:24     ` Jeff Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-16 22:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: jeffxu, dave.hansen, luto, jorgelo, groeck, jannh, sroettger,
	akpm, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 1:13 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, May 15, 2023 at 01:05:51PM +0000, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@google.com>
> >
> > This patch enables PKEY_ENFORCE_API for the munmap
> > syscall.
> >
> > Signed-off-by: Jeff Xu<jeffxu@google.com>
> > ---
> >  include/linux/mm.h |  2 +-
> >  mm/mmap.c          | 34 ++++++++++++++++++++++++++--------
> >  mm/mremap.c        |  6 ++++--
> >  3 files changed, 31 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 27ce77080c79..48076e845d53 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3136,7 +3136,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
> >       unsigned long pgoff, unsigned long *populate, struct list_head *uf);
> >  extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> >                        unsigned long start, size_t len, struct list_head *uf,
> > -                      bool downgrade);
> > +                      bool downgrade, bool syscall);
>
> For type checking and readability, I suggest using an enum instead of
> "bool". Perhaps something like:
>
> enum caller_origin {
>         ON_BEHALF_OF_KERNEL = 0,
>         ON_BEHALF_OF_USERSPACE,
> };
>
Sure, it makes sense.


>  extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>                          unsigned long start, size_t len, struct list_head *uf,
> -                        bool downgrade);
> +                        bool downgrade, enum caller_origin called);
>
> >  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> >                    struct list_head *uf);
> >  extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 13678edaa22c..29329aa794a6 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2498,6 +2498,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >   * @uf: The userfaultfd list_head
> >   * @downgrade: set to true if the user wants to attempt to write_downgrade the
> >   * mmap_lock
> > + * @syscall: set to true if this is called from syscall entry
> >   *
> >   * This function takes a @mas that is either pointing to the previous VMA or set
> >   * to MA_START and sets it up to remove the mapping(s).  The @len will be
> > @@ -2507,7 +2508,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >   */
> >  int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> >                 unsigned long start, size_t len, struct list_head *uf,
> > -               bool downgrade)
> > +               bool downgrade, bool syscall)
> >  {
> >       unsigned long end;
> >       struct vm_area_struct *vma;
> > @@ -2519,6 +2520,19 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> >       if (end == start)
> >               return -EINVAL;
> >
> > +     /*
> > +      * When called by syscall from userspace, check if the calling
> > +      * thread has the PKEY permission to modify the memory mapping.
> > +      */
> > +     if (syscall && arch_check_pkey_enforce_api(mm, start, end) < 0) {
>
>         if (called == ON_BEHALF_OF_USERSPACE &&
>             arch_check_pkey_enforce_api(mm, start, end) < 0) {
>
> > +             char comm[TASK_COMM_LEN];
> > +
> > +             pr_warn_ratelimited(
> > +                     "munmap was denied on PKEY_ENFORCE_API memory, pid=%d '%s'\n",
> > +                     task_pid_nr(current), get_task_comm(comm, current));
> > +             return -EACCES;
> > +     }
> > +
> >        /* arch_unmap() might do unmaps itself.  */
> >       arch_unmap(mm, start, end);
> >
> > @@ -2541,7 +2555,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> >  {
> >       VMA_ITERATOR(vmi, mm, start);
> >
> > -     return do_vmi_munmap(&vmi, mm, start, len, uf, false);
> > +     return do_vmi_munmap(&vmi, mm, start, len, uf, false, false);
>
> +       return do_vmi_munmap(&vmi, mm, start, len, uf, false, ON_BEHALF_OF_KERNEL);
>
> > [...]
> >  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
> >  {
> >       addr = untagged_addr(addr);
> > -     return __vm_munmap(addr, len, true);
> > +     return __vm_munmap(addr, len, true, true);
>
> +       return __vm_munmap(addr, len, true, ON_BEHALF_OF_USERSPACE);
>
> etc.
>
> --
> Kees Cook

Thanks!
-Jeff Xu

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-16 22:17   ` Jeff Xu
@ 2023-05-16 22:30     ` Dave Hansen
  2023-05-16 23:39       ` Jeff Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-16 22:30 UTC (permalink / raw)
  To: Jeff Xu, Kees Cook
  Cc: jeffxu, luto, jorgelo, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/16/23 15:17, Jeff Xu wrote:
>>> This set of patch covers mprotect/munmap, I plan to work on other
>>> syscalls after this.
>> Which ones are on your list currently?
>>
> mprotect/mprotect_pkey/munmap
> mmap/mremap
> madvice,brk,sbrk

What about pkey_free()?

Without that, someone can presumably free the pkey and then reallocate
it without PKEY_ENFORCE_API.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-16  7:06   ` Stephen Röttger
@ 2023-05-16 22:41     ` Dave Hansen
  2023-05-17 10:51       ` Stephen Röttger
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-16 22:41 UTC (permalink / raw)
  To: Stephen Röttger
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm, jeffxu,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/16/23 00:06, Stephen Röttger wrote:
> On Mon, May 15, 2023 at 4:28 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 5/15/23 06:05, jeffxu@chromium.org wrote:
>>> We're using PKU for in-process isolation to enforce control-flow integrity
>>> for a JIT compiler. In our threat model, an attacker exploits a
>>> vulnerability and has arbitrary read/write access to the whole process
>>> space concurrently to other threads being executed. This attacker can
>>> manipulate some arguments to syscalls from some threads.
>>
>> This all sounds like it hinges on the contents of PKRU in the attacker
>> thread.
>>
>> Could you talk a bit about how the attacker is prevented from running
>> WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn?
> 
> (resending without html)
> 
> Since we're using the feature for control-flow integrity, we assume
> the control-flow is still intact at this point. I.e. the attacker
> thread can't run arbitrary instructions.

Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls?

> * For JIT code, we're going to scan it for wrpkru instructions before
> writing it to executable memory

... and XRSTOR, right?

> * For regular code, we only use wrpkru around short critical sections
> to temporarily enable write access
> 
> Sigreturn is a separate problem that we hope to solve by adding pkey
> support to sigaltstack

What kind of support were you planning to add?

I was thinking that an attacker with arbitrary write access would wait
until PKRU was on the userspace stack and *JUST* before the kernel
sigreturn code restores it to write a malicious value.  It could
presumably do this with some asynchronous mechanism so that even if
there was only one attacker thread, it could change its own value.

Also, the kernel side respect for PKRU is ... well ... rather weak.
It's a best effort and if we *happen* to be in a kernel context where
PKRU is relevant, we can try to respect PKRU.  But there are a whole
bunch of things like get_user_pages_remote() that just plain don't have
PKRU available and can't respect it at all.

I think io_uring also greatly expanded how common "remote" access to
process memory is.

So, overall, I'm thrilled to see another potential user for pkeys.  It
sounds like there's an actual user lined up here, which would be
wonderful.  But, I also want to make sure we don't go to the trouble to
build something that doesn't actually present meaningful, durable
obstacles to an attacker.

I also haven't more than glanced at the code.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag
  2023-05-15 13:05 ` [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag jeffxu
@ 2023-05-16 23:14   ` Dave Hansen
  2023-05-16 23:55     ` Jeff Xu
  2023-05-17 11:07     ` Stephen Röttger
  0 siblings, 2 replies; 44+ messages in thread
From: Dave Hansen @ 2023-05-16 23:14 UTC (permalink / raw)
  To: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/15/23 06:05, jeffxu@chromium.org wrote:
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -20,7 +20,7 @@ int __execute_only_pkey(struct mm_struct *mm)
>  	/* Do we need to assign a pkey for mm's execute-only maps? */
>  	if (execute_only_pkey == -1) {
>  		/* Go allocate one to use, which might fail */
> -		execute_only_pkey = mm_pkey_alloc(mm);
> +		execute_only_pkey = mm_pkey_alloc(mm, 0);
>  		if (execute_only_pkey < 0)
>  			return -1;
>  		need_to_set_mm_pkey = true;

In your threat model, what mechanism prevents the attacker from
modifying executable mappings?

I was trying to figure out if the implicit execute-only pkey should have
the PKEY_ENFORCE_API bit set.  I think that in particular would probably
cause some kind of ABI breakage, but it still reminded me that I have an
incomplete picture of the threat model.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect
  2023-05-15 13:05 ` [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect jeffxu
  2023-05-16 20:07   ` Kees Cook
@ 2023-05-16 23:18   ` Dave Hansen
  2023-05-16 23:36     ` Jeff Xu
  1 sibling, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-16 23:18 UTC (permalink / raw)
  To: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/15/23 06:05, jeffxu@chromium.org wrote:
>  /*
>   * pkey==-1 when doing a legacy mprotect()
> + * syscall==true if this is called by syscall from userspace.
> + * Note: this is always true for now, added as a reminder in case that
> + * do_mprotect_pkey is called directly by kernel in the future.
> + * Also it is consistent with __do_munmap().
>   */
>  static int do_mprotect_pkey(unsigned long start, size_t len,
> -		unsigned long prot, int pkey)
> +		unsigned long prot, int pkey, bool syscall)
>  {

The 'syscall' seems kinda silly (and a bit confusing).  It's easy to
check if the caller is a kthread or has a current->mm==NULL.  If you
*really* want a warning, I'd check for those rather than plumb a
apparently unused argument in here.

BTW, this warning is one of those things that will probably cause some
amount of angst.  I'd move it to the end of the series or just axe it
completely.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap
  2023-05-15 13:05 ` [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap jeffxu
  2023-05-16 20:06   ` Kees Cook
@ 2023-05-16 23:23   ` Dave Hansen
  2023-05-17  0:08     ` Jeff Xu
  1 sibling, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-16 23:23 UTC (permalink / raw)
  To: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/15/23 06:05, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@google.com>
> 
> This patch enables PKEY_ENFORCE_API for the munmap
> syscall.

The basic problem here is how we know when the set of syscalls that are
patched here is good enough and how we catch future functionality that
might need to be captured as well.

This mechanism really needs to be able to defend against *any* changes
to the address space.  I assume that folks are using syscall filtering
to prevent new syscalls from causing havoc, but is there anything that
can be done for, say, things like madvise()?  I bet it was harmless for
a long time until MADV_DONTNEED showed up and made it able to
effectively zero memory.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect
  2023-05-16 23:18   ` Dave Hansen
@ 2023-05-16 23:36     ` Jeff Xu
  2023-05-17  4:50       ` Jeff Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-05-16 23:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 4:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> >  /*
> >   * pkey==-1 when doing a legacy mprotect()
> > + * syscall==true if this is called by syscall from userspace.
> > + * Note: this is always true for now, added as a reminder in case that
> > + * do_mprotect_pkey is called directly by kernel in the future.
> > + * Also it is consistent with __do_munmap().
> >   */
> >  static int do_mprotect_pkey(unsigned long start, size_t len,
> > -             unsigned long prot, int pkey)
> > +             unsigned long prot, int pkey, bool syscall)
> >  {
>
> The 'syscall' seems kinda silly (and a bit confusing).  It's easy to
> check if the caller is a kthread or has a current->mm==NULL.  If you
> *really* want a warning, I'd check for those rather than plumb a
> apparently unused argument in here.
>
> BTW, this warning is one of those things that will probably cause some
> amount of angst.  I'd move it to the end of the series or just axe it
> completely.

Agreed. syscall is not a good name here.
The intention is to check this at the system call entry point
For example, munmap can get called inside mremap(), but by that time
mremap() should already check that all the memory is writeable.

I will remove "syscall" from do_mprotect_pkey signature, it seems it caused
more confusion than helpful.  I will keep the comments/note in place to remind
future developer.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-16 22:30     ` Dave Hansen
@ 2023-05-16 23:39       ` Jeff Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-16 23:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kees Cook, jeffxu, luto, jorgelo, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 3:30 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/16/23 15:17, Jeff Xu wrote:
> >>> This set of patch covers mprotect/munmap, I plan to work on other
> >>> syscalls after this.
> >> Which ones are on your list currently?
> >>
> > mprotect/mprotect_pkey/munmap
> > mmap/mremap
> > madvice,brk,sbrk
>
> What about pkey_free()?
>
> Without that, someone can presumably free the pkey and then reallocate
> it without PKEY_ENFORCE_API.
>
Great catch. I will add it to the list.
Thanks!
-Jeff Xu

>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag
  2023-05-16 23:14   ` Dave Hansen
@ 2023-05-16 23:55     ` Jeff Xu
  2023-05-17 11:07     ` Stephen Röttger
  1 sibling, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-16 23:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 4:14 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -20,7 +20,7 @@ int __execute_only_pkey(struct mm_struct *mm)
> >       /* Do we need to assign a pkey for mm's execute-only maps? */
> >       if (execute_only_pkey == -1) {
> >               /* Go allocate one to use, which might fail */
> > -             execute_only_pkey = mm_pkey_alloc(mm);
> > +             execute_only_pkey = mm_pkey_alloc(mm, 0);
> >               if (execute_only_pkey < 0)
> >                       return -1;
> >               need_to_set_mm_pkey = true;
>
> In your threat model, what mechanism prevents the attacker from
> modifying executable mappings?
>
> I was trying to figure out if the implicit execute-only pkey should have
> the PKEY_ENFORCE_API bit set.  I think that in particular would probably
> cause some kind of ABI breakage, but it still reminded me that I have an
> incomplete picture of the threat model.
Yes. The main reason for not adding it now is the ABI breakage.
As a next step,  we could potentially develop mseal(), which fits more
to the code segment.
The PKEY_ENFORCE_API allows munmap(), so the user case is slightly different.

I will leave the threat model / V8 specific question to Stephan.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap
  2023-05-16 23:23   ` Dave Hansen
@ 2023-05-17  0:08     ` Jeff Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-17  0:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 4:24 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@google.com>
> >
> > This patch enables PKEY_ENFORCE_API for the munmap
> > syscall.
>
> The basic problem here is how we know when the set of syscalls that are
> patched here is good enough and how we catch future functionality that
> might need to be captured as well.
>
> This mechanism really needs to be able to defend against *any* changes
> to the address space.  I assume that folks are using syscall filtering
> to prevent new syscalls from causing havoc, but is there anything that
> can be done for, say, things like madvise()?  I bet it was harmless for
> a long time until MADV_DONTNEED showed up and made it able to
> effectively zero memory.

Not any change, just a limited set of syscall from user space.
I think it is reasonable to hope that any kind of syscall ABI change that
affects VMA will get reviewed thoroughly from now on.

Also, if we continue to add mseal() to the kernel, we will have to pay more
attention to syscalls related to VMA.

Thanks
-Jeff Xu

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect
  2023-05-16 23:36     ` Jeff Xu
@ 2023-05-17  4:50       ` Jeff Xu
  0 siblings, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-17  4:50 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Dave Hansen, luto, jorgelo, keescook, groeck, jannh, sroettger,
	akpm, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Tue, May 16, 2023 at 4:37 PM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, May 16, 2023 at 4:19 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 5/15/23 06:05, jeffxu@chromium.org wrote:
> > >  /*
> > >   * pkey==-1 when doing a legacy mprotect()
> > > + * syscall==true if this is called by syscall from userspace.
> > > + * Note: this is always true for now, added as a reminder in case that
> > > + * do_mprotect_pkey is called directly by kernel in the future.
> > > + * Also it is consistent with __do_munmap().
> > >   */
> > >  static int do_mprotect_pkey(unsigned long start, size_t len,
> > > -             unsigned long prot, int pkey)
> > > +             unsigned long prot, int pkey, bool syscall)
> > >  {
> >
> > The 'syscall' seems kinda silly (and a bit confusing).  It's easy to
> > check if the caller is a kthread or has a current->mm==NULL.  If you
> > *really* want a warning, I'd check for those rather than plumb a
> > apparently unused argument in here.
> >
> > BTW, this warning is one of those things that will probably cause some
> > amount of angst.  I'd move it to the end of the series or just axe it
> > completely.
>
Okay, I will move the logging part to the end of the series.


> Agreed. syscall is not a good name here.
> The intention is to check this at the system call entry point
> For example, munmap can get called inside mremap(), but by that time
> mremap() should already check that all the memory is writeable.
>
> I will remove "syscall" from do_mprotect_pkey signature, it seems it caused
> more confusion than helpful.  I will keep the comments/note in place to remind
> future developer.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-16 20:08 ` Kees Cook
  2023-05-16 22:17   ` Jeff Xu
@ 2023-05-17 10:49   ` Stephen Röttger
  1 sibling, 0 replies; 44+ messages in thread
From: Stephen Röttger @ 2023-05-17 10:49 UTC (permalink / raw)
  To: Kees Cook
  Cc: jeffxu, dave.hansen, luto, jorgelo, groeck, jannh, akpm, jeffxu,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

[-- Attachment #1: Type: text/plain, Size: 1167 bytes --]

On Tue, May 16, 2023 at 10:08 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, May 15, 2023 at 01:05:46PM +0000, jeffxu@chromium.org wrote:
> > This patch introduces a new flag, PKEY_ENFORCE_API, to the pkey_alloc()
> > function. When a PKEY is created with this flag, it is enforced that any
> > thread that wants to make changes to the memory mapping (such as mprotect)
> > of the memory must have write access to the PKEY. PKEYs created without
> > this flag will continue to work as they do now, for backwards
> > compatibility.
> >
> > Only PKEY created from user space can have the new flag set, the PKEY
> > allocated by the kernel internally will not have it. In other words,
> > ARCH_DEFAULT_PKEY(0) and execute_only_pkey won’t have this flag set,
> > and continue work as today.
>
> Cool! Yeah, this looks like it could become quite useful. I assume
> V8 folks are on board with this API, etc?

Yes! (I'm from the v8 team driving the implementation on v8 side)

> > This set of patch covers mprotect/munmap, I plan to work on other
> > syscalls after this.
>
> Which ones are on your list currently?
>
> --
> Kees Cook

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-16 22:41     ` Dave Hansen
@ 2023-05-17 10:51       ` Stephen Röttger
  2023-05-17 15:07         ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Stephen Röttger @ 2023-05-17 10:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm, jeffxu,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

[-- Attachment #1: Type: text/plain, Size: 4771 bytes --]

On Wed, May 17, 2023 at 12:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/16/23 00:06, Stephen Röttger wrote:
> > On Mon, May 15, 2023 at 4:28 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> >>> We're using PKU for in-process isolation to enforce control-flow integrity
> >>> for a JIT compiler. In our threat model, an attacker exploits a
> >>> vulnerability and has arbitrary read/write access to the whole process
> >>> space concurrently to other threads being executed. This attacker can
> >>> manipulate some arguments to syscalls from some threads.
> >>
> >> This all sounds like it hinges on the contents of PKRU in the attacker
> >> thread.
> >>
> >> Could you talk a bit about how the attacker is prevented from running
> >> WRPKRU, XRSTOR or compelling the kernel to write to PKRU like at sigreturn?
> >
> > (resending without html)
> >
> > Since we're using the feature for control-flow integrity, we assume
> > the control-flow is still intact at this point. I.e. the attacker
> > thread can't run arbitrary instructions.
>
> Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls?

The threat model is that the attacker has arbitrary read/write, while other
threads run in parallel. So whenever a regular thread performs a syscall and
takes a syscall argument from memory, we assume that argument can be attacker
controlled.
Unfortunately, the line is a bit blurry which syscalls / syscall arguments we
need to assume to be attacker controlled. We're trying to approach this by
roughly categorizing syscalls+args:
* how commonly used is the syscall
* do we expect the argument to be taken from writable memory
* can we restrict the syscall+args with seccomp
* how difficult is it to restrict the syscall in userspace vs kernel
* does the syscall affect our protections (e.g. change control-flow or pkey)

Using munmap as an example:
* it's a very common syscall (nearly every seccomp filter will allow munmap)
* the addr argument will come from memory
* unmapping pkey-tagged pages breaks our assumptions
* it's hard to restrict in userspace since we'd need to keep track of all
  address ranges that are unsafe to unmap and hook the syscall to perform the
  validation on every call in the codebase.
* it's easy to validate in kernel with this patch

For most other syscalls, they either don't affect the control-flow, are easy to
avoid and block with seccomp or we can add validation in userspace (e.g. only
install signal handlers at program startup).

> > * For JIT code, we're going to scan it for wrpkru instructions before
> > writing it to executable memory
>
> ... and XRSTOR, right?

Right. We’ll just have a list of allowed instructions that the JIT compiler can
emit.

>
> > * For regular code, we only use wrpkru around short critical sections
> > to temporarily enable write access
> >
> > Sigreturn is a separate problem that we hope to solve by adding pkey
> > support to sigaltstack
>
> What kind of support were you planning to add?

We’d like to allow registering pkey-tagged memory as a sigaltstack. This would
allow the signal handler to run isolated from other threads. Right now, the
main reason this doesn’t work is that the kernel would need to change the pkru
state before storing the register state on the stack.

> I was thinking that an attacker with arbitrary write access would wait
> until PKRU was on the userspace stack and *JUST* before the kernel
> sigreturn code restores it to write a malicious value.  It could
> presumably do this with some asynchronous mechanism so that even if
> there was only one attacker thread, it could change its own value.

I’m not sure I follow the details, can you give an example of an asynchronous
mechanism to do this? E.g. would this be the kernel writing to the memory in a
syscall for example?

> Also, the kernel side respect for PKRU is ... well ... rather weak.
> It's a best effort and if we *happen* to be in a kernel context where
> PKRU is relevant, we can try to respect PKRU.  But there are a whole
> bunch of things like get_user_pages_remote() that just plain don't have
> PKRU available and can't respect it at all.
>
> I think io_uring also greatly expanded how common "remote" access to
> process memory is.
>
> So, overall, I'm thrilled to see another potential user for pkeys.  It
> sounds like there's an actual user lined up here, which would be
> wonderful.  But, I also want to make sure we don't go to the trouble to
> build something that doesn't actually present meaningful, durable
> obstacles to an attacker.
>
> I also haven't more than glanced at the code.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag
  2023-05-16 23:14   ` Dave Hansen
  2023-05-16 23:55     ` Jeff Xu
@ 2023-05-17 11:07     ` Stephen Röttger
  1 sibling, 0 replies; 44+ messages in thread
From: Stephen Röttger @ 2023-05-17 11:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm, jeffxu,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

[-- Attachment #1: Type: text/plain, Size: 1744 bytes --]

On Wed, May 17, 2023 at 1:14 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -20,7 +20,7 @@ int __execute_only_pkey(struct mm_struct *mm)
> >       /* Do we need to assign a pkey for mm's execute-only maps? */
> >       if (execute_only_pkey == -1) {
> >               /* Go allocate one to use, which might fail */
> > -             execute_only_pkey = mm_pkey_alloc(mm);
> > +             execute_only_pkey = mm_pkey_alloc(mm, 0);
> >               if (execute_only_pkey < 0)
> >                       return -1;
> >               need_to_set_mm_pkey = true;
>
> In your threat model, what mechanism prevents the attacker from
> modifying executable mappings?

There are different options how we can address this:
1) having a generic mseal() API as Jeff mentioned
2) tagging all code pages with the pkey we're using
    (would this affect memory sharing between processes?)
3) prevent this with seccomp + userspace validation
If we have pkey support, we will only create executable memory using
pkey_mprotect and everything else can be blocked with seccomp. This would still
allow turning R-X memory into RW- memory, but you can't change it back without
going through our codepath that has added validation.

There's a similar challenge with making RO memory writable. For this we'll need
to use approach 1) or 2) instead.

> I was trying to figure out if the implicit execute-only pkey should have
> the PKEY_ENFORCE_API bit set.  I think that in particular would probably
> cause some kind of ABI breakage, but it still reminded me that I have an
> incomplete picture of the threat model.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-17 10:51       ` Stephen Röttger
@ 2023-05-17 15:07         ` Dave Hansen
  2023-05-17 15:21           ` Jeff Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-17 15:07 UTC (permalink / raw)
  To: Stephen Röttger
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm, jeffxu,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/17/23 03:51, Stephen Röttger wrote:
> On Wed, May 17, 2023 at 12:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls?
> 
> The threat model is that the attacker has arbitrary read/write, while other
> threads run in parallel. So whenever a regular thread performs a syscall and
> takes a syscall argument from memory, we assume that argument can be attacker
> controlled.
> Unfortunately, the line is a bit blurry which syscalls / syscall arguments we
> need to assume to be attacker controlled. 

Ahh, OK.  So, it's not that the *attacker* can make arbitrary syscalls.
It's that the attacker might leverage its arbitrary write to trick a
victim thread into turning what would otherwise be a good syscall into a
bad one with attacker-controlled content.

I guess that makes the readv/writev-style of things a bad idea in this
environment.

>>> Sigreturn is a separate problem that we hope to solve by adding pkey
>>> support to sigaltstack
>>
>> What kind of support were you planning to add?
> 
> We’d like to allow registering pkey-tagged memory as a sigaltstack. This would
> allow the signal handler to run isolated from other threads. Right now, the
> main reason this doesn’t work is that the kernel would need to change the pkru
> state before storing the register state on the stack.
> 
>> I was thinking that an attacker with arbitrary write access would wait
>> until PKRU was on the userspace stack and *JUST* before the kernel
>> sigreturn code restores it to write a malicious value.  It could
>> presumably do this with some asynchronous mechanism so that even if
>> there was only one attacker thread, it could change its own value.
> 
> I’m not sure I follow the details, can you give an example of an asynchronous
> mechanism to do this? E.g. would this be the kernel writing to the memory in a
> syscall for example?

I was thinking of all of the IORING_OP_*'s that can write to memory or
aio(7).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-17 15:07         ` Dave Hansen
@ 2023-05-17 15:21           ` Jeff Xu
  2023-05-17 15:29             ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-05-17 15:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Stephen Röttger, jeffxu, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

On Wed, May 17, 2023 at 8:07 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/17/23 03:51, Stephen Röttger wrote:
> > On Wed, May 17, 2023 at 12:41 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> Can't run arbitrary instructions, but can make (pretty) arbitrary syscalls?
> >
> > The threat model is that the attacker has arbitrary read/write, while other
> > threads run in parallel. So whenever a regular thread performs a syscall and
> > takes a syscall argument from memory, we assume that argument can be attacker
> > controlled.
> > Unfortunately, the line is a bit blurry which syscalls / syscall arguments we
> > need to assume to be attacker controlled.
>
> Ahh, OK.  So, it's not that the *attacker* can make arbitrary syscalls.
> It's that the attacker might leverage its arbitrary write to trick a
> victim thread into turning what would otherwise be a good syscall into a
> bad one with attacker-controlled content.
>
> I guess that makes the readv/writev-style of things a bad idea in this
> environment.
>
> >>> Sigreturn is a separate problem that we hope to solve by adding pkey
> >>> support to sigaltstack
> >>
> >> What kind of support were you planning to add?
> >
> > We’d like to allow registering pkey-tagged memory as a sigaltstack. This would
> > allow the signal handler to run isolated from other threads. Right now, the
> > main reason this doesn’t work is that the kernel would need to change the pkru
> > state before storing the register state on the stack.
> >
> >> I was thinking that an attacker with arbitrary write access would wait
> >> until PKRU was on the userspace stack and *JUST* before the kernel
> >> sigreturn code restores it to write a malicious value.  It could
> >> presumably do this with some asynchronous mechanism so that even if
> >> there was only one attacker thread, it could change its own value.
> >
> > I’m not sure I follow the details, can you give an example of an asynchronous
> > mechanism to do this? E.g. would this be the kernel writing to the memory in a
> > syscall for example?
>
> I was thinking of all of the IORING_OP_*'s that can write to memory or
> aio(7).

IORING is challenging from security perspectives, for now, it is
disabled in ChromeOS.
Though I'm not sure how aio is related ?

Thanks
-Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-17 15:21           ` Jeff Xu
@ 2023-05-17 15:29             ` Dave Hansen
  2023-05-17 23:48               ` Jeff Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-17 15:29 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Stephen Röttger, jeffxu, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

On 5/17/23 08:21, Jeff Xu wrote:
>>> I’m not sure I follow the details, can you give an example of an asynchronous
>>> mechanism to do this? E.g. would this be the kernel writing to the memory in a
>>> syscall for example?
>> I was thinking of all of the IORING_OP_*'s that can write to memory or
>> aio(7).
> IORING is challenging from security perspectives, for now, it is 
> disabled in ChromeOS. Though I'm not sure how aio is related ?

Let's say you're the attacking thread and you're the *only* attacking
thread.  You have three things at your disposal:

 1. A benign thread doing aio_read()
 2. An arbitrary write primitive
 3. You can send signals to yourself
 4. You can calculate where your signal stack will be

You calculate the address of PKRU on the future signal stack.  You then
leverage the otherwise benign aio_write() to write a 0 to that PKRU
location.  Then, send a signal to yourself.  The attacker's PKRU value
will be written to the stack.  If you can time it right, the AIO will
complete while the signal handler is in progress and PKRU is on the
stack.  On sigreturn, the kernel restores the aio_read()-placed,
attacker-provided PKRU value.  Now the attacker has PKRU==0.  It
effectively build a WRPKRU primitive out of those other pieces.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-17 15:29             ` Dave Hansen
@ 2023-05-17 23:48               ` Jeff Xu
  2023-05-18 15:37                 ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-05-17 23:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Stephen Röttger, jeffxu, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

On Wed, May 17, 2023 at 8:29 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/17/23 08:21, Jeff Xu wrote:
> >>> I’m not sure I follow the details, can you give an example of an asynchronous
> >>> mechanism to do this? E.g. would this be the kernel writing to the memory in a
> >>> syscall for example?
> >> I was thinking of all of the IORING_OP_*'s that can write to memory or
> >> aio(7).
> > IORING is challenging from security perspectives, for now, it is
> > disabled in ChromeOS. Though I'm not sure how aio is related ?
>
> Let's say you're the attacking thread and you're the *only* attacking
> thread.  You have three things at your disposal:
>
>  1. A benign thread doing aio_read()
>  2. An arbitrary write primitive
>  3. You can send signals to yourself
>  4. You can calculate where your signal stack will be
>
> You calculate the address of PKRU on the future signal stack.  You then
> leverage the otherwise benign aio_write() to write a 0 to that PKRU
> location.  Then, send a signal to yourself.  The attacker's PKRU value
> will be written to the stack.  If you can time it right, the AIO will
> complete while the signal handler is in progress and PKRU is on the
> stack.  On sigreturn, the kernel restores the aio_read()-placed,
> attacker-provided PKRU value.  Now the attacker has PKRU==0.  It
> effectively build a WRPKRU primitive out of those other pieces.
>
>
Ah, I understand the question now, thanks for the explanation.
Signalling handling is the next project that I will be working on.

I'm leaning towards saving PKRU register to the thread struct, similar
to how context switch works. This will address the attack scenario you
described.
However, there are a few challenges I have not yet worked through.
First, the code needs to track when the first signaling entry occurs
(saving the PKRU register to the thread struct) and when it is last
returned (restoring the PKRU register from the thread struct). One way
to do this would be to add another member to the thread struct to
track the level of signaling re-entry. Second, signal is used in
error handling, including the kernel's own signaling handling code
path. I haven't worked through this part of code logic completely.

If the first approach is too complicated or considered intrusive,  I
could take a different approach. In this approach, I would not track
signaling re-entry. Instead, I would modify the PKRU saved in AltStack
during handling of the signal, the steps are:
a> save PKRU to tmp variable.
b> modify PKRU to allow writing to the PKEY protected AltStack
c> XSAVE.
d> write tmp to the memory address of PKRU in  AltStack at the
correct offset.
Since the thread's PKRU is saved to stack, XRSTOR will restore the
thread's original PKRU during sigreturn in normal situations. This
approach might be a little hacky because it overwrites XSAVE results.
If we go with this route, I need someone's help on the overwriting
function, it is CPU specific.
However this approach will not work if an attacker can install its own
signaling handling (therefore gains the ability to overwrite PKRU stored
in stack, as you described), the application will want to install all the
signaling handling with PKEY protected AltStack at startup time, and
disallow additional signaling handling after that, this is programmatically
achievable in V8, as Stephan mentioned.

I would appreciate getting more comments in the signaling handling
area on those two approaches, or are there  better ways to do what we
want? Do you think we could continue signaling handling discussion
from the original thread that Kees started [1] ? There were already
lots of discussions there about signalling handling,  so it will be
easier for future readers to understand the context. I can repost
there. Or I can start a new thread for signaling handling, I'm
worried that those discussions will get lengthy and context get lost
with patch version update.

Although the signaling handling project is related,  I think VMA
protection using the PKRU project can stand on its own. We could solve
this for V8 first then move next to Signaling handling, the work here
could also pave the way to add mseal() in future, I expect lots of
code logic will be similar.

[1] https://lore.kernel.org/all/202208221331.71C50A6F@keescook/

Thanks!
Best regards,
-Jeff Xu








On Wed, May 17, 2023 at 8:29 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/17/23 08:21, Jeff Xu wrote:
> >>> I’m not sure I follow the details, can you give an example of an asynchronous
> >>> mechanism to do this? E.g. would this be the kernel writing to the memory in a
> >>> syscall for example?
> >> I was thinking of all of the IORING_OP_*'s that can write to memory or
> >> aio(7).
> > IORING is challenging from security perspectives, for now, it is
> > disabled in ChromeOS. Though I'm not sure how aio is related ?
>
> Let's say you're the attacking thread and you're the *only* attacking
> thread.  You have three things at your disposal:
>
>  1. A benign thread doing aio_read()
>  2. An arbitrary write primitive
>  3. You can send signals to yourself
>  4. You can calculate where your signal stack will be
>
> You calculate the address of PKRU on the future signal stack.  You then
> leverage the otherwise benign aio_write() to write a 0 to that PKRU
> location.  Then, send a signal to yourself.  The attacker's PKRU value
> will be written to the stack.  If you can time it right, the AIO will
> complete while the signal handler is in progress and PKRU is on the
> stack.  On sigreturn, the kernel restores the aio_read()-placed,
> attacker-provided PKRU value.  Now the attacker has PKRU==0.  It
> effectively build a WRPKRU primitive out of those other pieces.
>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-17 23:48               ` Jeff Xu
@ 2023-05-18 15:37                 ` Dave Hansen
  2023-05-18 20:20                   ` Jeff Xu
  2023-05-31 23:02                   ` Jeff Xu
  0 siblings, 2 replies; 44+ messages in thread
From: Dave Hansen @ 2023-05-18 15:37 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Stephen Röttger, jeffxu, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

On 5/17/23 16:48, Jeff Xu wrote:
> However, there are a few challenges I have not yet worked through.
> First, the code needs to track when the first signaling entry occurs
> (saving the PKRU register to the thread struct) and when it is last
> returned (restoring the PKRU register from the thread struct). 

Would tracking signal "depth" work in the face of things like siglongjmp?

Taking a step back...

Here's my concern about this whole thing: it's headed down a rabbit hole
which is *highly* specialized both in the apps that will use it and the
attacks it will mitigate.  It probably *requires* turning off a bunch of
syscalls (like io_uring) that folks kinda like in general.

We're balancing that highly specialized mitigation with a feature that
add new ABI, touches core memory management code and signal handling.

On the x86 side, PKRU is a painfully special snowflake.  It's exposed in
the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
This would be making it an even more special snowflake because it would
need new altstack ABI and handling.

I'm just not sure the gain is worth the pain.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-18 15:37                 ` Dave Hansen
@ 2023-05-18 20:20                   ` Jeff Xu
  2023-05-18 21:04                     ` Dave Hansen
  2023-05-31 23:02                   ` Jeff Xu
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-05-18 20:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Stephen Röttger, jeffxu, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

Hello Dave,

Thanks for your email.

On Thu, May 18, 2023 at 8:38 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/17/23 16:48, Jeff Xu wrote:
> > However, there are a few challenges I have not yet worked through.
> > First, the code needs to track when the first signaling entry occurs
> > (saving the PKRU register to the thread struct) and when it is last
> > returned (restoring the PKRU register from the thread struct).
>
> Would tracking signal "depth" work in the face of things like siglongjmp?
>
Thank you for your question! I am eager to learn more about this area
and I worry about blind spots. I will investigate and get back to you.

> Taking a step back...
>
> Here's my concern about this whole thing: it's headed down a rabbit hole
> which is *highly* specialized both in the apps that will use it and the
> attacks it will mitigate.  It probably *requires* turning off a bunch of
> syscalls (like io_uring) that folks kinda like in general.
>
ChromeOS currently disabled io_uring, but it is not required to do so.
io_uring supports the IORING_OP_MADVICE operation, which calls the
do_madvise() function. This means that io_uring will have the same
pkey checks as the madvice() system call.  From that perspective, we
will fully support io_uring for this feature.

> We're balancing that highly specialized mitigation with a feature that
> add new ABI, touches core memory management code and signal handling.
>
The ABI change uses the existing flag field in pkey_alloc() which is
reserved. The implementation is backward compatible with all existing
pkey usages in both kernel and user space.  Or do you have other
concerns about ABI in mind ?

Yes, you are right about the risk of touching core mm code. To
minimize the risk, I try to control the scope of the change (it is
about 3 lines in mprotect, more in munmap but really just 3 effective
lines from syscall entry). I added new self-tests in mm to make sure
it doesn't regress in api behavior. I run those tests before and after
my kernel code change to make sure the behavior remains the same, I
tested it on 5.15 and 6.1 and 6.4-rc1.  Actually, the testing
discovered a behavior change for mprotect() between 6.1 and 6.4  (not
from this patch, there are refactoring works going on in mm) see this
thread [1]
I hope those steps will help to mitigate the risk.

Agreed on signaling handling is a tough part: what do you think about
the approach (modifying PKRU from saved stack after XSAVE), is there a
blocker ?

> On the x86 side, PKRU is a painfully special snowflake.  It's exposed in
> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
> This would be making it an even more special snowflake because it would

I admit I'm quite ignorant on XSAVE  to understand the above
statement, and how that is related. Could you explain it to me please
? And what is in your mind that might improve the situation ?

> need new altstack ABI and handling.
>
I thought adding protected memory support to signaling handling is an
independent project with its own weight. As Jann Horn points out in
[2]:  "we could prevent the attacker from corrupting the signal
context if we can protect the signal stack with a pkey."   However,
the kernel will send SIGSEGV when the stack is protected by PKEY,  so
there is a benefit to make this work.  (Maybe Jann can share some more
thoughts on the benefits)

And I believe we could do this in a way with minimum ABI change, as below:
- allocate PKEY with a new flag (PKEY_ALTSTACK)
- at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected,
(similar as what mprotect does in this patch) and save it along with
stack address/size.
- at signaling handling, use the saved info to fill in PKRU.
The ABI change is similar to PKEY_ENFORCE_API, and there is no
backward compatibility issue.

Will these mentioned help our case ? What do you think ?

(Stephan has more info on gains,  as far as I know, V8 engineers have
worked/thought really hard to come to a suitable solution to make
chrome browser safer)

[1] https://lore.kernel.org/linux-mm/20230516165754.pocx4kaagn3yyw3r@revolver/T/
[2] https://docs.google.com/document/d/1OlnJbR5TMoaOAJsf4hHOc-FdTmYK2aDUI7d2hfCZSOo/edit?resourcekey=0-v9UJXONYsnG5PlCBbcYqIw#

Thanks!
Best regards,
-Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-18 20:20                   ` Jeff Xu
@ 2023-05-18 21:04                     ` Dave Hansen
  2023-05-19 11:13                       ` Stephen Röttger
                                         ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Dave Hansen @ 2023-05-18 21:04 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Stephen Röttger, jeffxu, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole
thing: it's headed down a rabbit hole
>> which is *highly* specialized both in the apps that will use it and the
>> attacks it will mitigate.  It probably *requires* turning off a bunch of
>> syscalls (like io_uring) that folks kinda like in general.
>>
> ChromeOS currently disabled io_uring, but it is not required to do so.
> io_uring supports the IORING_OP_MADVICE operation, which calls the
> do_madvise() function. This means that io_uring will have the same
> pkey checks as the madvice() system call.  From that perspective, we
> will fully support io_uring for this feature.

io_uring fundamentally doesn't have the same checks.  The kernel side
work can be done from an asynchronous kernel thread.  That kernel thread
doesn't have a meaningful PKRU value.  The register has a value, but
it's not really related to the userspace threads that are sending it
requests.

>> We're balancing that highly specialized mitigation with a feature that
>> add new ABI, touches core memory management code and signal handling.
>>
> The ABI change uses the existing flag field in pkey_alloc() which is
> reserved. The implementation is backward compatible with all existing
> pkey usages in both kernel and user space.  Or do you have other
> concerns about ABI in mind ?

I'm not worried about the past, I'm worried any time we add a new ABI
since we need to support it forever.

> Yes, you are right about the risk of touching core mm code. To
> minimize the risk, I try to control the scope of the change (it is
> about 3 lines in mprotect, more in munmap but really just 3 effective
> lines from syscall entry). I added new self-tests in mm to make sure
> it doesn't regress in api behavior. I run those tests before and after
> my kernel code change to make sure the behavior remains the same, I
> tested it on 5.15 and 6.1 and 6.4-rc1.  Actually, the testing
> discovered a behavior change for mprotect() between 6.1 and 6.4  (not
> from this patch, there are refactoring works going on in mm) see this
> thread [1]
> I hope those steps will help to mitigate the risk.
> 
> Agreed on signaling handling is a tough part: what do you think about
> the approach (modifying PKRU from saved stack after XSAVE), is there a
> blocker ?

Yes, signal entry and sigreturn are not necessarily symmetric so you
can't really have a stack.

>> On the x86 side, PKRU is a painfully special snowflake.  It's exposed in
>> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
>> This would be making it an even more special snowflake because it would
> 
> I admit I'm quite ignorant on XSAVE  to understand the above
> statement, and how that is related. Could you explain it to me please
> ? And what is in your mind that might improve the situation ?

In a nutshell: XSAVE components are classified as either user or
supervisor.  User components can be modified from userspace and
supervisor ones only from the kernel.  In general, user components don't
affect the kernel; the kernel doesn't care what is in ZMM11 (an
XSAVE-managed register).  That lets us do fun stuff like be lazy about
when ZMM11 is saved/restored.  Being lazy is good because it give us
things like faster context switches and KVM VMEXIT handling.

PKRU is a user component, but it affects the kernel when the kernel does
copy_to/from_user() and friends.  That means that the kernel can't do
any "fun stuff" with PKRU.  As soon as userspace provides a new value,
the kernel needs to start respecting it.  That makes PKRU a very special
snowflake.

So, even though PKRU can be managed by XSAVE, it isn't.  It isn't kept
in the kernel XSAVE buffer.  But it *IS* in the signal stack XSAVE
buffer.  You *can* save/restore it with the other XSAVE components with
ptrace().  The user<->kernel ABI pretends that PKRU is XSAVE managed
even though it is not.

All of this is special-cased.  There's a ton of code to handle this
mess.  It's _complicated_.  I haven't even started talking about how
this interacts with KVM and guests.

How could we improve it?  A time machine would help to either change the
architecture or have Linux ignore the fact that XSAVE knows anything
about PKRU.

So, the bar is pretty high for things that want to further muck with
PKRU.  Add signal and sigaltstack in particular into the fray, and we've
got a recipe for disaster.  sigaltstack and XSAVE don't really get along
very well.  https://lwn.net/Articles/862541/

>> need new altstack ABI and handling.
>>
> I thought adding protected memory support to signaling handling is an
> independent project with its own weight. As Jann Horn points out in
> [2]:  "we could prevent the attacker from corrupting the signal
> context if we can protect the signal stack with a pkey."   However,
> the kernel will send SIGSEGV when the stack is protected by PKEY,  so
> there is a benefit to make this work.  (Maybe Jann can share some more
> thoughts on the benefits)
> 
> And I believe we could do this in a way with minimum ABI change, as below:
> - allocate PKEY with a new flag (PKEY_ALTSTACK)
> - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected,
> (similar as what mprotect does in this patch) and save it along with
> stack address/size.
> - at signaling handling, use the saved info to fill in PKRU.
> The ABI change is similar to PKEY_ENFORCE_API, and there is no
> backward compatibility issue.
> 
> Will these mentioned help our case ? What do you think ?

To be honest, no.

What you've laid out here is the tip of the complexity iceberg.  There
are a lot of pieces of the kernel that are not yet factored in.

Let's also remember: protection keys is *NOT* a security feature.  It's
arguable that pkeys is a square peg trying to go into a round security hole.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api()
  2023-05-15 13:05 ` [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api() jeffxu
@ 2023-05-18 21:43   ` Dave Hansen
  2023-05-18 22:51     ` Jeff Xu
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-18 21:43 UTC (permalink / raw)
  To: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger
  Cc: akpm, jeffxu, linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/15/23 06:05, jeffxu@chromium.org wrote:
> +static inline int __arch_check_vma_pkey_for_write(struct vm_area_struct *vma)
> +{
> +	int pkey = vma_pkey(vma);
> +
> +	if (mm_pkey_enforce_api(vma->vm_mm, pkey)) {
> +		if (!__pkru_allows_write(read_pkru(), pkey))
> +			return -EACCES;
> +	}
> +
> +	return 0;
> +}

Please think very carefully about what I'm about to say:

What connects vma->vm_mm to read_pkru() here?

Now think about what happens when we have kthread_use_mm() or a ptrace()
doing get_task_mm() and working on another process's mm or VMA.

Look at arch_vma_access_permitted() and notice how it avoids read_pkru()
for 'foreign' aka. 'remote' accesses:

> static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>                 bool write, bool execute, bool foreign)
> {
...
>         if (foreign || vma_is_foreign(vma))
>                 return true;
>         return // check read_pkru()
> }

In other words, it lets all remote accesses right through.  That's
because there is *NOTHING* that fundamentally and tightly connects the
PKRU value in this context to the VMA or the context that initiated this
operation.

If your security model depends on PKRU protection, this 'remote'
disconnection is problematic.  The PKRU enforcement inside the kernel is
best-effort.  That usually doesn't map into the security space very well.

Do you have a solid handle on all call paths that will reach
__arch_check_vma_pkey_for_write() and can you ensure they are all
non-remote?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api()
  2023-05-18 21:43   ` Dave Hansen
@ 2023-05-18 22:51     ` Jeff Xu
  2023-05-19  0:00       ` Dave Hansen
  0 siblings, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-05-18 22:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On Thu, May 18, 2023 at 2:43 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/15/23 06:05, jeffxu@chromium.org wrote:
> > +static inline int __arch_check_vma_pkey_for_write(struct vm_area_struct *vma)
> > +{
> > +     int pkey = vma_pkey(vma);
> > +
> > +     if (mm_pkey_enforce_api(vma->vm_mm, pkey)) {
> > +             if (!__pkru_allows_write(read_pkru(), pkey))
> > +                     return -EACCES;
> > +     }
> > +
> > +     return 0;
> > +}
>
> Please think very carefully about what I'm about to say:
>
> What connects vma->vm_mm to read_pkru() here?
>
> Now think about what happens when we have kthread_use_mm() or a ptrace()
> doing get_task_mm() and working on another process's mm or VMA.
>
> Look at arch_vma_access_permitted() and notice how it avoids read_pkru()
> for 'foreign' aka. 'remote' accesses:
>
> > static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
> >                 bool write, bool execute, bool foreign)
> > {
> ...
> >         if (foreign || vma_is_foreign(vma))
> >                 return true;
> >         return // check read_pkru()
> > }
>
> In other words, it lets all remote accesses right through.  That's
> because there is *NOTHING* that fundamentally and tightly connects the
> PKRU value in this context to the VMA or the context that initiated this
> operation.
>
> If your security model depends on PKRU protection, this 'remote'
> disconnection is problematic.  The PKRU enforcement inside the kernel is
> best-effort.  That usually doesn't map into the security space very well.
>
> Do you have a solid handle on all call paths that will reach
> __arch_check_vma_pkey_for_write() and can you ensure they are all
> non-remote?

Is this about the attack scenario where the attacker uses ptrace()
into the chrome process ? if so it is not in our threat model, and
that is more related to sandboxing on the host.

Or is this about io_uring? Yes, io_uring kernel thread breaks our
expectations of PKRU & user space threads, however I thought the break
is not just for this - any syscall involved in memory operation will
break after into io_uring ?

Other than those, yes, I try to ensure the check is only used at the
beginning of syscall entry in all cases, which should be non-remote I
hope.

Thanks
-Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api()
  2023-05-18 22:51     ` Jeff Xu
@ 2023-05-19  0:00       ` Dave Hansen
  2023-05-19 11:22         ` Stephen Röttger
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2023-05-19  0:00 UTC (permalink / raw)
  To: Jeff Xu
  Cc: jeffxu, luto, jorgelo, keescook, groeck, jannh, sroettger, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

On 5/18/23 15:51, Jeff Xu wrote:
>> Do you have a solid handle on all call paths that will reach
>> __arch_check_vma_pkey_for_write() and can you ensure they are all
>> non-remote?
> Is this about the attack scenario where the attacker uses ptrace()
> into the chrome process ? if so it is not in our threat model, and
> that is more related to sandboxing on the host.

The attacker would use *some* remote interface.  ptrace() is just one of
those remote interfaces.

> Or is this about io_uring? Yes, io_uring kernel thread breaks our
> expectations of PKRU & user space threads, however I thought the break
> is not just for this - any syscall involved in memory operation will
> break after into io_uring ?

I'm not quite following.

Please just do me a favor: have the io_uring maintainers look at your
proposal.  Make sure that the defenses you are building can work in a
process where io_uring is in use by the benign threads.

Those same folks are pretty familiar with the other, more traditional
I/O syscalls that have in-memory descriptors that control syscall
behavior like readv/writev.  Those also need a close look.

> Other than those, yes, I try to ensure the check is only used at the
> beginning of syscall entry in all cases, which should be non-remote I
> hope.

You're right that synchronous, shallow syscall paths are usually
non-remote.  But those aren't the problem.  The problem is that there
*ARE* remote accesses and those are a potential hole for this whole
mechanism.

Can they be closed?  I don't know.  I honestly don't have a great grasp
on how widespread these things are.  You'll need a much more complete
grasp on them than I have before this thing can go forward.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-18 21:04                     ` Dave Hansen
@ 2023-05-19 11:13                       ` Stephen Röttger
  2023-05-24 20:15                       ` Jeff Xu
  2023-06-01  1:39                       ` Jeff Xu
  2 siblings, 0 replies; 44+ messages in thread
From: Stephen Röttger @ 2023-05-19 11:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jeff Xu, jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

[-- Attachment #1: Type: text/plain, Size: 7957 bytes --]

On Thu, May 18, 2023 at 11:04 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole
> thing: it's headed down a rabbit hole
> >> which is *highly* specialized both in the apps that will use it and the
> >> attacks it will mitigate.  It probably *requires* turning off a bunch of
> >> syscalls (like io_uring) that folks kinda like in general.
> >>
> > ChromeOS currently disabled io_uring, but it is not required to do so.
> > io_uring supports the IORING_OP_MADVICE operation, which calls the
> > do_madvise() function. This means that io_uring will have the same
> > pkey checks as the madvice() system call.  From that perspective, we
> > will fully support io_uring for this feature.
>
> io_uring fundamentally doesn't have the same checks.  The kernel side
> work can be done from an asynchronous kernel thread.  That kernel thread
> doesn't have a meaningful PKRU value.  The register has a value, but
> it's not really related to the userspace threads that are sending it
> requests.
>
> >> We're balancing that highly specialized mitigation with a feature that
> >> add new ABI, touches core memory management code and signal handling.
> >>
> > The ABI change uses the existing flag field in pkey_alloc() which is
> > reserved. The implementation is backward compatible with all existing
> > pkey usages in both kernel and user space.  Or do you have other
> > concerns about ABI in mind ?
>
> I'm not worried about the past, I'm worried any time we add a new ABI
> since we need to support it forever.
>
> > Yes, you are right about the risk of touching core mm code. To
> > minimize the risk, I try to control the scope of the change (it is
> > about 3 lines in mprotect, more in munmap but really just 3 effective
> > lines from syscall entry). I added new self-tests in mm to make sure
> > it doesn't regress in api behavior. I run those tests before and after
> > my kernel code change to make sure the behavior remains the same, I
> > tested it on 5.15 and 6.1 and 6.4-rc1.  Actually, the testing
> > discovered a behavior change for mprotect() between 6.1 and 6.4  (not
> > from this patch, there are refactoring works going on in mm) see this
> > thread [1]
> > I hope those steps will help to mitigate the risk.
> >
> > Agreed on signaling handling is a tough part: what do you think about
> > the approach (modifying PKRU from saved stack after XSAVE), is there a
> > blocker ?
>
> Yes, signal entry and sigreturn are not necessarily symmetric so you
> can't really have a stack.
>
> >> On the x86 side, PKRU is a painfully special snowflake.  It's exposed in
> >> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
> >> This would be making it an even more special snowflake because it would
> >
> > I admit I'm quite ignorant on XSAVE  to understand the above
> > statement, and how that is related. Could you explain it to me please
> > ? And what is in your mind that might improve the situation ?
>
> In a nutshell: XSAVE components are classified as either user or
> supervisor.  User components can be modified from userspace and
> supervisor ones only from the kernel.  In general, user components don't
> affect the kernel; the kernel doesn't care what is in ZMM11 (an
> XSAVE-managed register).  That lets us do fun stuff like be lazy about
> when ZMM11 is saved/restored.  Being lazy is good because it give us
> things like faster context switches and KVM VMEXIT handling.
>
> PKRU is a user component, but it affects the kernel when the kernel does
> copy_to/from_user() and friends.  That means that the kernel can't do
> any "fun stuff" with PKRU.  As soon as userspace provides a new value,
> the kernel needs to start respecting it.  That makes PKRU a very special
> snowflake.
>
> So, even though PKRU can be managed by XSAVE, it isn't.  It isn't kept
> in the kernel XSAVE buffer.  But it *IS* in the signal stack XSAVE
> buffer.  You *can* save/restore it with the other XSAVE components with
> ptrace().  The user<->kernel ABI pretends that PKRU is XSAVE managed
> even though it is not.
>
> All of this is special-cased.  There's a ton of code to handle this
> mess.  It's _complicated_.  I haven't even started talking about how
> this interacts with KVM and guests.
>
> How could we improve it?  A time machine would help to either change the
> architecture or have Linux ignore the fact that XSAVE knows anything
> about PKRU.
>
> So, the bar is pretty high for things that want to further muck with
> PKRU.  Add signal and sigaltstack in particular into the fray, and we've
> got a recipe for disaster.  sigaltstack and XSAVE don't really get along
> very well.  https://lwn.net/Articles/862541/
>
> >> need new altstack ABI and handling.
> >>
> > I thought adding protected memory support to signaling handling is an
> > independent project with its own weight. As Jann Horn points out in
> > [2]:  "we could prevent the attacker from corrupting the signal
> > context if we can protect the signal stack with a pkey."   However,
> > the kernel will send SIGSEGV when the stack is protected by PKEY,  so
> > there is a benefit to make this work.  (Maybe Jann can share some more
> > thoughts on the benefits)
> >
> > And I believe we could do this in a way with minimum ABI change, as below:
> > - allocate PKEY with a new flag (PKEY_ALTSTACK)
> > - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected,
> > (similar as what mprotect does in this patch) and save it along with
> > stack address/size.
> > - at signaling handling, use the saved info to fill in PKRU.
> > The ABI change is similar to PKEY_ENFORCE_API, and there is no
> > backward compatibility issue.
> >
> > Will these mentioned help our case ? What do you think ?
>
> To be honest, no.
>
> What you've laid out here is the tip of the complexity iceberg.  There
> are a lot of pieces of the kernel that are not yet factored in.
>
> Let's also remember: protection keys is *NOT* a security feature.  It's
> arguable that pkeys is a square peg trying to go into a round security hole.

While they're not a security feature, they're pretty close to providing us with
exactly what we need: per-thread memory permissions that we can use for
in-process isolation.
We've spent quite some effort up front thinking about potential attacks and
we're confident we can build something that will pose a meaningful boundary.

> On the x86 side, PKRU is a painfully special snowflake.  It's exposed in
> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
> This would be making it an even more special snowflake because it would
> need new altstack ABI and handling.

Most of the complexity in the signal handling proposal seems to come from the
saving/restoring pkru before/after the signal handler execution. However, this
is just nice to have. We just need the kernel to allow us to register
pkey-tagged memory as a sigaltstack, i.e. it shouldn't crash when trying to
write the register state to the stack. Everything else, we can do in userland.

> It probably *requires* turning off a bunch of
> syscalls (like io_uring) that folks kinda like in general.

Kind of. This approach only works in combination with an effort in userland to
restrict the syscalls. Though that doesn't mean you have to turn them off,
there's also the option of adding validation before it.
The same applies to the memory management syscalls in this patchset. We can add
validation for these in userland, but we're hoping to do it in kernel instead
for the reasons I mentioned before (e.g. they're very common and it's much
easier to validate in the kernel). Also subjectively it seems like a
nice property
if the pkey protections would not just apply to the memory contents, but also
apply to the metadata.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api()
  2023-05-19  0:00       ` Dave Hansen
@ 2023-05-19 11:22         ` Stephen Röttger
  0 siblings, 0 replies; 44+ messages in thread
From: Stephen Röttger @ 2023-05-19 11:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jeff Xu, jeffxu, luto, jorgelo, keescook, groeck, jannh, akpm,
	linux-kernel, linux-kselftest, linux-mm, linux-hardening

[-- Attachment #1: Type: text/plain, Size: 2067 bytes --]

On Fri, May 19, 2023 at 2:00 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/18/23 15:51, Jeff Xu wrote:
> >> Do you have a solid handle on all call paths that will reach
> >> __arch_check_vma_pkey_for_write() and can you ensure they are all
> >> non-remote?
> > Is this about the attack scenario where the attacker uses ptrace()
> > into the chrome process ? if so it is not in our threat model, and
> > that is more related to sandboxing on the host.
>
> The attacker would use *some* remote interface.  ptrace() is just one of
> those remote interfaces.
>
> > Or is this about io_uring? Yes, io_uring kernel thread breaks our
> > expectations of PKRU & user space threads, however I thought the break
> > is not just for this - any syscall involved in memory operation will
> > break after into io_uring ?
>
> I'm not quite following.
>
> Please just do me a favor: have the io_uring maintainers look at your
> proposal.  Make sure that the defenses you are building can work in a
> process where io_uring is in use by the benign threads.
>
> Those same folks are pretty familiar with the other, more traditional
> I/O syscalls that have in-memory descriptors that control syscall
> behavior like readv/writev.  Those also need a close look.
>
> > Other than those, yes, I try to ensure the check is only used at the
> > beginning of syscall entry in all cases, which should be non-remote I
> > hope.
>
> You're right that synchronous, shallow syscall paths are usually
> non-remote.  But those aren't the problem.  The problem is that there
> *ARE* remote accesses and those are a potential hole for this whole
> mechanism.
>
> Can they be closed?  I don't know.  I honestly don't have a great grasp
> on how widespread these things are.  You'll need a much more complete
> grasp on them than I have before this thing can go forward.

I don't think the remote writes are a problem for us if they're initiated from
the same process. It's a case of syscalls where we need to add special
validation in userspace.

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4005 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-18 21:04                     ` Dave Hansen
  2023-05-19 11:13                       ` Stephen Röttger
@ 2023-05-24 20:15                       ` Jeff Xu
  2023-06-01  1:39                       ` Jeff Xu
  2 siblings, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-24 20:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jeff Xu, Stephen Röttger, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

Thanks for bringing this to my attention. Regarding io_uring:
>
> io_uring fundamentally doesn't have the same checks.  The kernel side
> work can be done from an asynchronous kernel thread.  That kernel thread
> doesn't have a meaningful PKRU value.  The register has a value, but
> it's not really related to the userspace threads that are sending it
> requests.
>

I asked the question to the io_uring list [1].  io_uring thread will
respect PKRU of the user thread, async or not,  the behavior is the
same as regular syscall. There will be no issue for io_uring, i.e if
it decides to add more memory mapping syscalls to supported cmd in
future.

[1] https://lore.kernel.org/io-uring/CABi2SkUp45HEt7eQ6a47Z7b3LzW=4m3xAakG35os7puCO2dkng@mail.gmail.com/

Thanks.
-Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-18 15:37                 ` Dave Hansen
  2023-05-18 20:20                   ` Jeff Xu
@ 2023-05-31 23:02                   ` Jeff Xu
  1 sibling, 0 replies; 44+ messages in thread
From: Jeff Xu @ 2023-05-31 23:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jeff Xu, Stephen Röttger, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

Hi Dave,

Regarding siglongjmp:

On Thu, May 18, 2023 at 8:37 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/17/23 16:48, Jeff Xu wrote:
> > However, there are a few challenges I have not yet worked through.
> > First, the code needs to track when the first signaling entry occurs
> > (saving the PKRU register to the thread struct) and when it is last
> > returned (restoring the PKRU register from the thread struct).
>
> Would tracking signal "depth" work in the face of things like siglongjmp?
>
siglongjmp is interesting, thanks for bringing this up.

With siglongjmp, the thread doesn't go back to the place where signal is
raised, indeed, this idea of tracking the first signaling entry
doesn't work well with siglongjmp.

Thanks for your insight!
-Jeff


-Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-05-18 21:04                     ` Dave Hansen
  2023-05-19 11:13                       ` Stephen Röttger
  2023-05-24 20:15                       ` Jeff Xu
@ 2023-06-01  1:39                       ` Jeff Xu
  2023-06-01 16:16                         ` Dave Hansen
  2 siblings, 1 reply; 44+ messages in thread
From: Jeff Xu @ 2023-06-01  1:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Jeff Xu, Stephen Röttger, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

Hi Dave,
Thanks for feedback, regarding sigaltstack:

On Thu, May 18, 2023 at 2:04 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > Agreed on signaling handling is a tough part: what do you think about
> > the approach (modifying PKRU from saved stack after XSAVE), is there a
> > blocker ?
>
> Yes, signal entry and sigreturn are not necessarily symmetric so you
> can't really have a stack.
>

To clarify: I mean this option below:
- before get_sigframe(), save PKUR => tmp
- modify thread's PKRU so it can write to sigframe
- XSAVE
- save tmp => sigframe

I believe you proposed this in a previous discussion [1]:
and I quote here:
"There's a delicate point when building the stack frame that the
kernel would need to move over to the new PKRU value to build the
frame before it writes the *OLD* value to the frame.  But, it's far
from impossible."

sigreturn will restore thread's original PKRU from sigframe.
In case of asymmetrics caused by siglongjmp, user space doesn't call
sigreturn, the application needs to set desired PKRU before siglongjmp.

I think this solution should work.

[1] https://lore.kernel.org/lkml/b4f0dca5-1d15-67f7-4600-9a0a91e9d0bd@intel.com/

Best regards,
-Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1
  2023-06-01  1:39                       ` Jeff Xu
@ 2023-06-01 16:16                         ` Dave Hansen
  0 siblings, 0 replies; 44+ messages in thread
From: Dave Hansen @ 2023-06-01 16:16 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jeff Xu, Stephen Röttger, luto, jorgelo, keescook, groeck,
	jannh, akpm, linux-kernel, linux-kselftest, linux-mm,
	linux-hardening

On 5/31/23 18:39, Jeff Xu wrote:
> I think this solution should work.

By "work" I think you mean that if laser-focused on this one use case,
without a full implementation, it looks like it can work.

I'll give you a "maybe" on that.

But that leaves out the bigger picture.  How many other things will we
regress doing this?  What's the opportunity cost?  What other things
will get neglected because we did _this_ one?  Are there more users out
there?

Looking at the big picture, I'm not convinced those tradeoffs are good
ones (and you're not going to find anyone that's a bigger fan of pkeys
than me).

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2023-06-01 16:20 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-15 13:05 [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 jeffxu
2023-05-15 13:05 ` [PATCH 1/6] PKEY: Introduce PKEY_ENFORCE_API flag jeffxu
2023-05-16 23:14   ` Dave Hansen
2023-05-16 23:55     ` Jeff Xu
2023-05-17 11:07     ` Stephen Röttger
2023-05-15 13:05 ` [PATCH 2/6] PKEY: Add arch_check_pkey_enforce_api() jeffxu
2023-05-18 21:43   ` Dave Hansen
2023-05-18 22:51     ` Jeff Xu
2023-05-19  0:00       ` Dave Hansen
2023-05-19 11:22         ` Stephen Röttger
2023-05-15 13:05 ` [PATCH 3/6] PKEY: Apply PKEY_ENFORCE_API to mprotect jeffxu
2023-05-16 20:07   ` Kees Cook
2023-05-16 22:23     ` Jeff Xu
2023-05-16 23:18   ` Dave Hansen
2023-05-16 23:36     ` Jeff Xu
2023-05-17  4:50       ` Jeff Xu
2023-05-15 13:05 ` [PATCH 4/6] PKEY:selftest pkey_enforce_api for mprotect jeffxu
2023-05-15 13:05 ` [PATCH 5/6] KEY: Apply PKEY_ENFORCE_API to munmap jeffxu
2023-05-16 20:06   ` Kees Cook
2023-05-16 22:24     ` Jeff Xu
2023-05-16 23:23   ` Dave Hansen
2023-05-17  0:08     ` Jeff Xu
2023-05-15 13:05 ` [PATCH 6/6] PKEY:selftest pkey_enforce_api for munmap jeffxu
2023-05-15 14:28 ` [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1 Dave Hansen
2023-05-16  7:06   ` Stephen Röttger
2023-05-16 22:41     ` Dave Hansen
2023-05-17 10:51       ` Stephen Röttger
2023-05-17 15:07         ` Dave Hansen
2023-05-17 15:21           ` Jeff Xu
2023-05-17 15:29             ` Dave Hansen
2023-05-17 23:48               ` Jeff Xu
2023-05-18 15:37                 ` Dave Hansen
2023-05-18 20:20                   ` Jeff Xu
2023-05-18 21:04                     ` Dave Hansen
2023-05-19 11:13                       ` Stephen Röttger
2023-05-24 20:15                       ` Jeff Xu
2023-06-01  1:39                       ` Jeff Xu
2023-06-01 16:16                         ` Dave Hansen
2023-05-31 23:02                   ` Jeff Xu
2023-05-16 20:08 ` Kees Cook
2023-05-16 22:17   ` Jeff Xu
2023-05-16 22:30     ` Dave Hansen
2023-05-16 23:39       ` Jeff Xu
2023-05-17 10:49   ` Stephen Röttger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).