[PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-13 13:49 ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

Hi,

The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but
subsequently changed to read-only. Therefore the filter simply rejects
any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
support (Branch Target Identification), the dynamic loader cannot change
an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
For libraries, it can resort to unmapping and re-mapping but for the
main executable it does not have a file descriptor. The original bug
report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
for libraries - [3].

Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
flag, inherited on fork() and execve(). The kernel tracks a previously
writeable mapping via a new VM_WAS_WRITE flag (64-bit only
architectures). I went for a personality flag by analogy with the
READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
we don't want more personality flags. A minor downside with the
personality flag is that there is no way for the user to query which
flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
this.

Posting this as an RFC to start a discussion and cc'ing some of the
systemd guys and those involved in the earlier thread around the glibc
workaround for dynamic libraries [4]. Before thinking of upstreaming
this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
BPF filter with the in-kernel one.

Thanks,

Catalin

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
[3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

Catalin Marinas (4):
  mm: Track previously writeable vma permission
  mm, personality: Implement memory-deny-write-execute as a personality
    flag
  fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
    flag
  arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC

 arch/arm64/Kconfig               |  1 +
 fs/binfmt_elf.c                  |  2 ++
 include/linux/mm.h               |  6 ++++++
 include/linux/mman.h             | 18 +++++++++++++++++-
 include/uapi/linux/binfmts.h     |  4 ++++
 include/uapi/linux/personality.h |  1 +
 mm/Kconfig                       |  4 ++++
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 9 files changed, 43 insertions(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-13 13:49 ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

Hi,

The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such BPF filter is stateless,
it cannot detect mappings that were previously writeable but
subsequently changed to read-only. Therefore the filter simply rejects
any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
support (Branch Target Identification), the dynamic loader cannot change
an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
For libraries, it can resort to unmapping and re-mapping but for the
main executable it does not have a file descriptor. The original bug
report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
for libraries - [3].

Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
flag, inherited on fork() and execve(). The kernel tracks a previously
writeable mapping via a new VM_WAS_WRITE flag (64-bit only
architectures). I went for a personality flag by analogy with the
READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
we don't want more personality flags. A minor downside with the
personality flag is that there is no way for the user to query which
flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
this.

Posting this as an RFC to start a discussion and cc'ing some of the
systemd guys and those involved in the earlier thread around the glibc
workaround for dynamic libraries [4]. Before thinking of upstreaming
this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
BPF filter with the in-kernel one.

Thanks,

Catalin

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
[3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

Catalin Marinas (4):
  mm: Track previously writeable vma permission
  mm, personality: Implement memory-deny-write-execute as a personality
    flag
  fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
    flag
  arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC

 arch/arm64/Kconfig               |  1 +
 fs/binfmt_elf.c                  |  2 ++
 include/linux/mm.h               |  6 ++++++
 include/linux/mman.h             | 18 +++++++++++++++++-
 include/uapi/linux/binfmts.h     |  4 ++++
 include/uapi/linux/personality.h |  1 +
 mm/Kconfig                       |  4 ++++
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 9 files changed, 43 insertions(+), 1 deletion(-)

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH RFC 1/4] mm: Track previously writeable vma permission
  2022-04-13 13:49 ` Catalin Marinas
@ 2022-04-13 13:49   ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

In order to support a memory-deny-write-execute policy for mprotect()
and prevent a previously writeable mapping from being made executable,
track the past VM_WRITE permission via a new VM_WAS_WRITE flag that is
not cleared on permission change.

VM_WAS_WRITE is a high VMA flag and since not all architectures may want
this feature, only define it if CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC is
selected, otherwise it is VM_NONE (zero).

Note that the new VM_WAS_WRITE flag would prevent merging of an always
read-only vma with a previously writeable vma that was made read-only. I
don't consider this a common case and even if we somehow allow such
merging, it would be confusing for the user if a read-only vma inherits
a VM_WAS_WRITE flag or the VM_WAS_WRITE flag is dropped.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h   | 6 ++++++
 include/linux/mman.h | 8 +++++++-
 mm/Kconfig           | 4 ++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..bec37abc0773 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -317,6 +317,12 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#ifdef CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC
+#define VM_WAS_WRITE		BIT(37)	/* only with ARCH_USES_HIGH_VMA_FLAGS */
+#else
+#define VM_WAS_WRITE		VM_NONE
+#endif
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index b66e91b8176c..2d841ddae2aa 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -141,10 +141,16 @@ static inline bool arch_validate_flags(unsigned long flags)
 static inline unsigned long
 calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
-	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
+	unsigned long vm_flags =
+	       _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
 	       arch_calc_vm_prot_bits(prot, pkey);
+
+	if (vm_flags & VM_WRITE)
+		vm_flags |= VM_WAS_WRITE;
+
+	return vm_flags;
 }
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..f140109f2a1e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
 config ARCH_HAS_PKEYS
 	bool
 
+config ARCH_ENABLE_DENY_WRITE_EXEC
+	bool
+	depends on ARCH_USES_HIGH_VMA_FLAGS
+
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
 	help

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 1/4] mm: Track previously writeable vma permission
@ 2022-04-13 13:49   ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

In order to support a memory-deny-write-execute policy for mprotect()
and prevent a previously writeable mapping from being made executable,
track the past VM_WRITE permission via a new VM_WAS_WRITE flag that is
not cleared on permission change.

VM_WAS_WRITE is a high VMA flag and since not all architectures may want
this feature, only define it if CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC is
selected, otherwise it is VM_NONE (zero).

Note that the new VM_WAS_WRITE flag would prevent merging of an always
read-only vma with a previously writeable vma that was made read-only. I
don't consider this a common case and even if we somehow allow such
merging, it would be confusing for the user if a read-only vma inherits
a VM_WAS_WRITE flag or the VM_WAS_WRITE flag is dropped.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h   | 6 ++++++
 include/linux/mman.h | 8 +++++++-
 mm/Kconfig           | 4 ++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..bec37abc0773 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -317,6 +317,12 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#ifdef CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC
+#define VM_WAS_WRITE		BIT(37)	/* only with ARCH_USES_HIGH_VMA_FLAGS */
+#else
+#define VM_WAS_WRITE		VM_NONE
+#endif
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index b66e91b8176c..2d841ddae2aa 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -141,10 +141,16 @@ static inline bool arch_validate_flags(unsigned long flags)
 static inline unsigned long
 calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
 {
-	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
+	unsigned long vm_flags =
+	       _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
 	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
 	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
 	       arch_calc_vm_prot_bits(prot, pkey);
+
+	if (vm_flags & VM_WRITE)
+		vm_flags |= VM_WAS_WRITE;
+
+	return vm_flags;
 }
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..f140109f2a1e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
 config ARCH_HAS_PKEYS
 	bool
 
+config ARCH_ENABLE_DENY_WRITE_EXEC
+	bool
+	depends on ARCH_USES_HIGH_VMA_FLAGS
+
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
 	help

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
  2022-04-13 13:49 ` Catalin Marinas
@ 2022-04-13 13:49   ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

The aim of such policy is to prevent a user task from inadvertently
creating an executable mapping that is or was writeable (and
subsequently made read-only).

An example of mmap() returning -EACCESS if the policy is enabled:

	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);

Similarly, mprotect() would return -EACCESS below:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);

With the past vma writeable permission tracking, mprotect() below would
also fail with -EACCESS:

	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC);

While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
systemd MDWE approach via SECCOMP BPF filters), we want the following
scenario to succeed:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

where PROT_BTI enables branch tracking identification on arm64.

The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
and execve(), was made by analogy to READ_IMPLIES_EXEC.

Note that it is sufficient to check for VM_WAS_WRITE in
map_deny_write_exec() as this flag is always set on VM_WRITE mappings.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mman.h             | 10 ++++++++++
 include/uapi/linux/personality.h |  1 +
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 2d841ddae2aa..17e91a1bdfb3 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -166,4 +166,14 @@ calc_vm_flag_bits(unsigned long flags)
 }
 
 unsigned long vm_commit_limit(void);
+
+static inline bool map_deny_write_exec(unsigned long vm_flags)
+{
+	if (IS_ENABLED(CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC) &&
+	    (current->personality & DENY_WRITE_EXEC) &&
+	    (vm_flags & VM_EXEC) && (vm_flags & VM_WAS_WRITE))
+		return true;
+	return false;
+}
+
 #endif /* _LINUX_MMAN_H */
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..c8d924be3dcd 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@ enum {
 	WHOLE_SECONDS =		0x2000000,
 	STICKY_TIMEOUTS	=	0x4000000,
 	ADDR_LIMIT_3GB = 	0x8000000,
+	DENY_WRITE_EXEC =	0x10000000,
 };
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..8e894270a80e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1579,6 +1579,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (map_deny_write_exec(vm_flags))
+		return -EACCES;
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..ff0d13a4c1ed 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -627,6 +627,11 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 			goto out;
 		}
 
+		if (map_deny_write_exec(newflags)) {
+			error = -EACCES;
+			goto out;
+		}
+
 		/* Allow architectures to sanity-check the new flags */
 		if (!arch_validate_flags(newflags)) {
 			error = -EINVAL;

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
@ 2022-04-13 13:49   ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

The aim of such policy is to prevent a user task from inadvertently
creating an executable mapping that is or was writeable (and
subsequently made read-only).

An example of mmap() returning -EACCESS if the policy is enabled:

	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);

Similarly, mprotect() would return -EACCESS below:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);

With the past vma writeable permission tracking, mprotect() below would
also fail with -EACCESS:

	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC);

While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
systemd MDWE approach via SECCOMP BPF filters), we want the following
scenario to succeed:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

where PROT_BTI enables branch tracking identification on arm64.

The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
and execve(), was made by analogy to READ_IMPLIES_EXEC.

Note that it is sufficient to check for VM_WAS_WRITE in
map_deny_write_exec() as this flag is always set on VM_WRITE mappings.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mman.h             | 10 ++++++++++
 include/uapi/linux/personality.h |  1 +
 mm/mmap.c                        |  3 +++
 mm/mprotect.c                    |  5 +++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 2d841ddae2aa..17e91a1bdfb3 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -166,4 +166,14 @@ calc_vm_flag_bits(unsigned long flags)
 }
 
 unsigned long vm_commit_limit(void);
+
+static inline bool map_deny_write_exec(unsigned long vm_flags)
+{
+	if (IS_ENABLED(CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC) &&
+	    (current->personality & DENY_WRITE_EXEC) &&
+	    (vm_flags & VM_EXEC) && (vm_flags & VM_WAS_WRITE))
+		return true;
+	return false;
+}
+
 #endif /* _LINUX_MMAN_H */
diff --git a/include/uapi/linux/personality.h b/include/uapi/linux/personality.h
index 49796b7756af..c8d924be3dcd 100644
--- a/include/uapi/linux/personality.h
+++ b/include/uapi/linux/personality.h
@@ -22,6 +22,7 @@ enum {
 	WHOLE_SECONDS =		0x2000000,
 	STICKY_TIMEOUTS	=	0x4000000,
 	ADDR_LIMIT_3GB = 	0x8000000,
+	DENY_WRITE_EXEC =	0x10000000,
 };
 
 /*
diff --git a/mm/mmap.c b/mm/mmap.c
index 3aa839f81e63..8e894270a80e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1579,6 +1579,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (map_deny_write_exec(vm_flags))
+		return -EACCES;
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..ff0d13a4c1ed 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -627,6 +627,11 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 			goto out;
 		}
 
+		if (map_deny_write_exec(newflags)) {
+			error = -EACCES;
+			goto out;
+		}
+
 		/* Allow architectures to sanity-check the new flags */
 		if (!arch_validate_flags(newflags)) {
 			error = -EINVAL;

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 3/4] fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality flag
  2022-04-13 13:49 ` Catalin Marinas
@ 2022-04-13 13:49   ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

Since personality() accepts any flags and does not mask out any unknown
bits, inform user space that such personality flag is supported via an
AT_FLAGS_DENY_WRITE_EXEC flag.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 fs/binfmt_elf.c              | 2 ++
 include/uapi/linux/binfmts.h | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6556e13ed95f..4e6cba1f67ee 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -265,6 +265,8 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 	NEW_AUX_ENT(AT_BASE, interp_load_addr);
 	if (bprm->interp_flags & BINPRM_FLAGS_PRESERVE_ARGV0)
 		flags |= AT_FLAGS_PRESERVE_ARGV0;
+	if (IS_ENABLED(CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC))
+		flags |= AT_FLAGS_DENY_WRITE_EXEC;
 	NEW_AUX_ENT(AT_FLAGS, flags);
 	NEW_AUX_ENT(AT_ENTRY, e_entry);
 	NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
diff --git a/include/uapi/linux/binfmts.h b/include/uapi/linux/binfmts.h
index c6f9450efc12..304bbb30264c 100644
--- a/include/uapi/linux/binfmts.h
+++ b/include/uapi/linux/binfmts.h
@@ -22,4 +22,8 @@ struct pt_regs;
 #define AT_FLAGS_PRESERVE_ARGV0_BIT 0
 #define AT_FLAGS_PRESERVE_ARGV0 (1 << AT_FLAGS_PRESERVE_ARGV0_BIT)
 
+/* support for DENY_WRITE_EXEC personality flag */
+#define AT_FLAGS_DENY_WRITE_EXEC_BIT 1
+#define AT_FLAGS_DENY_WRITE_EXEC (1 << AT_FLAGS_DENY_WRITE_EXEC_BIT)
+
 #endif /* _UAPI_LINUX_BINFMTS_H */

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 3/4] fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality flag
@ 2022-04-13 13:49   ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

Since personality() accepts any flags and does not mask out any unknown
bits, inform user space that such personality flag is supported via an
AT_FLAGS_DENY_WRITE_EXEC flag.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 fs/binfmt_elf.c              | 2 ++
 include/uapi/linux/binfmts.h | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6556e13ed95f..4e6cba1f67ee 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -265,6 +265,8 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 	NEW_AUX_ENT(AT_BASE, interp_load_addr);
 	if (bprm->interp_flags & BINPRM_FLAGS_PRESERVE_ARGV0)
 		flags |= AT_FLAGS_PRESERVE_ARGV0;
+	if (IS_ENABLED(CONFIG_ARCH_ENABLE_DENY_WRITE_EXEC))
+		flags |= AT_FLAGS_DENY_WRITE_EXEC;
 	NEW_AUX_ENT(AT_FLAGS, flags);
 	NEW_AUX_ENT(AT_ENTRY, e_entry);
 	NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
diff --git a/include/uapi/linux/binfmts.h b/include/uapi/linux/binfmts.h
index c6f9450efc12..304bbb30264c 100644
--- a/include/uapi/linux/binfmts.h
+++ b/include/uapi/linux/binfmts.h
@@ -22,4 +22,8 @@ struct pt_regs;
 #define AT_FLAGS_PRESERVE_ARGV0_BIT 0
 #define AT_FLAGS_PRESERVE_ARGV0 (1 << AT_FLAGS_PRESERVE_ARGV0_BIT)
 
+/* support for DENY_WRITE_EXEC personality flag */
+#define AT_FLAGS_DENY_WRITE_EXEC_BIT 1
+#define AT_FLAGS_DENY_WRITE_EXEC (1 << AT_FLAGS_DENY_WRITE_EXEC_BIT)
+
 #endif /* _UAPI_LINUX_BINFMTS_H */

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 4/4] arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
  2022-04-13 13:49 ` Catalin Marinas
@ 2022-04-13 13:49   ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

This will allow the DENY_WRITE_EXEC personality flag to prevent creating
a PROT_EXEC mapping that is or was also PROT_WRITE.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 57c4c995965f..6cbdc8294337 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -13,6 +13,7 @@ config ARM64
 	select ARCH_BINFMT_ELF_EXTRA_PHDRS
 	select ARCH_BINFMT_ELF_STATE
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
+	select ARCH_ENABLE_DENY_WRITE_EXEC
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_MEMORY_HOTPLUG
 	select ARCH_ENABLE_MEMORY_HOTREMOVE

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 4/4] arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
@ 2022-04-13 13:49   ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-13 13:49 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

This will allow the DENY_WRITE_EXEC personality flag to prevent creating
a PROT_EXEC mapping that is or was also PROT_WRITE.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 57c4c995965f..6cbdc8294337 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -13,6 +13,7 @@ config ARM64
 	select ARCH_BINFMT_ELF_EXTRA_PHDRS
 	select ARCH_BINFMT_ELF_STATE
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
+	select ARCH_ENABLE_DENY_WRITE_EXEC
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_MEMORY_HOTPLUG
 	select ARCH_ENABLE_MEMORY_HOTREMOVE

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-13 13:49 ` Catalin Marinas
@ 2022-04-13 18:39   ` Topi Miettinen
  -1 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-13 18:39 UTC (permalink / raw)
  To: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, linux-mm,
	linux-arm-kernel, linux-kernel, linux-abi-devel

On 13.4.2022 16.49, Catalin Marinas wrote:
> Hi,
> 
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].
> 
> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

With systemd there's a BPF construct to block personality changes 
(LockPersonality=yes) but I think prctl() would be easier to lock down 
irrevocably.

Requiring or implying NoNewPrivileges could prevent nasty surprises from 
set-uid Python programs which happen to use FFI.

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.

As the author of this feature in systemd (also similar feature in 
Firejail), I'd highly prefer in-kernel version to BPF protection. I'd 
definitely also want to use this in place of BPF on x86_64 and other 
arches too.

In-kernel version would probably allow covering pretty easily this case 
(maybe it already does):

	fd = memfd_create(...);
	write(fd, malicious_code, sizeof(malicious_code));
	mmap(..., PROT_EXEC, ..., fd);

Other memory W^X implementations include S.A.R.A [1] and SELinux 
EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks 
IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be 
useful additions here too (or future extensions if you prefer).

-Topi

[1] https://smeso.it/sara/
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-13 18:39   ` Topi Miettinen
  0 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-13 18:39 UTC (permalink / raw)
  To: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, linux-mm,
	linux-arm-kernel, linux-kernel, linux-abi-devel

On 13.4.2022 16.49, Catalin Marinas wrote:
> Hi,
> 
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].
> 
> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

With systemd there's a BPF construct to block personality changes 
(LockPersonality=yes) but I think prctl() would be easier to lock down 
irrevocably.

Requiring or implying NoNewPrivileges could prevent nasty surprises from 
set-uid Python programs which happen to use FFI.

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.

As the author of this feature in systemd (also similar feature in 
Firejail), I'd highly prefer in-kernel version to BPF protection. I'd 
definitely also want to use this in place of BPF on x86_64 and other 
arches too.

In-kernel version would probably allow covering pretty easily this case 
(maybe it already does):

	fd = memfd_create(...);
	write(fd, malicious_code, sizeof(malicious_code));
	mmap(..., PROT_EXEC, ..., fd);

Other memory W^X implementations include S.A.R.A [1] and SELinux 
EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks 
IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be 
useful additions here too (or future extensions if you prefer).

-Topi

[1] https://smeso.it/sara/
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-13 18:39   ` Topi Miettinen
@ 2022-04-14 13:49     ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-14 13:49 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel

Hi Topi,

On Wed, Apr 13, 2022 at 09:39:37PM +0300, Topi Miettinen wrote:
> On 13.4.2022 16.49, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> > 
> > Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> > flag, inherited on fork() and execve(). The kernel tracks a previously
> > writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> > architectures).

BTW, we can avoid the VM_WAS_WRITE tracking if we limit the check to the
current permissions. It would allow mprotect(PROT_EXEC) only if the
mapping is already executable. The mmap(PROT_WRITE|PROT_EXEC) would be
rejected, as expected. The difference from the current BPF filter is
that mprotect(PROT_EXEC|PROT_BTI) is allowed if the mapping was
previously mmap(PROT_READ|PROT_EXEC).

I'm open to go in this direction if it fits the systemd requirements
better. It's also less state to track in the kernel.

> > I went for a personality flag by analogy with the
> > READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> > we don't want more personality flags. A minor downside with the
> > personality flag is that there is no way for the user to query which
> > flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> > this.
> 
> With systemd there's a BPF construct to block personality changes
> (LockPersonality=yes) but I think prctl() would be easier to lock down
> irrevocably.

The personality flag is not sticky, so one could inadvertently clear it
without LockPersonality=yes. We could add a mask of sticky bits to
sys_personality() (only for new flags), though we might as well go with
a prctl(), we have more flexibility and finer-grained control if we want
to expand this to memfd files.

Would there be any reason to disable such behaviour for an application,
once enabled? I don't think it's currently possible with the BPF filter
but we can add it to a prctl().

> > Posting this as an RFC to start a discussion and cc'ing some of the
> > systemd guys and those involved in the earlier thread around the glibc
> > workaround for dynamic libraries [4]. Before thinking of upstreaming
> > this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> > BPF filter with the in-kernel one.
> 
> As the author of this feature in systemd (also similar feature in Firejail),
> I'd highly prefer in-kernel version to BPF protection. I'd definitely also
> want to use this in place of BPF on x86_64 and other arches too.

It is generic, so yes, it can be enabled for other architectures. A
minor issue with the VM_WAS_WRITE flag is that we ran out of 32-bit vma
flags. It could be expanded but not sure how much you care about MDWE on
32-bit architectures. An alternative is to drop the VM_WAS_WRITE
approach entirely, just use the current protection for the decision.

> In-kernel version would probably allow covering pretty easily this case
> (maybe it already does):
> 
> 	fd = memfd_create(...);
> 	write(fd, malicious_code, sizeof(malicious_code));
> 	mmap(..., PROT_EXEC, ..., fd);

This series doesn't cover it.

> Other memory W^X implementations include S.A.R.A [1] and SELinux
> EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks
> IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be
> useful additions here too (or future extensions if you prefer).

I had a quick look at SELinux and, IIUC, without the execmem permission
it prevents any anonymous mmap(PROT_EXEC). We could probably do
something similar here and check the file permission. I had an attempt
but S_PRIVATE doesn't seem to be set on the memfd_create() allocated
inode. I have to dig a bit more to see how to detect this. If we keep
the check for all files, it won't be able to map any ELF text sections
unless the binary is read-only.

> [1] https://smeso.it/sara/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-14 13:49     ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-14 13:49 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel

Hi Topi,

On Wed, Apr 13, 2022 at 09:39:37PM +0300, Topi Miettinen wrote:
> On 13.4.2022 16.49, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> > 
> > Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> > flag, inherited on fork() and execve(). The kernel tracks a previously
> > writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> > architectures).

BTW, we can avoid the VM_WAS_WRITE tracking if we limit the check to the
current permissions. It would allow mprotect(PROT_EXEC) only if the
mapping is already executable. The mmap(PROT_WRITE|PROT_EXEC) would be
rejected, as expected. The difference from the current BPF filter is
that mprotect(PROT_EXEC|PROT_BTI) is allowed if the mapping was
previously mmap(PROT_READ|PROT_EXEC).

I'm open to go in this direction if it fits the systemd requirements
better. It's also less state to track in the kernel.

> > I went for a personality flag by analogy with the
> > READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> > we don't want more personality flags. A minor downside with the
> > personality flag is that there is no way for the user to query which
> > flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> > this.
> 
> With systemd there's a BPF construct to block personality changes
> (LockPersonality=yes) but I think prctl() would be easier to lock down
> irrevocably.

The personality flag is not sticky, so one could inadvertently clear it
without LockPersonality=yes. We could add a mask of sticky bits to
sys_personality() (only for new flags), though we might as well go with
a prctl(), we have more flexibility and finer-grained control if we want
to expand this to memfd files.

Would there be any reason to disable such behaviour for an application,
once enabled? I don't think it's currently possible with the BPF filter
but we can add it to a prctl().

> > Posting this as an RFC to start a discussion and cc'ing some of the
> > systemd guys and those involved in the earlier thread around the glibc
> > workaround for dynamic libraries [4]. Before thinking of upstreaming
> > this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> > BPF filter with the in-kernel one.
> 
> As the author of this feature in systemd (also similar feature in Firejail),
> I'd highly prefer in-kernel version to BPF protection. I'd definitely also
> want to use this in place of BPF on x86_64 and other arches too.

It is generic, so yes, it can be enabled for other architectures. A
minor issue with the VM_WAS_WRITE flag is that we ran out of 32-bit vma
flags. It could be expanded but not sure how much you care about MDWE on
32-bit architectures. An alternative is to drop the VM_WAS_WRITE
approach entirely, just use the current protection for the decision.

> In-kernel version would probably allow covering pretty easily this case
> (maybe it already does):
> 
> 	fd = memfd_create(...);
> 	write(fd, malicious_code, sizeof(malicious_code));
> 	mmap(..., PROT_EXEC, ..., fd);

This series doesn't cover it.

> Other memory W^X implementations include S.A.R.A [1] and SELinux
> EXECMEM/EXECSTACK/EXECHEAP protections [2], [3]. SELinux checks
> IS_PRIVATE(file_inode(file)) and vma->anon_vma != NULL, which might be
> useful additions here too (or future extensions if you prefer).

I had a quick look at SELinux and, IIUC, without the execmem permission
it prevents any anonymous mmap(PROT_EXEC). We could probably do
something similar here and check the file permission. I had an attempt
but S_PRIVATE doesn't seem to be set on the memfd_create() allocated
inode. I have to dig a bit more to see how to detect this. If we keep
the check for all files, it won't be able to map any ELF text sections
unless the binary is read-only.

> [1] https://smeso.it/sara/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3708
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/selinux/hooks.c#n3787

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-13 13:49 ` Catalin Marinas
@ 2022-04-14 18:52   ` Kees Cook
  -1 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-14 18:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].

Right, so, the systemd filter is a big hammer solution for the kernel
not having a very easy way to provide W^X mapping protections to
userspace. There's stuff in SELinux, and there have been several
attempts[1] at other LSMs to do it too, but nothing stuck.

Given the filter, and the implementation of how to enable BTI, I see two
solutions:

- provide a way to do W^X so systemd can implement the feature differently
- provide a way to turn on BTI separate from mprotect to bypass the filter

I would agree, the latter seems like the greater hack, so I welcome
this RFC, though I think it might need to explore a bit of the feature
space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
it risks being too narrowly implemented. For example, playing well with
JITs should be part of the design, and will likely need some kind of
ELF flags and/or "sealing" mode, and to handle the vma alias case as
Jann Horn pointed out[2].

> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

My instinct here is to use a prctl(), which maps to other kinds of modern
inherited state (like no_new_privs).

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.
> 
> Thanks,
> 
> Catalin
> 
> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

So, yes, let's do it. It's long long overdue in the kernel. :)

-Kees

[1] https://github.com/KSPP/linux/issues/32
[2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611

> 
> Catalin Marinas (4):
>   mm: Track previously writeable vma permission
>   mm, personality: Implement memory-deny-write-execute as a personality
>     flag
>   fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>     flag
>   arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
> 
>  arch/arm64/Kconfig               |  1 +
>  fs/binfmt_elf.c                  |  2 ++
>  include/linux/mm.h               |  6 ++++++
>  include/linux/mman.h             | 18 +++++++++++++++++-
>  include/uapi/linux/binfmts.h     |  4 ++++
>  include/uapi/linux/personality.h |  1 +
>  mm/Kconfig                       |  4 ++++
>  mm/mmap.c                        |  3 +++
>  mm/mprotect.c                    |  5 +++++
>  9 files changed, 43 insertions(+), 1 deletion(-)
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-14 18:52   ` Kees Cook
  0 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-14 18:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].

Right, so, the systemd filter is a big hammer solution for the kernel
not having a very easy way to provide W^X mapping protections to
userspace. There's stuff in SELinux, and there have been several
attempts[1] at other LSMs to do it too, but nothing stuck.

Given the filter, and the implementation of how to enable BTI, I see two
solutions:

- provide a way to do W^X so systemd can implement the feature differently
- provide a way to turn on BTI separate from mprotect to bypass the filter

I would agree, the latter seems like the greater hack, so I welcome
this RFC, though I think it might need to explore a bit of the feature
space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
it risks being too narrowly implemented. For example, playing well with
JITs should be part of the design, and will likely need some kind of
ELF flags and/or "sealing" mode, and to handle the vma alias case as
Jann Horn pointed out[2].

> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

My instinct here is to use a prctl(), which maps to other kinds of modern
inherited state (like no_new_privs).

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.
> 
> Thanks,
> 
> Catalin
> 
> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

So, yes, let's do it. It's long long overdue in the kernel. :)

-Kees

[1] https://github.com/KSPP/linux/issues/32
[2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611

> 
> Catalin Marinas (4):
>   mm: Track previously writeable vma permission
>   mm, personality: Implement memory-deny-write-execute as a personality
>     flag
>   fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>     flag
>   arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
> 
>  arch/arm64/Kconfig               |  1 +
>  fs/binfmt_elf.c                  |  2 ++
>  include/linux/mm.h               |  6 ++++++
>  include/linux/mman.h             | 18 +++++++++++++++++-
>  include/uapi/linux/binfmts.h     |  4 ++++
>  include/uapi/linux/personality.h |  1 +
>  mm/Kconfig                       |  4 ++++
>  mm/mmap.c                        |  3 +++
>  mm/mprotect.c                    |  5 +++++
>  9 files changed, 43 insertions(+), 1 deletion(-)
> 

-- 
Kees Cook

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-14 18:52   ` Kees Cook
@ 2022-04-15 20:01     ` Topi Miettinen
  -1 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-15 20:01 UTC (permalink / raw)
  To: Kees Cook, Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 14.4.2022 21.52, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>> The background to this is that systemd has a configuration option called
>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>> is to prevent a user task from inadvertently creating an executable
>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>> it cannot detect mappings that were previously writeable but
>> subsequently changed to read-only. Therefore the filter simply rejects
>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>> support (Branch Target Identification), the dynamic loader cannot change
>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>> For libraries, it can resort to unmapping and re-mapping but for the
>> main executable it does not have a file descriptor. The original bug
>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>> for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack, so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

Another interesting case from 2006 by Ulrich Drepper is to use a 
temporary file and map it twice, once with PROT_WRITE and once with 
PROT_EXEC [1]. This isn't possible if the mount flags of the file 
systems are also in line with W^X principle. System services (unlike 
user apps) typically don't use /tmp nor /dev/shm (mounted with 
"rw,exec"). With systemd a simple file system W^X policy can be 
implemented for a service for example with NoExecPaths=/ ExecPaths=/usr 
ReadOnlyPaths=/usr. In-kernel MDWE probably could look beyond file 
descriptors and check if the mount flags of the file system containing 
the file being mmap()ed agree with W^X. The use cases for system 
services and user apps may be different: system services are often 
compatible with maximum hardening, while user apps may need various 
compatibility solutions if they use JIT, trampolines or FFI and access 
to W+X file systems may be also needed.

-Topi

[1] https://akkadia.org/drepper/selinux-mem.html

>> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
>> flag, inherited on fork() and execve(). The kernel tracks a previously
>> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
>> architectures). I went for a personality flag by analogy with the
>> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
>> we don't want more personality flags. A minor downside with the
>> personality flag is that there is no way for the user to query which
>> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
>> this.
> 
> My instinct here is to use a prctl(), which maps to other kinds of modern
> inherited state (like no_new_privs).
> 
>> Posting this as an RFC to start a discussion and cc'ing some of the
>> systemd guys and those involved in the earlier thread around the glibc
>> workaround for dynamic libraries [4]. Before thinking of upstreaming
>> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
>> BPF filter with the in-kernel one.
>>
>> Thanks,
>>
>> Catalin
>>
>> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
>> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
>> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com
> 
> So, yes, let's do it. It's long long overdue in the kernel. :)
> 
> -Kees
> 
> [1] https://github.com/KSPP/linux/issues/32
> [2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611
> 
>>
>> Catalin Marinas (4):
>>    mm: Track previously writeable vma permission
>>    mm, personality: Implement memory-deny-write-execute as a personality
>>      flag
>>    fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>>      flag
>>    arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
>>
>>   arch/arm64/Kconfig               |  1 +
>>   fs/binfmt_elf.c                  |  2 ++
>>   include/linux/mm.h               |  6 ++++++
>>   include/linux/mman.h             | 18 +++++++++++++++++-
>>   include/uapi/linux/binfmts.h     |  4 ++++
>>   include/uapi/linux/personality.h |  1 +
>>   mm/Kconfig                       |  4 ++++
>>   mm/mmap.c                        |  3 +++
>>   mm/mprotect.c                    |  5 +++++
>>   9 files changed, 43 insertions(+), 1 deletion(-)
>>
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-15 20:01     ` Topi Miettinen
  0 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-15 20:01 UTC (permalink / raw)
  To: Kees Cook, Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 14.4.2022 21.52, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>> The background to this is that systemd has a configuration option called
>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>> is to prevent a user task from inadvertently creating an executable
>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>> it cannot detect mappings that were previously writeable but
>> subsequently changed to read-only. Therefore the filter simply rejects
>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>> support (Branch Target Identification), the dynamic loader cannot change
>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>> For libraries, it can resort to unmapping and re-mapping but for the
>> main executable it does not have a file descriptor. The original bug
>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>> for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack, so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

Another interesting case from 2006 by Ulrich Drepper is to use a 
temporary file and map it twice, once with PROT_WRITE and once with 
PROT_EXEC [1]. This isn't possible if the mount flags of the file 
systems are also in line with W^X principle. System services (unlike 
user apps) typically don't use /tmp nor /dev/shm (mounted with 
"rw,exec"). With systemd a simple file system W^X policy can be 
implemented for a service for example with NoExecPaths=/ ExecPaths=/usr 
ReadOnlyPaths=/usr. In-kernel MDWE probably could look beyond file 
descriptors and check if the mount flags of the file system containing 
the file being mmap()ed agree with W^X. The use cases for system 
services and user apps may be different: system services are often 
compatible with maximum hardening, while user apps may need various 
compatibility solutions if they use JIT, trampolines or FFI and access 
to W+X file systems may be also needed.

-Topi

[1] https://akkadia.org/drepper/selinux-mem.html

>> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
>> flag, inherited on fork() and execve(). The kernel tracks a previously
>> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
>> architectures). I went for a personality flag by analogy with the
>> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
>> we don't want more personality flags. A minor downside with the
>> personality flag is that there is no way for the user to query which
>> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
>> this.
> 
> My instinct here is to use a prctl(), which maps to other kinds of modern
> inherited state (like no_new_privs).
> 
>> Posting this as an RFC to start a discussion and cc'ing some of the
>> systemd guys and those involved in the earlier thread around the glibc
>> workaround for dynamic libraries [4]. Before thinking of upstreaming
>> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
>> BPF filter with the in-kernel one.
>>
>> Thanks,
>>
>> Catalin
>>
>> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
>> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
>> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com
> 
> So, yes, let's do it. It's long long overdue in the kernel. :)
> 
> -Kees
> 
> [1] https://github.com/KSPP/linux/issues/32
> [2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611
> 
>>
>> Catalin Marinas (4):
>>    mm: Track previously writeable vma permission
>>    mm, personality: Implement memory-deny-write-execute as a personality
>>      flag
>>    fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>>      flag
>>    arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
>>
>>   arch/arm64/Kconfig               |  1 +
>>   fs/binfmt_elf.c                  |  2 ++
>>   include/linux/mm.h               |  6 ++++++
>>   include/linux/mman.h             | 18 +++++++++++++++++-
>>   include/uapi/linux/binfmts.h     |  4 ++++
>>   include/uapi/linux/personality.h |  1 +
>>   mm/Kconfig                       |  4 ++++
>>   mm/mmap.c                        |  3 +++
>>   mm/mprotect.c                    |  5 +++++
>>   9 files changed, 43 insertions(+), 1 deletion(-)
>>
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-14 18:52   ` Kees Cook
@ 2022-04-20 13:01     ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-20 13:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack,

We discussed such hacks in the past but they are just working around the
fundamental issue - systemd wants W^X but with BPF it can only achieve
it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
was already executable. If we find a better solution for W^X, we
wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).

> so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

I agree we should look at what we want to cover, though trying to avoid
re-inventing SELinux. With this patchset I went for the minimum that
systemd MDWE does with BPF.

I think JITs get around it using something like memfd with two separate
mappings to the same page. We could try to prevent such aliases but
allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Anyway, with a prctl() we can allow finer-grained control starting with
anonymous and file mappings and later extending to vma aliases,
writeable files etc. On top we can add a seal mask so that a process
cannot disable a control was set. Something like (I'm not good at
names):

	prctl(PR_MDWX_SET, flags, seal_mask);
	prctl(PR_MDWX_GET);

with flags like:

	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
	PR_MDWX_WRITEABLE_FILE

(needs some more thinking)

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-20 13:01     ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-20 13:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack,

We discussed such hacks in the past but they are just working around the
fundamental issue - systemd wants W^X but with BPF it can only achieve
it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
was already executable. If we find a better solution for W^X, we
wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).

> so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

I agree we should look at what we want to cover, though trying to avoid
re-inventing SELinux. With this patchset I went for the minimum that
systemd MDWE does with BPF.

I think JITs get around it using something like memfd with two separate
mappings to the same page. We could try to prevent such aliases but
allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Anyway, with a prctl() we can allow finer-grained control starting with
anonymous and file mappings and later extending to vma aliases,
writeable files etc. On top we can add a seal mask so that a process
cannot disable a control was set. Something like (I'm not good at
names):

	prctl(PR_MDWX_SET, flags, seal_mask);
	prctl(PR_MDWX_GET);

with flags like:

	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
	PR_MDWX_WRITEABLE_FILE

(needs some more thinking)

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 13:01     ` Catalin Marinas
@ 2022-04-20 17:44       ` Kees Cook
  -1 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-20 17:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Wed, Apr 20, 2022 at 02:01:02PM +0100, Catalin Marinas wrote:
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.

Right -- and I don't think we're at any risk of slippery-sloping into a
full MAC system. :)

I'm fine with doing the implementation in stages, if we've attempted to
design the steps (which I think you've got a good start on below).

> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Right -- I'd rather JITs carry some hard-coded property (i.e. ELF note)
to indicate the fact that they're expecting to do these kinds of things
rather than leaving it open for all processes.

> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE

The SARA proposal lists a lot of behavioral details to consider.
Quoting it[1] here:
>> - W^X enforcement will cause problems to any programs that needs
>>   memory pages mapped both as writable and executable at the same time e.g.
>>   programs with executable stack markings in the PT_GNU_STACK segment.

IMO, executable stack markings should be considered completely
deprecated. In fact, we've been warning about it since 2020:
47a2ebb7f505 ("execve: warn if process starts with executable stack")

So with execstack, under W^X, I think we should either:
- refuse to exec the process (default)
- disable W^X for the process (but not its children)

>> - W!->X restriction will cause problems to any program that
>>   needs to generate executable code at run time or to modify executable
>>   pages e.g. programs with a JIT compiler built-in or linked against a
>>   non-PIC library.

This seems solvable with an ELF flag.

>> - Executable MMAP prevention can work only with programs that have at least
>>   partial RELRO support. It's disabled automatically for programs that
>>   lack this feature. It will cause problems to any program that uses dlopen
>>   or tries to do an executable mmap. Unfortunately this feature is the one
>>   that could create most problems and should be enabled only after careful
>>   evaluation.

This seems like a variation on the execstack case, and we should be
able to detect the state and choose a behavior based on system settings,
and a smarter version (as SARA has) would track RELRO pages waiting for
the loader to make them read-only.

SARA was proposed with a set of feature flags[2]; quoting here:
>> | W^X                          |  0x0008  |

This is the basic property, refusing PROT_WRITE | PROT_EXEC. I note that
SARA also rejects opening /proc/$pid/mem with FMODE_WRITE when this is
enabled for a process. (It likely should extend to process_vm_write()
too.)

>> | W!->X Heap                   |  0x0001  |
>> | W!->X Stack                  |  0x0002  |
>> | W!->X Other memory           |  0x0004  |

This is for the vma history tracking, and I don't think we need to
separate this by memory type? It's nice to have the granularity, but for
a first-pass it seems like overkill? Maybe I'm missing some detail.

>> | Don't enforce, just complain |  0x0010  |
>> | Be Verbose                   |  0x0020  |

Unclear if these would work well with a non-LSM approach.

>> | Executable MMAP prevention   |  0x0040  |

This is the relro detection piece.

>> | Trampoline emulation         |  0x0100  |

This is a more advanced case for emulating execstack, but if we can just
ignore execstack entirely, this can go away?

>> | Children will inherit flags  |  0x0200  |

Should a process have that control?

>> | Force W^X on setprocattr     |  0x0080  |

This is a "seal" trigger, which could be done through prctl().

It looks like a bunch of the features are designed around having as much
as possible enabled at exec time, and then tightening it further as
various things are finished (e.g. execstack, relro, sealing, etc), which
is, I think, what would still be needed for a process launcher to be
able to enable this kind of protection. (i.e. hoping the process calls a
prctl() to enable the protection isn't going to work well with systemd.)

So, I *think* we could have a minimal form with these considerations:
 - execstack: declare it distinctly incompatible.
 - relro: I think this is solved with BIND_NOW. It's been a while since
   I looked deeply at this, but I think under BIND_NOW, the (executable)
   PLT doesn't ever need to be writable (since it points into the GOT),
   and the (initially writable) GOT is already never executable. This
   needs to be verified...
 - JITs can be allowed with a ELF flag and can choose to opt-in with
   a prctl().

-Kees

[1] https://lore.kernel.org/lkml/1562410493-8661-1-git-send-email-s.mesoraca16@gmail.com/
[2] https://lore.kernel.org/lkml/1562410493-8661-2-git-send-email-s.mesoraca16@gmail.com/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-20 17:44       ` Kees Cook
  0 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-20 17:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Wed, Apr 20, 2022 at 02:01:02PM +0100, Catalin Marinas wrote:
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.

Right -- and I don't think we're at any risk of slippery-sloping into a
full MAC system. :)

I'm fine with doing the implementation in stages, if we've attempted to
design the steps (which I think you've got a good start on below).

> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Right -- I'd rather JITs carry some hard-coded property (i.e. ELF note)
to indicate the fact that they're expecting to do these kinds of things
rather than leaving it open for all processes.

> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE

The SARA proposal lists a lot of behavioral details to consider.
Quoting it[1] here:
>> - W^X enforcement will cause problems to any programs that needs
>>   memory pages mapped both as writable and executable at the same time e.g.
>>   programs with executable stack markings in the PT_GNU_STACK segment.

IMO, executable stack markings should be considered completely
deprecated. In fact, we've been warning about it since 2020:
47a2ebb7f505 ("execve: warn if process starts with executable stack")

So with execstack, under W^X, I think we should either:
- refuse to exec the process (default)
- disable W^X for the process (but not its children)

>> - W!->X restriction will cause problems to any program that
>>   needs to generate executable code at run time or to modify executable
>>   pages e.g. programs with a JIT compiler built-in or linked against a
>>   non-PIC library.

This seems solvable with an ELF flag.

>> - Executable MMAP prevention can work only with programs that have at least
>>   partial RELRO support. It's disabled automatically for programs that
>>   lack this feature. It will cause problems to any program that uses dlopen
>>   or tries to do an executable mmap. Unfortunately this feature is the one
>>   that could create most problems and should be enabled only after careful
>>   evaluation.

This seems like a variation on the execstack case, and we should be
able to detect the state and choose a behavior based on system settings,
and a smarter version (as SARA has) would track RELRO pages waiting for
the loader to make them read-only.

SARA was proposed with a set of feature flags[2]; quoting here:
>> | W^X                          |  0x0008  |

This is the basic property, refusing PROT_WRITE | PROT_EXEC. I note that
SARA also rejects opening /proc/$pid/mem with FMODE_WRITE when this is
enabled for a process. (It likely should extend to process_vm_write()
too.)

>> | W!->X Heap                   |  0x0001  |
>> | W!->X Stack                  |  0x0002  |
>> | W!->X Other memory           |  0x0004  |

This is for the vma history tracking, and I don't think we need to
separate this by memory type? It's nice to have the granularity, but for
a first-pass it seems like overkill? Maybe I'm missing some detail.

>> | Don't enforce, just complain |  0x0010  |
>> | Be Verbose                   |  0x0020  |

Unclear if these would work well with a non-LSM approach.

>> | Executable MMAP prevention   |  0x0040  |

This is the relro detection piece.

>> | Trampoline emulation         |  0x0100  |

This is a more advanced case for emulating execstack, but if we can just
ignore execstack entirely, this can go away?

>> | Children will inherit flags  |  0x0200  |

Should a process have that control?

>> | Force W^X on setprocattr     |  0x0080  |

This is a "seal" trigger, which could be done through prctl().

It looks like a bunch of the features are designed around having as much
as possible enabled at exec time, and then tightening it further as
various things are finished (e.g. execstack, relro, sealing, etc), which
is, I think, what would still be needed for a process launcher to be
able to enable this kind of protection. (i.e. hoping the process calls a
prctl() to enable the protection isn't going to work well with systemd.)

So, I *think* we could have a minimal form with these considerations:
 - execstack: declare it distinctly incompatible.
 - relro: I think this is solved with BIND_NOW. It's been a while since
   I looked deeply at this, but I think under BIND_NOW, the (executable)
   PLT doesn't ever need to be writable (since it points into the GOT),
   and the (initially writable) GOT is already never executable. This
   needs to be verified...
 - JITs can be allowed with a ELF flag and can choose to opt-in with
   a prctl().

-Kees

[1] https://lore.kernel.org/lkml/1562410493-8661-1-git-send-email-s.mesoraca16@gmail.com/
[2] https://lore.kernel.org/lkml/1562410493-8661-2-git-send-email-s.mesoraca16@gmail.com/

-- 
Kees Cook

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 13:01     ` Catalin Marinas
@ 2022-04-20 19:34       ` Topi Miettinen
  -1 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-20 19:34 UTC (permalink / raw)
  To: Catalin Marinas, Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 20.4.2022 16.01, Catalin Marinas wrote:
> On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
>> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>>> The background to this is that systemd has a configuration option called
>>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>>> is to prevent a user task from inadvertently creating an executable
>>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>>> it cannot detect mappings that were previously writeable but
>>> subsequently changed to read-only. Therefore the filter simply rejects
>>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>>> support (Branch Target Identification), the dynamic loader cannot change
>>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>>> For libraries, it can resort to unmapping and re-mapping but for the
>>> main executable it does not have a file descriptor. The original bug
>>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>>> for libraries - [3].
>>
>> Right, so, the systemd filter is a big hammer solution for the kernel
>> not having a very easy way to provide W^X mapping protections to
>> userspace. There's stuff in SELinux, and there have been several
>> attempts[1] at other LSMs to do it too, but nothing stuck.
>>
>> Given the filter, and the implementation of how to enable BTI, I see two
>> solutions:
>>
>> - provide a way to do W^X so systemd can implement the feature differently
>> - provide a way to turn on BTI separate from mprotect to bypass the filter
>>
>> I would agree, the latter seems like the greater hack,
> 
> We discussed such hacks in the past but they are just working around the
> fundamental issue - systemd wants W^X but with BPF it can only achieve
> it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> was already executable. If we find a better solution for W^X, we
> wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> 
>> so I welcome
>> this RFC, though I think it might need to explore a bit of the feature
>> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
>> it risks being too narrowly implemented. For example, playing well with
>> JITs should be part of the design, and will likely need some kind of
>> ELF flags and/or "sealing" mode, and to handle the vma alias case as
>> Jann Horn pointed out[2].
> 
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.
> 
> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> 
> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE
> 
> (needs some more thinking)
> 

For systemd, feature compatibility with the BPF version is important so 
that we could automatically switch to the kernel version once available 
without regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) 
should match exactly what MemoryDenyWriteExecute=yes as implemented with 
BPF has: only forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). 
Like BPF, once installed there should be no way to escape and ELF flags 
should be also ignored. ARM BTI should be allowed though (allow 
PROT_EXEC|PROT_BTI if the old flags had PROT_EXEC).

Then we could have improved versions (other PR_MDWX_ prctls) with lots 
more checks. This could be enabled with MemoryDenyWriteExecute=strict or so.

Perhaps also more relaxed versions (like SARA) could be interesting 
(system service running Python with FFI, or perhaps JVM etc), enabled 
with for example MemoryDenyWriteExecute=trampolines. That way even those 
programs would get some protection (though there would be a gap in the 
defences).

-Topi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-20 19:34       ` Topi Miettinen
  0 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-20 19:34 UTC (permalink / raw)
  To: Catalin Marinas, Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 20.4.2022 16.01, Catalin Marinas wrote:
> On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
>> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>>> The background to this is that systemd has a configuration option called
>>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>>> is to prevent a user task from inadvertently creating an executable
>>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>>> it cannot detect mappings that were previously writeable but
>>> subsequently changed to read-only. Therefore the filter simply rejects
>>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>>> support (Branch Target Identification), the dynamic loader cannot change
>>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>>> For libraries, it can resort to unmapping and re-mapping but for the
>>> main executable it does not have a file descriptor. The original bug
>>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>>> for libraries - [3].
>>
>> Right, so, the systemd filter is a big hammer solution for the kernel
>> not having a very easy way to provide W^X mapping protections to
>> userspace. There's stuff in SELinux, and there have been several
>> attempts[1] at other LSMs to do it too, but nothing stuck.
>>
>> Given the filter, and the implementation of how to enable BTI, I see two
>> solutions:
>>
>> - provide a way to do W^X so systemd can implement the feature differently
>> - provide a way to turn on BTI separate from mprotect to bypass the filter
>>
>> I would agree, the latter seems like the greater hack,
> 
> We discussed such hacks in the past but they are just working around the
> fundamental issue - systemd wants W^X but with BPF it can only achieve
> it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> was already executable. If we find a better solution for W^X, we
> wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> 
>> so I welcome
>> this RFC, though I think it might need to explore a bit of the feature
>> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
>> it risks being too narrowly implemented. For example, playing well with
>> JITs should be part of the design, and will likely need some kind of
>> ELF flags and/or "sealing" mode, and to handle the vma alias case as
>> Jann Horn pointed out[2].
> 
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.
> 
> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> 
> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE
> 
> (needs some more thinking)
> 

For systemd, feature compatibility with the BPF version is important so 
that we could automatically switch to the kernel version once available 
without regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) 
should match exactly what MemoryDenyWriteExecute=yes as implemented with 
BPF has: only forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). 
Like BPF, once installed there should be no way to escape and ELF flags 
should be also ignored. ARM BTI should be allowed though (allow 
PROT_EXEC|PROT_BTI if the old flags had PROT_EXEC).

Then we could have improved versions (other PR_MDWX_ prctls) with lots 
more checks. This could be enabled with MemoryDenyWriteExecute=strict or so.

Perhaps also more relaxed versions (like SARA) could be interesting 
(system service running Python with FFI, or perhaps JVM etc), enabled 
with for example MemoryDenyWriteExecute=trampolines. That way even those 
programs would get some protection (though there would be a gap in the 
defences).

-Topi

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 19:34       ` Topi Miettinen
@ 2022-04-20 23:21         ` Kees Cook
  -1 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-20 23:21 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> On 20.4.2022 16.01, Catalin Marinas wrote:
> > On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> > > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > > > The background to this is that systemd has a configuration option called
> > > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > > > is to prevent a user task from inadvertently creating an executable
> > > > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > > > it cannot detect mappings that were previously writeable but
> > > > subsequently changed to read-only. Therefore the filter simply rejects
> > > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > > > support (Branch Target Identification), the dynamic loader cannot change
> > > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > > > For libraries, it can resort to unmapping and re-mapping but for the
> > > > main executable it does not have a file descriptor. The original bug
> > > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > > > for libraries - [3].
> > > 
> > > Right, so, the systemd filter is a big hammer solution for the kernel
> > > not having a very easy way to provide W^X mapping protections to
> > > userspace. There's stuff in SELinux, and there have been several
> > > attempts[1] at other LSMs to do it too, but nothing stuck.
> > > 
> > > Given the filter, and the implementation of how to enable BTI, I see two
> > > solutions:
> > > 
> > > - provide a way to do W^X so systemd can implement the feature differently
> > > - provide a way to turn on BTI separate from mprotect to bypass the filter
> > > 
> > > I would agree, the latter seems like the greater hack,
> > 
> > We discussed such hacks in the past but they are just working around the
> > fundamental issue - systemd wants W^X but with BPF it can only achieve
> > it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> > was already executable. If we find a better solution for W^X, we
> > wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> > 
> > > so I welcome
> > > this RFC, though I think it might need to explore a bit of the feature
> > > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> > > it risks being too narrowly implemented. For example, playing well with
> > > JITs should be part of the design, and will likely need some kind of
> > > ELF flags and/or "sealing" mode, and to handle the vma alias case as
> > > Jann Horn pointed out[2].
> > 
> > I agree we should look at what we want to cover, though trying to avoid
> > re-inventing SELinux. With this patchset I went for the minimum that
> > systemd MDWE does with BPF.
> > 
> > I think JITs get around it using something like memfd with two separate
> > mappings to the same page. We could try to prevent such aliases but
> > allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> > 
> > Anyway, with a prctl() we can allow finer-grained control starting with
> > anonymous and file mappings and later extending to vma aliases,
> > writeable files etc. On top we can add a seal mask so that a process
> > cannot disable a control was set. Something like (I'm not good at
> > names):
> > 
> > 	prctl(PR_MDWX_SET, flags, seal_mask);
> > 	prctl(PR_MDWX_GET);
> > 
> > with flags like:
> > 
> > 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> > 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> > 	PR_MDWX_WRITEABLE_FILE
> > 
> > (needs some more thinking)
> > 
> 
> For systemd, feature compatibility with the BPF version is important so that
> we could automatically switch to the kernel version once available without
> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> installed there should be no way to escape and ELF flags should be also
> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> old flags had PROT_EXEC).
> 
> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> 
> Perhaps also more relaxed versions (like SARA) could be interesting (system
> service running Python with FFI, or perhaps JVM etc), enabled with for
> example MemoryDenyWriteExecute=trampolines. That way even those programs
> would get some protection (though there would be a gap in the defences).

Yup, I think we're all on the same page. Catalin, can you respin with a
prctl for enabling MDWE? I propose just:

	prctl(PR_MDWX_SET, flags);
	prctl(PR_MDWX_GET);

	PR_MDWX_FLAG_MMAP
		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

I don't think anything should be allowed to be disabled once set.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-20 23:21         ` Kees Cook
  0 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-20 23:21 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> On 20.4.2022 16.01, Catalin Marinas wrote:
> > On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> > > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > > > The background to this is that systemd has a configuration option called
> > > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > > > is to prevent a user task from inadvertently creating an executable
> > > > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > > > it cannot detect mappings that were previously writeable but
> > > > subsequently changed to read-only. Therefore the filter simply rejects
> > > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > > > support (Branch Target Identification), the dynamic loader cannot change
> > > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > > > For libraries, it can resort to unmapping and re-mapping but for the
> > > > main executable it does not have a file descriptor. The original bug
> > > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > > > for libraries - [3].
> > > 
> > > Right, so, the systemd filter is a big hammer solution for the kernel
> > > not having a very easy way to provide W^X mapping protections to
> > > userspace. There's stuff in SELinux, and there have been several
> > > attempts[1] at other LSMs to do it too, but nothing stuck.
> > > 
> > > Given the filter, and the implementation of how to enable BTI, I see two
> > > solutions:
> > > 
> > > - provide a way to do W^X so systemd can implement the feature differently
> > > - provide a way to turn on BTI separate from mprotect to bypass the filter
> > > 
> > > I would agree, the latter seems like the greater hack,
> > 
> > We discussed such hacks in the past but they are just working around the
> > fundamental issue - systemd wants W^X but with BPF it can only achieve
> > it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> > was already executable. If we find a better solution for W^X, we
> > wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> > 
> > > so I welcome
> > > this RFC, though I think it might need to explore a bit of the feature
> > > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> > > it risks being too narrowly implemented. For example, playing well with
> > > JITs should be part of the design, and will likely need some kind of
> > > ELF flags and/or "sealing" mode, and to handle the vma alias case as
> > > Jann Horn pointed out[2].
> > 
> > I agree we should look at what we want to cover, though trying to avoid
> > re-inventing SELinux. With this patchset I went for the minimum that
> > systemd MDWE does with BPF.
> > 
> > I think JITs get around it using something like memfd with two separate
> > mappings to the same page. We could try to prevent such aliases but
> > allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> > 
> > Anyway, with a prctl() we can allow finer-grained control starting with
> > anonymous and file mappings and later extending to vma aliases,
> > writeable files etc. On top we can add a seal mask so that a process
> > cannot disable a control was set. Something like (I'm not good at
> > names):
> > 
> > 	prctl(PR_MDWX_SET, flags, seal_mask);
> > 	prctl(PR_MDWX_GET);
> > 
> > with flags like:
> > 
> > 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> > 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> > 	PR_MDWX_WRITEABLE_FILE
> > 
> > (needs some more thinking)
> > 
> 
> For systemd, feature compatibility with the BPF version is important so that
> we could automatically switch to the kernel version once available without
> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> installed there should be no way to escape and ELF flags should be also
> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> old flags had PROT_EXEC).
> 
> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> 
> Perhaps also more relaxed versions (like SARA) could be interesting (system
> service running Python with FFI, or perhaps JVM etc), enabled with for
> example MemoryDenyWriteExecute=trampolines. That way even those programs
> would get some protection (though there would be a gap in the defences).

Yup, I think we're all on the same page. Catalin, can you respin with a
prctl for enabling MDWE? I propose just:

	prctl(PR_MDWX_SET, flags);
	prctl(PR_MDWX_GET);

	PR_MDWX_FLAG_MMAP
		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

I don't think anything should be allowed to be disabled once set.

-- 
Kees Cook

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 23:21         ` Kees Cook
@ 2022-04-21 15:35           ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 15:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > For systemd, feature compatibility with the BPF version is important so that
> > we could automatically switch to the kernel version once available without
> > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > installed there should be no way to escape and ELF flags should be also
> > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > old flags had PROT_EXEC).

I agree.

> > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > 
> > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > service running Python with FFI, or perhaps JVM etc), enabled with for
> > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > would get some protection (though there would be a gap in the defences).
> 
> Yup, I think we're all on the same page. Catalin, can you respin with a
> prctl for enabling MDWE? I propose just:
> 
> 	prctl(PR_MDWX_SET, flags);
> 	prctl(PR_MDWX_GET);
> 
> 	PR_MDWX_FLAG_MMAP
> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
the vma is not already PROT_EXEC? The latter is closer to the current
systemd approach. The former allows an mprotect(PROT_EXEC) if the
mapping was PROT_READ only for example.

I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
replacement for BPF MDWE.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 15:35           ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 15:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > For systemd, feature compatibility with the BPF version is important so that
> > we could automatically switch to the kernel version once available without
> > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > installed there should be no way to escape and ELF flags should be also
> > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > old flags had PROT_EXEC).

I agree.

> > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > 
> > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > service running Python with FFI, or perhaps JVM etc), enabled with for
> > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > would get some protection (though there would be a gap in the defences).
> 
> Yup, I think we're all on the same page. Catalin, can you respin with a
> prctl for enabling MDWE? I propose just:
> 
> 	prctl(PR_MDWX_SET, flags);
> 	prctl(PR_MDWX_GET);
> 
> 	PR_MDWX_FLAG_MMAP
> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
the vma is not already PROT_EXEC? The latter is closer to the current
systemd approach. The former allows an mprotect(PROT_EXEC) if the
mapping was PROT_READ only for example.

I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
replacement for BPF MDWE.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 15:35           ` Catalin Marinas
@ 2022-04-21 16:42             ` Kees Cook
  -1 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-21 16:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> > On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > > For systemd, feature compatibility with the BPF version is important so that
> > > we could automatically switch to the kernel version once available without
> > > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > > installed there should be no way to escape and ELF flags should be also
> > > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > > old flags had PROT_EXEC).
> 
> I agree.
> 
> > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > > 
> > > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > > service running Python with FFI, or perhaps JVM etc), enabled with for
> > > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > > would get some protection (though there would be a gap in the defences).
> > 
> > Yup, I think we're all on the same page. Catalin, can you respin with a
> > prctl for enabling MDWE? I propose just:
> > 
> > 	prctl(PR_MDWX_SET, flags);
> > 	prctl(PR_MDWX_GET);
> > 
> > 	PR_MDWX_FLAG_MMAP
> > 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> > 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.

I think "was PROT_WRITE" is an important part of the defense that
couldn't be done with a simple seccomp filter (which is why the filter
ended up being a problem in the first place).

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 16:42             ` Kees Cook
  0 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-21 16:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> > On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > > For systemd, feature compatibility with the BPF version is important so that
> > > we could automatically switch to the kernel version once available without
> > > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > > installed there should be no way to escape and ELF flags should be also
> > > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > > old flags had PROT_EXEC).
> 
> I agree.
> 
> > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > > 
> > > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > > service running Python with FFI, or perhaps JVM etc), enabled with for
> > > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > > would get some protection (though there would be a gap in the defences).
> > 
> > Yup, I think we're all on the same page. Catalin, can you respin with a
> > prctl for enabling MDWE? I propose just:
> > 
> > 	prctl(PR_MDWX_SET, flags);
> > 	prctl(PR_MDWX_GET);
> > 
> > 	PR_MDWX_FLAG_MMAP
> > 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> > 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.

I think "was PROT_WRITE" is an important part of the defense that
couldn't be done with a simple seccomp filter (which is why the filter
ended up being a problem in the first place).

-- 
Kees Cook

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 15:35           ` Catalin Marinas
@ 2022-04-21 16:48             ` Topi Miettinen
  -1 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-21 16:48 UTC (permalink / raw)
  To: Catalin Marinas, Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 21.4.2022 18.35, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
>> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
>>> For systemd, feature compatibility with the BPF version is important so that
>>> we could automatically switch to the kernel version once available without
>>> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
>>> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
>>> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
>>> installed there should be no way to escape and ELF flags should be also
>>> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
>>> old flags had PROT_EXEC).
> 
> I agree.
> 
>>> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
>>> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
>>>
>>> Perhaps also more relaxed versions (like SARA) could be interesting (system
>>> service running Python with FFI, or perhaps JVM etc), enabled with for
>>> example MemoryDenyWriteExecute=trampolines. That way even those programs
>>> would get some protection (though there would be a gap in the defences).
>>
>> Yup, I think we're all on the same page. Catalin, can you respin with a
>> prctl for enabling MDWE? I propose just:
>>
>> 	prctl(PR_MDWX_SET, flags);
>> 	prctl(PR_MDWX_GET);
>>
>> 	PR_MDWX_FLAG_MMAP
>> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
>> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.
> 

I think we'd want existing installations with MemoryDenyWriteExecute=yes 
not start failing when the implementation is changed to in-kernel 
version. The implementation could be very simple and not even check 
existing PROT_ flags (except for BTI case) to be maximally compatible to 
BPF version. So I'd leave "was PROT_WRITE" and other checks to more 
advanced versions, enabled with a different PR_MDWX_FLAG_.

-Topi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 16:48             ` Topi Miettinen
  0 siblings, 0 replies; 50+ messages in thread
From: Topi Miettinen @ 2022-04-21 16:48 UTC (permalink / raw)
  To: Catalin Marinas, Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 21.4.2022 18.35, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
>> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
>>> For systemd, feature compatibility with the BPF version is important so that
>>> we could automatically switch to the kernel version once available without
>>> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
>>> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
>>> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
>>> installed there should be no way to escape and ELF flags should be also
>>> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
>>> old flags had PROT_EXEC).
> 
> I agree.
> 
>>> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
>>> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
>>>
>>> Perhaps also more relaxed versions (like SARA) could be interesting (system
>>> service running Python with FFI, or perhaps JVM etc), enabled with for
>>> example MemoryDenyWriteExecute=trampolines. That way even those programs
>>> would get some protection (though there would be a gap in the defences).
>>
>> Yup, I think we're all on the same page. Catalin, can you respin with a
>> prctl for enabling MDWE? I propose just:
>>
>> 	prctl(PR_MDWX_SET, flags);
>> 	prctl(PR_MDWX_GET);
>>
>> 	PR_MDWX_FLAG_MMAP
>> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
>> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.
> 

I think we'd want existing installations with MemoryDenyWriteExecute=yes 
not start failing when the implementation is changed to in-kernel 
version. The implementation could be very simple and not even check 
existing PROT_ flags (except for BTI case) to be maximally compatible to 
BPF version. So I'd leave "was PROT_WRITE" and other checks to more 
advanced versions, enabled with a different PR_MDWX_FLAG_.

-Topi

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 16:42             ` Kees Cook
@ 2022-04-21 17:24               ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 17:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think "was PROT_WRITE" is an important part of the defense that
> couldn't be done with a simple seccomp filter (which is why the filter
> ended up being a problem in the first place).

I would say "was PROT_WRITE" is slightly more relaxed than "is not
already PROT_EXEC". The seccomp filter can't do "is not already
PROT_EXEC" either since it only checks the mprotect() arguments, not the
current vma flags.

So we have (with sub-cases):

1. Current BPF filter:

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails

2. "is not already PROT_EXEC":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails

3. "is or was not PROT_WRITE":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// passes

   d)	mmap(PROT_READ|PROT_WRITE);
	mprotect(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails (was write)

The current seccomp filter is the strictest. If we go for (2), it allows
PROT_BTI as in 2.b but prevents 2.c (as would the current seccomp
filter). (3) relaxes 2.c as in 3.c while still preventing 3.d. Both (1)
and (2) prevent 3.d by simply rejecting mprotect(PROT_EXEC).

If we don't care about 3.c, we might as well go for (2). I don't mind,
already went for (3) in this series. I think either of them would not be
a regression on MDWE, unless there is some test that attempts 3.c and
expects it to fail.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 17:24               ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 17:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think "was PROT_WRITE" is an important part of the defense that
> couldn't be done with a simple seccomp filter (which is why the filter
> ended up being a problem in the first place).

I would say "was PROT_WRITE" is slightly more relaxed than "is not
already PROT_EXEC". The seccomp filter can't do "is not already
PROT_EXEC" either since it only checks the mprotect() arguments, not the
current vma flags.

So we have (with sub-cases):

1. Current BPF filter:

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails

2. "is not already PROT_EXEC":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails

3. "is or was not PROT_WRITE":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// passes

   d)	mmap(PROT_READ|PROT_WRITE);
	mprotect(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails (was write)

The current seccomp filter is the strictest. If we go for (2), it allows
PROT_BTI as in 2.b but prevents 2.c (as would the current seccomp
filter). (3) relaxes 2.c as in 3.c while still preventing 3.d. Both (1)
and (2) prevent 3.d by simply rejecting mprotect(PROT_EXEC).

If we don't care about 3.c, we might as well go for (2). I don't mind,
already went for (3) in this series. I think either of them would not be
a regression on MDWE, unless there is some test that attempts 3.c and
expects it to fail.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 16:48             ` Topi Miettinen
@ 2022-04-21 17:28               ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 17:28 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Kees Cook, Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 07:48:27PM +0300, Topi Miettinen wrote:
> On 21.4.2022 18.35, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think we'd want existing installations with MemoryDenyWriteExecute=yes not
> start failing when the implementation is changed to in-kernel version. The
> implementation could be very simple and not even check existing PROT_ flags
> (except for BTI case) to be maximally compatible to BPF version.

It would have to check the existing flags otherwise this could have been
implemented in the BPF filter. The dynamic loader (or kernel loader)
first mmap(PROT_READ|PROT_EXEC) and, if the BTI note is found, it
switches it to mprotect(PROT_READ|PROT_EXEC|PROT_BTI). If we allowed
this to pass simply because of PROT_BTI, one could create such
executable mapping even if it is (or was) writeable.

So we can either allow mprotect(PROT_EXEC) if the mapping was never
writeable or we allow mprotect(PROT_EXEC) if the mapping is already
PROT_EXEC (and the check for W^X was previously done by mmap()).

> So I'd leave "was PROT_WRITE" and other checks to more advanced
> versions, enabled with a different PR_MDWX_FLAG_.

This works for me as well. See my reply to Kees for the use-cases.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 17:28               ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 17:28 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Kees Cook, Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 07:48:27PM +0300, Topi Miettinen wrote:
> On 21.4.2022 18.35, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think we'd want existing installations with MemoryDenyWriteExecute=yes not
> start failing when the implementation is changed to in-kernel version. The
> implementation could be very simple and not even check existing PROT_ flags
> (except for BTI case) to be maximally compatible to BPF version.

It would have to check the existing flags otherwise this could have been
implemented in the BPF filter. The dynamic loader (or kernel loader)
first mmap(PROT_READ|PROT_EXEC) and, if the BTI note is found, it
switches it to mprotect(PROT_READ|PROT_EXEC|PROT_BTI). If we allowed
this to pass simply because of PROT_BTI, one could create such
executable mapping even if it is (or was) writeable.

So we can either allow mprotect(PROT_EXEC) if the mapping was never
writeable or we allow mprotect(PROT_EXEC) if the mapping is already
PROT_EXEC (and the check for W^X was previously done by mmap()).

> So I'd leave "was PROT_WRITE" and other checks to more advanced
> versions, enabled with a different PR_MDWX_FLAG_.

This works for me as well. See my reply to Kees for the use-cases.

Thanks.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
  2022-04-13 13:49   ` Catalin Marinas
@ 2022-04-21 17:37     ` David Hildenbrand
  -1 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2022-04-21 17:37 UTC (permalink / raw)
  To: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

On 13.04.22 15:49, Catalin Marinas wrote:
> The aim of such policy is to prevent a user task from inadvertently
> creating an executable mapping that is or was writeable (and
> subsequently made read-only).
> 
> An example of mmap() returning -EACCESS if the policy is enabled:
> 
> 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
> 
> Similarly, mprotect() would return -EACCESS below:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
> 
> With the past vma writeable permission tracking, mprotect() below would
> also fail with -EACCESS:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_EXEC);
> 
> While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
> mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
> systemd MDWE approach via SECCOMP BPF filters), we want the following
> scenario to succeed:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
> 
> where PROT_BTI enables branch tracking identification on arm64.
> 
> The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
> and execve(), was made by analogy to READ_IMPLIES_EXEC.
> 
> Note that it is sufficient to check for VM_WAS_WRITE in
> map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>

How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?

Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
actually map the PTE writable, but set it dirty. GUP will retry, find a
R/O pte that is dirty and where it knows that it broke COW and will
allow the read access, although the PTE is R/O.

That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
kernel sections, but it's used elsewhere for page pinning as well.

My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
now to bypass that mechanism, I might be wrong.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
@ 2022-04-21 17:37     ` David Hildenbrand
  0 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2022-04-21 17:37 UTC (permalink / raw)
  To: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek
  Cc: Will Deacon, Alexander Viro, Eric Biederman, Kees Cook,
	Szabolcs Nagy, Mark Brown, Jeremy Linton, Topi Miettinen,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel

On 13.04.22 15:49, Catalin Marinas wrote:
> The aim of such policy is to prevent a user task from inadvertently
> creating an executable mapping that is or was writeable (and
> subsequently made read-only).
> 
> An example of mmap() returning -EACCESS if the policy is enabled:
> 
> 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
> 
> Similarly, mprotect() would return -EACCESS below:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
> 
> With the past vma writeable permission tracking, mprotect() below would
> also fail with -EACCESS:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_EXEC);
> 
> While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
> mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
> systemd MDWE approach via SECCOMP BPF filters), we want the following
> scenario to succeed:
> 
> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
> 
> where PROT_BTI enables branch tracking identification on arm64.
> 
> The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
> and execve(), was made by analogy to READ_IMPLIES_EXEC.
> 
> Note that it is sufficient to check for VM_WAS_WRITE in
> map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
> 
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>

How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?

Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
actually map the PTE writable, but set it dirty. GUP will retry, find a
R/O pte that is dirty and where it knows that it broke COW and will
allow the read access, although the PTE is R/O.

That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
kernel sections, but it's used elsewhere for page pinning as well.

My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
now to bypass that mechanism, I might be wrong.

-- 
Thanks,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 17:24               ` Catalin Marinas
@ 2022-04-21 17:41                 ` Kees Cook
  -1 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-21 17:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > mapping was PROT_READ only for example.
> > > 
> > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > replacement for BPF MDWE.
> > 
> > I think "was PROT_WRITE" is an important part of the defense that
> > couldn't be done with a simple seccomp filter (which is why the filter
> > ended up being a problem in the first place).
> 
> I would say "was PROT_WRITE" is slightly more relaxed than "is not
> already PROT_EXEC". The seccomp filter can't do "is not already
> PROT_EXEC" either since it only checks the mprotect() arguments, not the
> current vma flags.
> 
> So we have (with sub-cases):
> 
> 1. Current BPF filter:
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
>
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 2. "is not already PROT_EXEC":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 3. "is or was not PROT_WRITE":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// passes
> 
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails

[edited above to show each case]

restated what was already summarized:
Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).

> If we don't care about 3.c, we might as well go for (2). I don't mind,
> already went for (3) in this series. I think either of them would not be
> a regression on MDWE, unless there is some test that attempts 3.c and
> expects it to fail.

I should stop arguing for a less restrictive mode. ;) It just feels weird
that the combinations are API-mediated, rather than logically defined:
I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
opposed to saying "the vma cannot be executable if it is or ever was
writable". I find the latter much easier to reason about as far as the
expectations of system state.

So, I'd still prefer 3, as that was the _goal_ of the systemd MDWE
seccomp filter, but yes, 2 does provide the same protection while
allowing BTI.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 17:41                 ` Kees Cook
  0 siblings, 0 replies; 50+ messages in thread
From: Kees Cook @ 2022-04-21 17:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > mapping was PROT_READ only for example.
> > > 
> > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > replacement for BPF MDWE.
> > 
> > I think "was PROT_WRITE" is an important part of the defense that
> > couldn't be done with a simple seccomp filter (which is why the filter
> > ended up being a problem in the first place).
> 
> I would say "was PROT_WRITE" is slightly more relaxed than "is not
> already PROT_EXEC". The seccomp filter can't do "is not already
> PROT_EXEC" either since it only checks the mprotect() arguments, not the
> current vma flags.
> 
> So we have (with sub-cases):
> 
> 1. Current BPF filter:
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
>
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 2. "is not already PROT_EXEC":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 3. "is or was not PROT_WRITE":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// passes
> 
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails

[edited above to show each case]

restated what was already summarized:
Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).

> If we don't care about 3.c, we might as well go for (2). I don't mind,
> already went for (3) in this series. I think either of them would not be
> a regression on MDWE, unless there is some test that attempts 3.c and
> expects it to fail.

I should stop arguing for a less restrictive mode. ;) It just feels weird
that the combinations are API-mediated, rather than logically defined:
I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
opposed to saying "the vma cannot be executable if it is or ever was
writable". I find the latter much easier to reason about as far as the
expectations of system state.

So, I'd still prefer 3, as that was the _goal_ of the systemd MDWE
seccomp filter, but yes, 2 does provide the same protection while
allowing BTI.

-- 
Kees Cook

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 17:41                 ` Kees Cook
@ 2022-04-21 18:33                   ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 18:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 10:41:43AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > > mapping was PROT_READ only for example.
> > > > 
> > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > > replacement for BPF MDWE.
> > > 
> > > I think "was PROT_WRITE" is an important part of the defense that
> > > couldn't be done with a simple seccomp filter (which is why the filter
> > > ended up being a problem in the first place).
> > 
> > I would say "was PROT_WRITE" is slightly more relaxed than "is not
> > already PROT_EXEC". The seccomp filter can't do "is not already
> > PROT_EXEC" either since it only checks the mprotect() arguments, not the
> > current vma flags.
> > 
> > So we have (with sub-cases):
> > 
> > 1. Current BPF filter:
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
> >
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 2. "is not already PROT_EXEC":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 3. "is or was not PROT_WRITE":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// passes
> > 
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >	 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> [edited above to show each case]

Thanks, I was in a rush to get home ;).

> restated what was already summarized:
> Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).
> 
> > If we don't care about 3.c, we might as well go for (2). I don't mind,
> > already went for (3) in this series. I think either of them would not be
> > a regression on MDWE, unless there is some test that attempts 3.c and
> > expects it to fail.
> 
> I should stop arguing for a less restrictive mode. ;) It just feels weird
> that the combinations are API-mediated, rather than logically defined:
> I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
> opposed to saying "the vma cannot be executable if it is or ever was
> writable". I find the latter much easier to reason about as far as the
> expectations of system state.

I had the same reasoning, hence option 3 in this series. I prefer to
treat mmap(PROT_READ|PROT_EXEC) and mprotect(PROT_READ|PROT_EXEC) in a
similar way.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
@ 2022-04-21 18:33                   ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-21 18:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 10:41:43AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > > mapping was PROT_READ only for example.
> > > > 
> > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > > replacement for BPF MDWE.
> > > 
> > > I think "was PROT_WRITE" is an important part of the defense that
> > > couldn't be done with a simple seccomp filter (which is why the filter
> > > ended up being a problem in the first place).
> > 
> > I would say "was PROT_WRITE" is slightly more relaxed than "is not
> > already PROT_EXEC". The seccomp filter can't do "is not already
> > PROT_EXEC" either since it only checks the mprotect() arguments, not the
> > current vma flags.
> > 
> > So we have (with sub-cases):
> > 
> > 1. Current BPF filter:
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
> >
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 2. "is not already PROT_EXEC":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 3. "is or was not PROT_WRITE":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// passes
> > 
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >	 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> [edited above to show each case]

Thanks, I was in a rush to get home ;).

> restated what was already summarized:
> Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).
> 
> > If we don't care about 3.c, we might as well go for (2). I don't mind,
> > already went for (3) in this series. I think either of them would not be
> > a regression on MDWE, unless there is some test that attempts 3.c and
> > expects it to fail.
> 
> I should stop arguing for a less restrictive mode. ;) It just feels weird
> that the combinations are API-mediated, rather than logically defined:
> I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
> opposed to saying "the vma cannot be executable if it is or ever was
> writable". I find the latter much easier to reason about as far as the
> expectations of system state.

I had the same reasoning, hence option 3 in this series. I prefer to
treat mmap(PROT_READ|PROT_EXEC) and mprotect(PROT_READ|PROT_EXEC) in a
similar way.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
  2022-04-21 17:37     ` David Hildenbrand
@ 2022-04-22 10:28       ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-22 10:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
> On 13.04.22 15:49, Catalin Marinas wrote:
> > The aim of such policy is to prevent a user task from inadvertently
> > creating an executable mapping that is or was writeable (and
> > subsequently made read-only).
> > 
> > An example of mmap() returning -EACCESS if the policy is enabled:
> > 
> > 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
> > 
> > Similarly, mprotect() would return -EACCESS below:
> > 
> > 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
> > 
> > With the past vma writeable permission tracking, mprotect() below would
> > also fail with -EACCESS:
> > 
> > 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_EXEC);
> > 
> > While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
> > mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
> > systemd MDWE approach via SECCOMP BPF filters), we want the following
> > scenario to succeed:
> > 
> > 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
> > 
> > where PROT_BTI enables branch tracking identification on arm64.
> > 
> > The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
> > and execve(), was made by analogy to READ_IMPLIES_EXEC.
> > 
> > Note that it is sufficient to check for VM_WAS_WRITE in
> > map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
> > 
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?

For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we
can't really have MAYWRITE^EXEC. The basic feature aims to avoid user
vulnerabilities where a buffer is mapped both writeable and executable.
Of course, it can be expanded with additional prctl() flags to cover
other cases.

> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
> actually map the PTE writable, but set it dirty. GUP will retry, find a
> R/O pte that is dirty and where it knows that it broke COW and will
> allow the read access, although the PTE is R/O.
> 
> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
> kernel sections, but it's used elsewhere for page pinning as well.
> 
> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
> now to bypass that mechanism, I might be wrong.

GUP can be used to bypass this. But if an attacker can trigger such GUP
paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
checks on those paths (and reject the syscall) rather than on
mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.

Not sure what would break if we prevent GUP(FOLL_WRITE|FOLL_FORCE) when
the vma is !VM_WRITE, basically removing FOLL_FORCE. I guess for
ptrace() and uprobes that's fine. We could also make this only about
VM_EXEC rather than VM_WRITE, though  we'd probably need to set
VM_WAS_WRITE if we ever had a GUP(FOLL_WRITE|FOLL_FORCE) in order to
prevent a subsequent mprotect(PROT_EXEC).

Anyway, this can be a new flag. My first aim is to get the basics
working similarly to systemd's MDWE.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
@ 2022-04-22 10:28       ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-22 10:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
> On 13.04.22 15:49, Catalin Marinas wrote:
> > The aim of such policy is to prevent a user task from inadvertently
> > creating an executable mapping that is or was writeable (and
> > subsequently made read-only).
> > 
> > An example of mmap() returning -EACCESS if the policy is enabled:
> > 
> > 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
> > 
> > Similarly, mprotect() would return -EACCESS below:
> > 
> > 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
> > 
> > With the past vma writeable permission tracking, mprotect() below would
> > also fail with -EACCESS:
> > 
> > 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_EXEC);
> > 
> > While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
> > mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
> > systemd MDWE approach via SECCOMP BPF filters), we want the following
> > scenario to succeed:
> > 
> > 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
> > 
> > where PROT_BTI enables branch tracking identification on arm64.
> > 
> > The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
> > and execve(), was made by analogy to READ_IMPLIES_EXEC.
> > 
> > Note that it is sufficient to check for VM_WAS_WRITE in
> > map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
> > 
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?

For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we
can't really have MAYWRITE^EXEC. The basic feature aims to avoid user
vulnerabilities where a buffer is mapped both writeable and executable.
Of course, it can be expanded with additional prctl() flags to cover
other cases.

> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
> actually map the PTE writable, but set it dirty. GUP will retry, find a
> R/O pte that is dirty and where it knows that it broke COW and will
> allow the read access, although the PTE is R/O.
> 
> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
> kernel sections, but it's used elsewhere for page pinning as well.
> 
> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
> now to bypass that mechanism, I might be wrong.

GUP can be used to bypass this. But if an attacker can trigger such GUP
paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
checks on those paths (and reject the syscall) rather than on
mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.

Not sure what would break if we prevent GUP(FOLL_WRITE|FOLL_FORCE) when
the vma is !VM_WRITE, basically removing FOLL_FORCE. I guess for
ptrace() and uprobes that's fine. We could also make this only about
VM_EXEC rather than VM_WRITE, though  we'd probably need to set
VM_WAS_WRITE if we ever had a GUP(FOLL_WRITE|FOLL_FORCE) in order to
prevent a subsequent mprotect(PROT_EXEC).

Anyway, this can be a new flag. My first aim is to get the basics
working similarly to systemd's MDWE.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
  2022-04-22 10:28       ` Catalin Marinas
@ 2022-04-22 11:04         ` David Hildenbrand
  -1 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2022-04-22 11:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On 22.04.22 12:28, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
>> On 13.04.22 15:49, Catalin Marinas wrote:
>>> The aim of such policy is to prevent a user task from inadvertently
>>> creating an executable mapping that is or was writeable (and
>>> subsequently made read-only).
>>>
>>> An example of mmap() returning -EACCESS if the policy is enabled:
>>>
>>> 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
>>>
>>> Similarly, mprotect() would return -EACCESS below:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
>>>
>>> With the past vma writeable permission tracking, mprotect() below would
>>> also fail with -EACCESS:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_EXEC);
>>>
>>> While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
>>> mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
>>> systemd MDWE approach via SECCOMP BPF filters), we want the following
>>> scenario to succeed:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
>>>
>>> where PROT_BTI enables branch tracking identification on arm64.
>>>
>>> The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
>>> and execve(), was made by analogy to READ_IMPLIES_EXEC.
>>>
>>> Note that it is sufficient to check for VM_WAS_WRITE in
>>> map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
>>>
>>> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Christoph Hellwig <hch@infradead.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>
>> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
>> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?
> 
> For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we
> can't really have MAYWRITE^EXEC. The basic feature aims to avoid user
> vulnerabilities where a buffer is mapped both writeable and executable.
> Of course, it can be expanded with additional prctl() flags to cover
> other cases.
> 
>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
>> actually map the PTE writable, but set it dirty. GUP will retry, find a
>> R/O pte that is dirty and where it knows that it broke COW and will
>> allow the read access, although the PTE is R/O.
>>
>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
>> kernel sections, but it's used elsewhere for page pinning as well.
>>
>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
>> now to bypass that mechanism, I might be wrong.
> 
> GUP can be used to bypass this. But if an attacker can trigger such GUP
> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
> checks on those paths (and reject the syscall) rather than on
> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
> 
> 

I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
unprivileged users.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
@ 2022-04-22 11:04         ` David Hildenbrand
  0 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2022-04-22 11:04 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On 22.04.22 12:28, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
>> On 13.04.22 15:49, Catalin Marinas wrote:
>>> The aim of such policy is to prevent a user task from inadvertently
>>> creating an executable mapping that is or was writeable (and
>>> subsequently made read-only).
>>>
>>> An example of mmap() returning -EACCESS if the policy is enabled:
>>>
>>> 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
>>>
>>> Similarly, mprotect() would return -EACCESS below:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
>>>
>>> With the past vma writeable permission tracking, mprotect() below would
>>> also fail with -EACCESS:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_WRITE, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_EXEC);
>>>
>>> While the above could be achieved by checking PROT_WRITE & PROT_EXEC on
>>> mmap/mprotect and denying mprotect(PROT_EXEC) altogether (current
>>> systemd MDWE approach via SECCOMP BPF filters), we want the following
>>> scenario to succeed:
>>>
>>> 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
>>> 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
>>>
>>> where PROT_BTI enables branch tracking identification on arm64.
>>>
>>> The choice for a DENY_WRITE_EXEC personality flag, inherited on fork()
>>> and execve(), was made by analogy to READ_IMPLIES_EXEC.
>>>
>>> Note that it is sufficient to check for VM_WAS_WRITE in
>>> map_deny_write_exec() as this flag is always set on VM_WRITE mappings.
>>>
>>> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Christoph Hellwig <hch@infradead.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>
>> How does this interact with get_user_pages(FOLL_WRITE|FOLL_FORCE) on a
>> VMA that is VM_MAYWRITE but not VM_WRITE? Is it handled accordingly?
> 
> For now, that's just about VM_WRITE. Most vmas are VM_MAYWRITE, so we
> can't really have MAYWRITE^EXEC. The basic feature aims to avoid user
> vulnerabilities where a buffer is mapped both writeable and executable.
> Of course, it can be expanded with additional prctl() flags to cover
> other cases.
> 
>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
>> actually map the PTE writable, but set it dirty. GUP will retry, find a
>> R/O pte that is dirty and where it knows that it broke COW and will
>> allow the read access, although the PTE is R/O.
>>
>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
>> kernel sections, but it's used elsewhere for page pinning as well.
>>
>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
>> now to bypass that mechanism, I might be wrong.
> 
> GUP can be used to bypass this. But if an attacker can trigger such GUP
> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
> checks on those paths (and reject the syscall) rather than on
> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
> 
> 

I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
unprivileged users.

-- 
Thanks,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
  2022-04-22 11:04         ` David Hildenbrand
@ 2022-04-22 13:12           ` Catalin Marinas
  -1 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-22 13:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote:
> On 22.04.22 12:28, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
> >> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
> >> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
> >> actually map the PTE writable, but set it dirty. GUP will retry, find a
> >> R/O pte that is dirty and where it knows that it broke COW and will
> >> allow the read access, although the PTE is R/O.
> >>
> >> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
> >> kernel sections, but it's used elsewhere for page pinning as well.
> >>
> >> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
> >> now to bypass that mechanism, I might be wrong.
> > 
> > GUP can be used to bypass this. But if an attacker can trigger such GUP
> > paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
> > checks on those paths (and reject the syscall) rather than on
> > mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
> 
> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
> unprivileged users.

Ah, do they really need this? At a quick search, ib_umem_get() for
example:

	unsigned int gup_flags = FOLL_WRITE;
	...
	if (!umem->writable)
		gup_flags |= FOLL_FORCE;

I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE
if VM_EXEC.

-- 
Catalin

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
@ 2022-04-22 13:12           ` Catalin Marinas
  0 siblings, 0 replies; 50+ messages in thread
From: Catalin Marinas @ 2022-04-22 13:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote:
> On 22.04.22 12:28, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
> >> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
> >> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
> >> actually map the PTE writable, but set it dirty. GUP will retry, find a
> >> R/O pte that is dirty and where it knows that it broke COW and will
> >> allow the read access, although the PTE is R/O.
> >>
> >> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
> >> kernel sections, but it's used elsewhere for page pinning as well.
> >>
> >> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
> >> now to bypass that mechanism, I might be wrong.
> > 
> > GUP can be used to bypass this. But if an attacker can trigger such GUP
> > paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
> > checks on those paths (and reject the syscall) rather than on
> > mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
> 
> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
> unprivileged users.

Ah, do they really need this? At a quick search, ib_umem_get() for
example:

	unsigned int gup_flags = FOLL_WRITE;
	...
	if (!umem->writable)
		gup_flags |= FOLL_FORCE;

I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE
if VM_EXEC.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
  2022-04-22 13:12           ` Catalin Marinas
@ 2022-04-22 17:41             ` David Hildenbrand
  -1 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2022-04-22 17:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On 22.04.22 15:12, Catalin Marinas wrote:
> On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote:
>> On 22.04.22 12:28, Catalin Marinas wrote:
>>> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
>>>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
>>>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
>>>> actually map the PTE writable, but set it dirty. GUP will retry, find a
>>>> R/O pte that is dirty and where it knows that it broke COW and will
>>>> allow the read access, although the PTE is R/O.
>>>>
>>>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
>>>> kernel sections, but it's used elsewhere for page pinning as well.
>>>>
>>>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
>>>> now to bypass that mechanism, I might be wrong.
>>>
>>> GUP can be used to bypass this. But if an attacker can trigger such GUP
>>> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
>>> checks on those paths (and reject the syscall) rather than on
>>> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
>>
>> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
>> unprivileged users.
> 
> Ah, do they really need this? At a quick search, ib_umem_get() for
> example:
> 
> 	unsigned int gup_flags = FOLL_WRITE;
> 	...
> 	if (!umem->writable)
> 		gup_flags |= FOLL_FORCE;
> 
> I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE
> if VM_EXEC.
> 

It's, for example, required when you have a MAP_PRIVATE but PROT_READ
mapping and want to take a reliable R/O (!) pin. Without
FOLL_FORCE|FOLL_WRITE you'd be pinning a (shared zeropage, pagecache)
page that will get replaced by an anonymous page in the COW handler,
after mprotect(PROT_READ|PROT_WRITE) followed by a write access. That
was an issue for RDMA in the past, that's why we have that handling in
place IIRC.

Yes, it's ugly.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag
@ 2022-04-22 17:41             ` David Hildenbrand
  0 siblings, 0 replies; 50+ messages in thread
From: David Hildenbrand @ 2022-04-22 17:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Kees Cook, Szabolcs Nagy, Mark Brown,
	Jeremy Linton, Topi Miettinen, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel

On 22.04.22 15:12, Catalin Marinas wrote:
> On Fri, Apr 22, 2022 at 01:04:31PM +0200, David Hildenbrand wrote:
>> On 22.04.22 12:28, Catalin Marinas wrote:
>>> On Thu, Apr 21, 2022 at 06:37:49PM +0100, David Hildenbrand wrote:
>>>> Note that in the (FOLL_WRITE|FOLL_FORCE) we only require VM_MAYWRITE on
>>>> the vma and trigger a write fault. As the VMA is not VM_WRITE, we won't
>>>> actually map the PTE writable, but set it dirty. GUP will retry, find a
>>>> R/O pte that is dirty and where it knows that it broke COW and will
>>>> allow the read access, although the PTE is R/O.
>>>>
>>>> That mechanism is required to e.g., set breakpoints in R/O MAP_PRIVATE
>>>> kernel sections, but it's used elsewhere for page pinning as well.
>>>>
>>>> My gut feeling is that GUP(FOLL_WRITE|FOLL_FORCE) could be used right
>>>> now to bypass that mechanism, I might be wrong.
>>>
>>> GUP can be used to bypass this. But if an attacker can trigger such GUP
>>> paths via a syscall (e.g. ptrace(PTRACE_POKEDATA)), I think we need the
>>> checks on those paths (and reject the syscall) rather than on
>>> mmap/mprotect(). This would be covered by something like CAP_SYS_PTRACE.
>>
>> I was told that RDMA uses FOLL_FORCE|FOLL_WRITE and is available to
>> unprivileged users.
> 
> Ah, do they really need this? At a quick search, ib_umem_get() for
> example:
> 
> 	unsigned int gup_flags = FOLL_WRITE;
> 	...
> 	if (!umem->writable)
> 		gup_flags |= FOLL_FORCE;
> 
> I guess with a new MDWE flag we can make the GUP code ignore FOLL_FORCE
> if VM_EXEC.
> 

It's, for example, required when you have a MAP_PRIVATE but PROT_READ
mapping and want to take a reliable R/O (!) pin. Without
FOLL_FORCE|FOLL_WRITE you'd be pinning a (shared zeropage, pagecache)
page that will get replaced by an anonymous page in the COW handler,
after mprotect(PROT_READ|PROT_WRITE) followed by a write access. That
was an issue for RDMA in the past, that's why we have that handling in
place IIRC.

Yes, it's ugly.

-- 
Thanks,

David / dhildenb


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2022-04-22 18:00 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-13 13:49 [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) Catalin Marinas
2022-04-13 13:49 ` Catalin Marinas
2022-04-13 13:49 ` [PATCH RFC 1/4] mm: Track previously writeable vma permission Catalin Marinas
2022-04-13 13:49   ` Catalin Marinas
2022-04-13 13:49 ` [PATCH RFC 2/4] mm, personality: Implement memory-deny-write-execute as a personality flag Catalin Marinas
2022-04-13 13:49   ` Catalin Marinas
2022-04-21 17:37   ` David Hildenbrand
2022-04-21 17:37     ` David Hildenbrand
2022-04-22 10:28     ` Catalin Marinas
2022-04-22 10:28       ` Catalin Marinas
2022-04-22 11:04       ` David Hildenbrand
2022-04-22 11:04         ` David Hildenbrand
2022-04-22 13:12         ` Catalin Marinas
2022-04-22 13:12           ` Catalin Marinas
2022-04-22 17:41           ` David Hildenbrand
2022-04-22 17:41             ` David Hildenbrand
2022-04-13 13:49 ` [PATCH RFC 3/4] fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC " Catalin Marinas
2022-04-13 13:49   ` Catalin Marinas
2022-04-13 13:49 ` [PATCH RFC 4/4] arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC Catalin Marinas
2022-04-13 13:49   ` Catalin Marinas
2022-04-13 18:39 ` [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) Topi Miettinen
2022-04-13 18:39   ` Topi Miettinen
2022-04-14 13:49   ` Catalin Marinas
2022-04-14 13:49     ` Catalin Marinas
2022-04-14 18:52 ` Kees Cook
2022-04-14 18:52   ` Kees Cook
2022-04-15 20:01   ` Topi Miettinen
2022-04-15 20:01     ` Topi Miettinen
2022-04-20 13:01   ` Catalin Marinas
2022-04-20 13:01     ` Catalin Marinas
2022-04-20 17:44     ` Kees Cook
2022-04-20 17:44       ` Kees Cook
2022-04-20 19:34     ` Topi Miettinen
2022-04-20 19:34       ` Topi Miettinen
2022-04-20 23:21       ` Kees Cook
2022-04-20 23:21         ` Kees Cook
2022-04-21 15:35         ` Catalin Marinas
2022-04-21 15:35           ` Catalin Marinas
2022-04-21 16:42           ` Kees Cook
2022-04-21 16:42             ` Kees Cook
2022-04-21 17:24             ` Catalin Marinas
2022-04-21 17:24               ` Catalin Marinas
2022-04-21 17:41               ` Kees Cook
2022-04-21 17:41                 ` Kees Cook
2022-04-21 18:33                 ` Catalin Marinas
2022-04-21 18:33                   ` Catalin Marinas
2022-04-21 16:48           ` Topi Miettinen
2022-04-21 16:48             ` Topi Miettinen
2022-04-21 17:28             ` Catalin Marinas
2022-04-21 17:28               ` Catalin Marinas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.