All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] exitz syscall
@ 2023-11-11 12:51 York Jasper Niebuhr
  2023-11-11 13:24 ` Willy Tarreau
  0 siblings, 1 reply; 11+ messages in thread
From: York Jasper Niebuhr @ 2023-11-11 12:51 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-api, linux-security-module, torvalds,
	York Jasper Niebuhr

Adds a system call to flag a process' resources to be cleared on
exit (or, in the case of memory, on free). Currently, only zeroing
memory is implemented. This system call is meant to act as an
alternative to CONFIG_INIT_ON_FREE_DEFAULT_ON that does not generally
impact the entire system's performance, but is restricted to occasional
applications for individual processes that require the extra security.

This system call could in the future be extended to clear other
resources such as internal buffers, e.g. of sockets. However, its main
focus for now is process memory. It is supposed to protect systems
against memory forensics attacks that extract system memory (e.g. cold
boot attack) after programs that work with confidential data or keys
already exited. Currently, a process' memory could still be entirely
read until the according pages were reassigned to another process, as
page initialization is by default performed on alloc, not on free. The
configuration CONFIG_INIT_ON_FREE_DEFAULT_ON enables initialization on
free for the entire system and can result in performance deductions,
which are in many cases unnecessary. With sys_exitz, this performance
hit can be restricted to just the processes that actually require this
feature.

To keep track of any memory that contains keys or sensitive data is
incredibly tedious. Especially, when using cryptographic libraries,
developers often don't even know if the memory will be cleared after
certain operations. This system call could be invoked once at the start
of a program and the problem would be dealt with, including crashes or
other unexpected terminations that might otherwise prevent user-space
routines from zeroing memory.

To get proper security, sys_exitz should be used in combination with
mlock.

Signed-off-by: York Jasper Niebuhr <yjnworkstation@gmail.com>

---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/exitz.h                  |  27 ++++++
 include/linux/sched.h                  |   4 +
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 kernel/Makefile                        |   2 +
 kernel/exit.c                          |   5 ++
 kernel/exitz.c                         | 119 +++++++++++++++++++++++++
 kernel/sys_ni.c                        |   3 +
 mm/mmap.c                              |  12 +++
 security/Kconfig                       |   9 ++
 12 files changed, 187 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/exitz.h
 create mode 100644 kernel/exitz.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c8fac5205803..8be9d1471b5c 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -461,3 +461,4 @@
 454	i386	futex_wake		sys_futex_wake
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
+457	i386	exitz			sys_exitz
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..e6aeca443a88 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	exitz			sys_exitz
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/exitz.h b/include/linux/exitz.h
new file mode 100644
index 000000000000..b1a5ad194839
--- /dev/null
+++ b/include/linux/exitz.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifdef CONFIG_EXITZ_SYSCALL
+
+/*
+ * Zero resource on exit flags
+ */
+#define EZ_NONE			0x00000000
+#define EZ_MEM                  0x00000001      /* Memory pages are cleared on exit */
+#define EZ_FLAGS (EZ_MEM)
+
+/*
+ * Overwrite current process memory range with zeros (end excluded).
+ */
+int memz_range(unsigned long start, unsigned long end);
+
+/*
+ * Overwrite all flagged resources with zeros.
+ */
+void exit_z(void);
+
+/*
+ * Set task_struct flags to zero flagged resources on exit.
+ */
+void do_exitz(int flags);
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..cbe8c198f28e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -766,6 +766,10 @@ struct task_struct {
 	refcount_t			usage;
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
+#ifdef CONFIG_EXITZ_SYSCALL
+	/* Zero resource on exit flags (EZ_*). */
+	unsigned int			ezflags;
+#endif
 	unsigned int			ptrace;
 
 #ifdef CONFIG_SMP
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..8c29b9ea3677 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -949,6 +949,7 @@ asmlinkage long sys_cachestat(unsigned int fd,
 		struct cachestat_range __user *cstat_range,
 		struct cachestat __user *cstat, unsigned int flags);
 asmlinkage long sys_map_shadow_stack(unsigned long addr, unsigned long size, unsigned int flags);
+asmlinkage long sys_exitz(int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..782222ffa0d7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_exitz 457
+__SYSCALL(__NR_exitz, sys_exitz)
 
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..17602af88adc 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -158,3 +158,5 @@ $(obj)/kheaders_data.tar.xz: FORCE
 	$(call cmd,genikh)
 
 clean-files := kheaders_data.tar.xz kheaders.md5
+
+obj-$(CONFIG_EXITZ_SYSCALL) += exitz.o
diff --git a/kernel/exit.c b/kernel/exit.c
index ee9f43bed49a..35469decd9e9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -69,6 +69,7 @@
 #include <linux/rethook.h>
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
+#include <linux/exitz.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -808,6 +809,10 @@ static void synchronize_group_exit(struct task_struct *tsk, long code)
 
 void __noreturn do_exit(long code)
 {
+#ifdef CONFIG_EXITZ_SYSCALL
+	exit_z();
+#endif
+
 	struct task_struct *tsk = current;
 	int group_dead;
 
diff --git a/kernel/exitz.c b/kernel/exitz.c
new file mode 100644
index 000000000000..33a0b16f93a9
--- /dev/null
+++ b/kernel/exitz.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/exitz.h>
+#include <linux/syscalls.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/mm_types.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/string.h>
+
+#define EZ_MAX_PAGES_ARRAY_COUNT 16
+#define EZ_MAX_KMALLOC_PAGES (PAGE_SIZE * 2)
+#define EZ_MAX_PAGES_PER_LOOP (EZ_MAX_KMALLOC_PAGES / sizeof(struct page *))
+
+/*
+ * Overwrite a range of process memory with zeros (end excluded).
+ */
+int memz_range(unsigned long start, unsigned long end)
+{
+	if (end <= start)
+		return 0;
+
+	unsigned long nr_pages = (end - 1) / PAGE_SIZE - start / PAGE_SIZE + 1;
+
+	struct page *pages_stack[EZ_MAX_PAGES_ARRAY_COUNT];
+	struct page **pages = pages_stack;
+
+	if (nr_pages > EZ_MAX_PAGES_ARRAY_COUNT) {
+		/* For reliability, cap kmalloc size */
+		pages = kmalloc(min_t(size_t, EZ_MAX_KMALLOC_PAGES,
+					sizeof(struct page *) * nr_pages),
+				GFP_KERNEL);
+
+		if (!pages)
+			return -ENOMEM;
+	}
+
+	unsigned long page_address = start & PAGE_MASK;
+
+	while (nr_pages) {
+		long pinned_pages = min(nr_pages, EZ_MAX_PAGES_PER_LOOP);
+
+		pinned_pages = pin_user_pages(page_address, pinned_pages, FOLL_WRITE, pages);
+
+		if (pinned_pages <= 0)
+			return -EFAULT;
+
+		/* Map and zero each page */
+		for (long i = 0; i < pinned_pages; i++) {
+			void *kaddr = kmap_local_page(pages[i]);
+			unsigned long page_offset = 0;
+
+			if (page_address < start)
+				page_offset = min_t(unsigned long, start - page_address, PAGE_SIZE);
+
+			unsigned long page_part =
+				min_t(unsigned long, PAGE_SIZE, end - page_address) - page_offset;
+
+			memset(kaddr + page_offset, 0, page_part);
+
+			kunmap_local(kaddr);
+			page_address += PAGE_SIZE;
+		}
+
+		nr_pages -= pinned_pages;
+
+		unpin_user_pages_dirty_lock(pages, pinned_pages, 1);
+	}
+
+	if (pages != pages_stack)
+		kfree(pages);
+
+	return 0;
+}
+
+/*
+ * Overwrite any memory associated to current process with zeros.
+ */
+static void exit_memz(void)
+{
+	if (!(current->ezflags & EZ_MEM))
+		return;
+
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, current->mm, 0);
+
+	for_each_vma(vmi, vma) {
+		memz_range(vma->vm_start, vma->vm_end);
+	}
+}
+
+/*
+ * Overwrite all flagged resources with zeros.
+ */
+void exit_z(void)
+{
+	exit_memz();
+}
+
+/*
+ * Set task_struct flags to zero flagged resources on exit.
+ */
+void do_exitz(int flags)
+{
+	current->ezflags = flags;
+}
+
+#ifdef CONFIG_EXITZ_SYSCALL
+SYSCALL_DEFINE1(exitz, int, flags)
+{
+	if (flags & ~EZ_FLAGS)
+		return -EINVAL;
+
+	do_exitz(flags);
+	return 0;
+}
+#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e1a6e3c675c0..ff5468f1d2f2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -254,6 +254,9 @@ COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* exitz */
+COND_SYSCALL(exitz);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 4f1cb814586d..d66bd314aca9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -47,6 +47,7 @@
 #include <linux/oom.h>
 #include <linux/sched/mm.h>
 #include <linux/ksm.h>
+#include <linux/exitz.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -225,6 +226,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 
 	/* Always allow shrinking brk. */
 	if (brk <= mm->brk) {
+		/* Overwrite memory with zeros */
+#ifdef CONFIG_EXITZ_SYSCALL
+		if (current->ezflags & EZ_MEM)
+			memz_range(brk, mm->brk);
+#endif
+
 		/* Search one past newbrk */
 		vma_iter_init(&vmi, mm, newbrk);
 		brkvma = vma_find(&vmi, oldbrk);
@@ -3001,6 +3008,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 {
+#ifdef CONFIG_EXITZ_SYSCALL
+	if (current->ezflags & EZ_MEM)
+		memz_range(start, start + len);
+#endif
+
 	int ret;
 	struct mm_struct *mm = current->mm;
 	LIST_HEAD(uf);
diff --git a/security/Kconfig b/security/Kconfig
index 52c9af08ad35..3509bb5fb2f4 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -249,5 +249,14 @@ config LSM
 
 source "security/Kconfig.hardening"
 
+config EXITZ_SYSCALL
+	bool "Exitz syscall" if EXPERT
+	default y
+	help
+	  sys_exitz is a system call to flag a process' resources to be erased
+	  on exit. It can be used to harden the system against memory forensics
+	  attacks after a process has finished. It is meant to be a more fine
+	  grained alternative to CONFIG_INIT_ON_FREE_DEFAULT_ON.
+
 endmenu
 

base-commit: 8728c14129df7a6e29188a2e737b4774fb200953
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-11 12:51 [PATCH] exitz syscall York Jasper Niebuhr
@ 2023-11-11 13:24 ` Willy Tarreau
  2023-11-12  1:24   ` Linus Torvalds
  2023-11-12  4:52   ` Theodore Ts'o
  0 siblings, 2 replies; 11+ messages in thread
From: Willy Tarreau @ 2023-11-11 13:24 UTC (permalink / raw)
  To: York Jasper Niebuhr
  Cc: akpm, linux-kernel, linux-api, linux-security-module, torvalds

Hello,

On Sat, Nov 11, 2023 at 01:51:26PM +0100, York Jasper Niebuhr wrote:
> Adds a system call to flag a process' resources to be cleared on
> exit (or, in the case of memory, on free). Currently, only zeroing
> memory is implemented.
(...)

IMHO it does not make sense to add a syscall for this, please have a
look at prctl(2) instead, which is already used for similar settings.

Regards,
Willy

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-11 13:24 ` Willy Tarreau
@ 2023-11-12  1:24   ` Linus Torvalds
  2023-11-12 10:03     ` Jasper Niebuhr
  2023-11-13  0:08     ` Andy Lutomirski
  2023-11-12  4:52   ` Theodore Ts'o
  1 sibling, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2023-11-12  1:24 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: York Jasper Niebuhr, akpm, linux-kernel, linux-api,
	linux-security-module

On Sat, 11 Nov 2023 at 05:24, Willy Tarreau <w@1wt.eu> wrote:
>
> IMHO it does not make sense to add a syscall for this, please have a
> look at prctl(2) instead, which is already used for similar settings.

Honestly, I don't think it makes any sense at all.

If the key manager people can't be bothered to keep track of their
keys, the kernel certainly shouldn't be bothered with this kind of
huge hammer.

It looks like an active DoS attack to me, by anybody who just creates
a huge process and then sits there giggling as the machine comes to a
complete halt, with the kernel busy zeroing pointless crap.

Do it in user space. And if your user space randomly crashes, you have
other problems - but you can try to use ptrace to catch even that case
if you care.

          Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-11 13:24 ` Willy Tarreau
  2023-11-12  1:24   ` Linus Torvalds
@ 2023-11-12  4:52   ` Theodore Ts'o
  2023-11-13  7:37     ` Vlastimil Babka
  1 sibling, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2023-11-12  4:52 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: York Jasper Niebuhr, akpm, linux-kernel, linux-api,
	linux-security-module, torvalds

On Sat, Nov 11, 2023 at 02:24:31PM +0100, Willy Tarreau wrote:
> Hello,
> 
> On Sat, Nov 11, 2023 at 01:51:26PM +0100, York Jasper Niebuhr wrote:
> > Adds a system call to flag a process' resources to be cleared on
> > exit (or, in the case of memory, on free). Currently, only zeroing
> > memory is implemented.
> (...)
> 
> IMHO it does not make sense to add a syscall for this, please have a
> look at prctl(2) instead, which is already used for similar settings.

Another reason to use prctl() is there are other cases when you'd want
to zero a process's memory.  For example, if the process gets killed
to some kind of signal, or when it gets OOM killed (where there is no
system call which forces the process to exit).  Also, if you want to
zero memory when the process exits, you'd want to zero the process
memory on an exec(2).

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-12  1:24   ` Linus Torvalds
@ 2023-11-12 10:03     ` Jasper Niebuhr
  2023-11-12 15:44       ` Theodore Ts'o
  2023-11-13  0:08     ` Andy Lutomirski
  1 sibling, 1 reply; 11+ messages in thread
From: Jasper Niebuhr @ 2023-11-12 10:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, akpm, linux-kernel, linux-security-module, tytso

On Sun, Nov 12, 2023 at 2:24 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 11 Nov 2023 at 05:24, Willy Tarreau <w@1wt.eu> wrote:
> >
> > IMHO it does not make sense to add a syscall for this, please have a
> > look at prctl(2) instead, which is already used for similar settings.

First of all, I had a look and now absolutely agree with this.

> Honestly, I don't think it makes any sense at all.
>
> If the key manager people can't be bothered to keep track of their
> keys, the kernel certainly shouldn't be bothered with this kind of
> huge hammer.
>
> It looks like an active DoS attack to me, by anybody who just creates
> a huge process and then sits there giggling as the machine comes to a
> complete halt, with the kernel busy zeroing pointless crap.

Fair point.

Do you think it could make sense to enable zeroing on exit only for
ranges of memory (base + len) with a config dictating the max amount
of bytes (or pages or whatever) that a single process is allowed to
flag for zeroing? This way the DoS issue is effectively solved and the
config could always be set to 0 (by default I guess), which would
effectively leave any common system unchanged. The users that actually
want to use this feature could then decide on their own how much it is
worth to them and how big of a task they want the kernel to perform at
max on exit.

> Do it in user space. And if your user space randomly crashes, you have
> other problems - but you can try to use ptrace to catch even that case
> if you care.
>
>           Linus

Unfortunately, ptrace is something many, including me, disable on
systems that are meant to be as secure as possible.

Kind Regards,
Jasper

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-12 10:03     ` Jasper Niebuhr
@ 2023-11-12 15:44       ` Theodore Ts'o
  2023-11-12 18:50         ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2023-11-12 15:44 UTC (permalink / raw)
  To: Jasper Niebuhr
  Cc: Linus Torvalds, Willy Tarreau, akpm, linux-kernel, linux-security-module

On Sun, Nov 12, 2023 at 11:03:43AM +0100, Jasper Niebuhr wrote:
> 
> Do you think it could make sense to enable zeroing on exit only for
> ranges of memory (base + len) with a config dictating the max amount
> of bytes (or pages or whatever) that a single process is allowed to
> flag for zeroing? This way the DoS issue is effectively solved and the
> config could always be set to 0 (by default I guess), which would
> effectively leave any common system unchanged. The users that actually
> want to use this feature could then decide on their own how much it is
> worth to them and how big of a task they want the kernel to perform at
> max on exit.

How about adding a flag MLOCK_ZERO_ON_FREE used by the mlock2() system
call?  The number of pages which an unprivileged process can lock is
already capped via RLIMIT_MEMLOCK (or else mlock would be it own
denial of service attack).  That way if process dies from crash, the
keys would be zero'ed.

The argument against this is that this functionality could be an
attractive nuisance, since best practice is to zero any keying
information the moment you no longer need it, instead of depending on
magic OS behavior to take care of something that you darned well
should be doing yourself.  I understand that in your case, you're
trying to protect a key manager, whose *job* it is to hold onto keys
for long periods of time.  So this advice might not be as applicable
in your case.

I do suspect that the chances that trying to grab a memory page after
it has been released is one of the least likely way keys would be
exfiltrated.  I'd be much more worried about stack/buffer overflow
attacks in any of the libraries used by the key manager, for example.
Or zero-day attacks on the kernel which then grab the keys from the
key manager process while it is still running....

Cheers,

					- Ted


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-12 15:44       ` Theodore Ts'o
@ 2023-11-12 18:50         ` Linus Torvalds
  2023-11-12 19:03           ` Theodore Ts'o
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2023-11-12 18:50 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jasper Niebuhr, Willy Tarreau, akpm, linux-kernel, linux-security-module

On Sun, 12 Nov 2023 at 07:44, Theodore Ts'o <tytso@mit.edu> wrote:
>
> How about adding a flag MLOCK_ZERO_ON_FREE used by the mlock2() system
> call?  The number of pages which an unprivileged process can lock is
> already capped via RLIMIT_MEMLOCK (or else mlock would be it own
> denial of service attack).  That way if process dies from crash, the
> keys would be zero'ed.

Yes, that is a lot better as an interface.

However, it still needs to also make sure that the memory in question
is not file-backed etc. Which the patch I saw didn't seem to do
either.

End result: as it was, that exitz patch was not helping security. It
was anything but.

              Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-12 18:50         ` Linus Torvalds
@ 2023-11-12 19:03           ` Theodore Ts'o
  0 siblings, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2023-11-12 19:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jasper Niebuhr, Willy Tarreau, akpm, linux-kernel, linux-security-module

On Sun, Nov 12, 2023 at 10:50:10AM -0800, Linus Torvalds wrote:
> 
> However, it still needs to also make sure that the memory in question
> is not file-backed etc. Which the patch I saw didn't seem to do
> either.

Well, yes.  If the program isn't using a small amount of anonymous
memory, which is also mlock'ed so it doesn't get written to swap, the
rest of it is a total waste of time.

And from what I've seen from the O_PONIES debate (e.g., users
truncating files and rewriting them, and then complaining when the
top-ten score file disappears after the system crashes when they close
the OpenGL connection to the proprietary kernel driver), my basic
assumption is that anything application writers will get wrong, they
probably will....

     	      		  	       - Ted

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-12  1:24   ` Linus Torvalds
  2023-11-12 10:03     ` Jasper Niebuhr
@ 2023-11-13  0:08     ` Andy Lutomirski
  1 sibling, 0 replies; 11+ messages in thread
From: Andy Lutomirski @ 2023-11-13  0:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, York Jasper Niebuhr, akpm, linux-kernel,
	linux-api, linux-security-module

> On Nov 11, 2023, at 5:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, 11 Nov 2023 at 05:24, Willy Tarreau <w@1wt.eu> wrote:
>> IMHO it does not make sense to add a syscall for this, please have a
>> look at prctl(2) instead, which is already used for similar settings.
>
> Honestly, I don't think it makes any sense at all.
>
> If the key manager people can't be bothered to keep track of their
> keys, the kernel certainly shouldn't be bothered with this kind of
> huge hammer.
>
> It looks like an active DoS attack to me, by anybody who just creates
> a huge process and then sits there giggling as the machine comes to a
> complete halt, with the kernel busy zeroing pointless crap.

The implementation in this patch is … bad.  But that aside, the whole
concept seems wrong to me: zeroing memory specifically when an mm
exits seems rather bizarre -- it's the wrong condition. From a
hardening perspective, there are really three concerning cases, IMO:

1. Something sensitive is in memory, and some bug (side channel or
straight up kernel bug) allows an attacker to read it.  Zeroing early
shortens the window but doesn't actually prevent the attack.

2. Something sensitive is in memory, and some bug allocates the memory
before it's freed.  (I.e. allocator state gets corrupted.)  Like #1,
all we can do is shorten the window.  But these kinds of bugs are
quite rare.

3. Something sensitive is in memory, it gets freed, it gets reused
without __GFP_ZERO, and it gets leaked.  This actually seems fairly
plausible.  While __GFP_ZERO is fairly common, there are tons of paths
that don't use it, and bugs happen.

We do have:

commit 6471384af2a6530696fc0203bafe4de41a23c9ef
Author: Alexander Potapenko <glider@google.com>
Date:   Thu Jul 11 20:59:19 2019 -0700

    mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options


And I can easily imagine a useful API to flag some memory as being
sensitive such that either the kernel will always zero it when freed
or will make sure it's zeroed before being reallocated even if
__GFP_ZERO is not set and init_on_alloc and init_on_free are both
zero.

This would be a rather different patch that exitz().

--Andy


>
> Do it in user space. And if your user space randomly crashes, you have
> other problems - but you can try to use ptrace to catch even that case
> if you care.
>
>        Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-12  4:52   ` Theodore Ts'o
@ 2023-11-13  7:37     ` Vlastimil Babka
  2023-11-19 14:54       ` Jasper Niebuhr
  0 siblings, 1 reply; 11+ messages in thread
From: Vlastimil Babka @ 2023-11-13  7:37 UTC (permalink / raw)
  To: Theodore Ts'o, Willy Tarreau
  Cc: York Jasper Niebuhr, akpm, linux-kernel, linux-api,
	linux-security-module, torvalds

On 11/12/23 05:52, Theodore Ts'o wrote:
> On Sat, Nov 11, 2023 at 02:24:31PM +0100, Willy Tarreau wrote:
>> Hello,
>> 
>> On Sat, Nov 11, 2023 at 01:51:26PM +0100, York Jasper Niebuhr wrote:
>> > Adds a system call to flag a process' resources to be cleared on
>> > exit (or, in the case of memory, on free). Currently, only zeroing
>> > memory is implemented.
>> (...)
>> 
>> IMHO it does not make sense to add a syscall for this, please have a
>> look at prctl(2) instead, which is already used for similar settings.
> 
> Another reason to use prctl() is there are other cases when you'd want
> to zero a process's memory.  For example, if the process gets killed
> to some kind of signal, or when it gets OOM killed (where there is no
> system call which forces the process to exit).  Also, if you want to
> zero memory when the process exits, you'd want to zero the process
> memory on an exec(2).

Probably also munmap() and maybe a number of other ways where the process
can give up its memory voluntarily. Then there are also involuntary ways
where the a copy of the data can end up leaking elsewhere than the pages the
process has mapped - e.g. swapout/swapin of pages, page migration...

So I'm not sure it's feasible to attempt making a whole process "sensitive"
and close all the holes. Instead what we have is to mark specific areas as
sensitive - things like mlock(), madvise(MADV_DONTDUMP / MADV_DONTFORK) and
ultimately memfd_secret().

> Cheers,
> 
> 						- Ted
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] exitz syscall
  2023-11-13  7:37     ` Vlastimil Babka
@ 2023-11-19 14:54       ` Jasper Niebuhr
  0 siblings, 0 replies; 11+ messages in thread
From: Jasper Niebuhr @ 2023-11-19 14:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Theodore Ts'o, Willy Tarreau, akpm, linux-kernel, linux-api,
	linux-security-module, torvalds

I took some time to consider a few options...

1. MLOCK_ZERO_ON_FREE flag for mlock2:

This would achieve what I was looking for. On the other hand, this
feature could not be used to just ensure the memory is zeroed before
reallocation, without locking it to memory. So if you just want a few
regions to be protected from other processes, this would not be ideal.
Also the VM_* flags are all used otherwise (except for a random hole
in the middle).

2. PR_INIT_ON_FREE option for prctl + some cap against DoS:

This could, more generally, be used to "replace" other ways of
choosing initialization behavior. Systems could run with zeroing in
general disabled to improve performance and just use this feature
whenever needed. However, it seems counterintuitive to me to have a
prctl option to set properties of a range of memory. Is there a system
call to set general properties of memory areas?

3. CONFIG_MLOCK_INIT_ON_FREE:

Such a config could be used as an alternative to init_on_free (or its
DEFAULT_ON config) and would be limited to the much smaller amount of
mlocked memory. Again, this could not be used if you didn't want to
lock the pages to memory, but would definitely be one of the easiest
ways to avoid most of the init_on_free overhead with essentially the
same security.

4. PR_INIT_MLOCKED_ON_FREE option for prctl:

This would essentially be option 3. but even further limited to only
the processes that want it and cannot ensure keys are zeroed before an
exit/crash. This prctl option would take no further options except an
enable/disable switch. It could be called once, in the beginning, to
enable the feature. If the process then crashes, any mlocked memory is
cleared and does not make its way to another process. After any key
material has been erased, the program could call prctl again to
disable the feature so no clearing is done when the process exits.

Currently #1, #3 and #4 sound most applicable to me. Options #3 and #4
are probably a lot cleaner to implement, #1 and #4 should be more
performant. From your experience, how often would someone want to
seriously prevent memory from getting to another process without the
option to mlock it?

Is there any arguments I am missing? What's your opinion on these?
Which, if any, do you think would work best?

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-11-19 14:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-11 12:51 [PATCH] exitz syscall York Jasper Niebuhr
2023-11-11 13:24 ` Willy Tarreau
2023-11-12  1:24   ` Linus Torvalds
2023-11-12 10:03     ` Jasper Niebuhr
2023-11-12 15:44       ` Theodore Ts'o
2023-11-12 18:50         ` Linus Torvalds
2023-11-12 19:03           ` Theodore Ts'o
2023-11-13  0:08     ` Andy Lutomirski
2023-11-12  4:52   ` Theodore Ts'o
2023-11-13  7:37     ` Vlastimil Babka
2023-11-19 14:54       ` Jasper Niebuhr

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.