linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec
@ 2023-08-14  8:40 Aleksa Sarai
  2023-08-14  8:40 ` [PATCH v2 1/5] selftests: memfd: error out test process when child test fails Aleksa Sarai
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-14  8:40 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp
  Cc: Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest, Aleksa Sarai

The most critical issue with vm.memfd_noexec=2 (the fact that passing
MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
tree[2], but there are still some outstanding issues that need to be
addressed:

 * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
   because it will make it far too difficult to ever migrate. Instead it
   should imply MFD_NOEXEC_SEAL.

 * The dmesg warnings are pr_warn_once(), which on most systems means
   that they will be used up by systemd or some other boot process and
   userspace developers will never see them.

   - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
     rate-limited message to the kernel log is necessary to tell
     userspace that they should add the new flags.

     Arguably the most ideal way to deal with the spam concern[3,4]
     while still prompting userspace to switch to the new flags would be
     to only log the warning once per task or something similar.
     However, adding something to task_struct for tracking this would be
     needless bloat for a single pr_warn_ratelimited().

     So just switch to pr_info_ratelimited() to avoid spamming the log
     with something that isn't a real warning. There's lots of
     info-level stuff in dmesg, it seems really unlikely that this
     should be an actual problem. Most programs are already switching to
     the new flags anyway.

   - For the vm.memfd_noexec=2 case, we need to log a warning for every
     failure because otherwise userspace will have no idea why their
     previously working program started returning -EACCES (previously
     -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.

 * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
   unappealing for most users to enable the sysctl because enabling it
   on &init_pid_ns means you need a system reboot to unset it. Given the
   actual security threat being protected against, CAP_SYS_ADMIN users
   being restricted in this way makes little sense.

   The argument for this ratcheting by the original author was that it
   allows you to have a hierarchical setting that cannot be unset by
   child pidnses, but this is not accurate -- changing the parent
   pidns's vm.memfd_noexec setting to be more restrictive didn't affect
   children.

   Instead, switch the vm.memfd_noexec sysctl to be properly
   hierarchical and allow CAP_SYS_ADMIN users (in the pidns's owning
   userns) to lower the setting as long as it is not lower than the
   parent's effective setting. This change also makes it so that
   changing a parent pidns's vm.memfd_noexec will affect all
   descendants, providing a properly hierarchical setting. The
   performance impact of this is incredibly minimal since the maximum
   depth of pidns is 32 and it is only checked during memfd_create(2)
   and unshare(CLONE_NEWPID).

 * The memfd selftests would not exit with a non-zero error code when
   certain tests that ran in a forked process (specifically the ones
   related to MFD_EXEC and MFD_NOEXEC_SEAL) failed.
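
For userspace, the migration that the logging changes above are meant to
prompt is small. A minimal sketch, assuming a program that never needs to
execute its memfd; create_noexec_memfd() is a hypothetical helper name, and
the fallback #define is only there for libc headers that predate the flag:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallback for older libc headers; the value matches the kernel uAPI. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/*
 * Hypothetical helper: create a memfd that is never meant to be executed.
 * Kernels that predate the flag reject it with EINVAL, so retry without it.
 */
static int create_noexec_memfd(const char *name)
{
	int fd;

	assert(name != NULL);

	fd = memfd_create(name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);
	if (fd < 0 && errno == EINVAL)
		fd = memfd_create(name, MFD_CLOEXEC);
	return fd;
}
```

A caller that passes an explicit flag like this no longer triggers the
"called without MFD_EXEC or MFD_NOEXEC_SEAL" message at all.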

[1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/
[2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v2:
- Make vm.memfd_noexec restrictions properly hierarchical.
- Allow vm.memfd_noexec setting to be lowered by CAP_SYS_ADMIN as long
  as it is not lower than the parent's effective setting.
- Fix the logging behaviour related to the new flags and
  vm.memfd_noexec=2.
- Add more thorough tests for vm.memfd_noexec in selftests.
- v1: <https://lore.kernel.org/r/20230713143406.14342-1-cyphar@cyphar.com>

---
Aleksa Sarai (5):
      selftests: memfd: error out test process when child test fails
      memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
      memfd: improve userspace warnings for missing exec-related flags
      memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
      selftests: improve vm.memfd_noexec sysctl tests

 include/linux/pid_namespace.h              |  39 ++--
 kernel/pid.c                               |   3 +
 kernel/pid_namespace.c                     |   6 +-
 kernel/pid_sysctl.h                        |  28 ++-
 mm/memfd.c                                 |  33 ++-
 tools/testing/selftests/memfd/memfd_test.c | 332 +++++++++++++++++++++++------
 6 files changed, 322 insertions(+), 119 deletions(-)
---
base-commit: 3ff995246e801ea4de0a30860a1d8da4aeb538e7
change-id: 20230803-memfd-vm-noexec-uapi-fixes-ace725c67b0f

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>



* [PATCH v2 1/5] selftests: memfd: error out test process when child test fails
  2023-08-14  8:40 [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
@ 2023-08-14  8:40 ` Aleksa Sarai
  2023-08-14  8:40 ` [PATCH v2 2/5] memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2 Aleksa Sarai
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-14  8:40 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp
  Cc: Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest, Aleksa Sarai

Before this change, a test runner using this self test would see a
return code of 0 when the tests using a child process (namely the
MFD_NOEXEC_SEAL and MFD_EXEC tests) failed, masking test failures.
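
As a standalone illustration of the status check this patch adds, here is a
minimal sketch using fork() rather than the selftest's clone() wrapper. The
helper and child function names are hypothetical, and unlike the patch (which
calls abort() in the parent) the sketch reports failure through its return
value:

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical child bodies, used only for illustration. */
static int child_ok(void)     { return 0; }
static int child_fails(void)  { return 3; }
static int child_killed(void) { raise(SIGKILL); return 0; }

/*
 * Run fn() in a forked child.
 * Returns 0 only if the child exited normally with status 0, -1 otherwise.
 */
static int run_child(int (*fn)(void))
{
	int wstatus;
	pid_t pid;

	assert(fn != NULL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(fn());

	if (waitpid(pid, &wstatus, 0) < 0)
		return -1;
	if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0)
		return 0;
	return -1;	/* non-zero exit code or killed by a signal */
}
```

Passing NULL as the status pointer to waitpid(), as the old code did,
discards exactly the information checked here.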

Fixes: 11f75a01448f ("selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC")
Reviewed-by: Jeff Xu <jeffxu@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/memfd/memfd_test.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index dbdd9ec5e397..8eb49204f9ea 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1207,7 +1207,24 @@ static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *))
 
 static void join_newpid_thread(pid_t pid)
 {
-	waitpid(pid, NULL, 0);
+	int wstatus;
+
+	if (waitpid(pid, &wstatus, 0) < 0) {
+		printf("newpid thread: waitpid() failed: %m\n");
+		abort();
+	}
+
+	if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) {
+		printf("newpid thread: exited with non-zero error code %d\n",
+		       WEXITSTATUS(wstatus));
+		abort();
+	}
+
+	if (WIFSIGNALED(wstatus)) {
+		printf("newpid thread: killed by signal %d\n",
+		       WTERMSIG(wstatus));
+		abort();
+	}
 }
 
 /*

-- 
2.41.0



* [PATCH v2 2/5] memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
  2023-08-14  8:40 [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
  2023-08-14  8:40 ` [PATCH v2 1/5] selftests: memfd: error out test process when child test fails Aleksa Sarai
@ 2023-08-14  8:40 ` Aleksa Sarai
  2023-08-14  8:40 ` [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags Aleksa Sarai
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-14  8:40 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp
  Cc: Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest, Aleksa Sarai

Given the difficulty of auditing all of userspace to figure out whether
every memfd_create() user has switched to passing MFD_EXEC and
MFD_NOEXEC_SEAL flags, it seems far less disruptive to make it possible
for older programs that don't make use of executable memfds to run under
vm.memfd_noexec=2. Otherwise, a small dependency change can result in
spurious errors. For programs that don't use executable memfds, passing
MFD_NOEXEC_SEAL is functionally a no-op, so implying it for them is
harmless.
In addition, every failure under vm.memfd_noexec=2 needs to print to the
kernel log so that userspace can figure out where the error came from.
The concerns about pr_warn_ratelimited() spam that caused the switch to
pr_warn_once()[1,2] do not apply to the vm.memfd_noexec=2 case.

This is a user-visible API change, but as it allows programs to do
something that would be blocked before, and the sysctl itself was broken
and recently released, it seems unlikely this will cause any issues.
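
The resulting behaviour for each sysctl mode can be modelled in userspace
as follows. This is a sketch of the logic in check_sysctl_memfd_noexec()
after this patch, not the kernel code itself: the SKETCH_* and SCOPE_* names
are local stand-ins that mirror the uAPI flag values and sysctl modes, and
resolve_flags() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-ins mirroring the uAPI flag values. */
#define SKETCH_MFD_NOEXEC_SEAL 0x0008U
#define SKETCH_MFD_EXEC        0x0010U

/* Local stand-ins for the three vm.memfd_noexec modes. */
enum sketch_scope {
	SCOPE_EXEC = 0,			/* MFD_EXEC implied if unset */
	SCOPE_NOEXEC_SEAL = 1,		/* MFD_NOEXEC_SEAL implied if unset */
	SCOPE_NOEXEC_ENFORCED = 2,	/* same as 1, except MFD_EXEC rejected */
};

/* Fill in the implied flag, then apply the mode-2 restriction. */
static int resolve_flags(unsigned int *flags, int scope)
{
	assert(flags != NULL);

	if (!(*flags & (SKETCH_MFD_EXEC | SKETCH_MFD_NOEXEC_SEAL))) {
		if (scope >= SCOPE_NOEXEC_SEAL)
			*flags |= SKETCH_MFD_NOEXEC_SEAL;
		else
			*flags |= SKETCH_MFD_EXEC;
	}

	if (!(*flags & SKETCH_MFD_NOEXEC_SEAL) &&
	    scope >= SCOPE_NOEXEC_ENFORCED)
		return -EACCES;
	return 0;
}
```

With this ordering, an old-style call under mode 2 picks up the implied
MFD_NOEXEC_SEAL before the restriction is checked, so only an explicit
MFD_EXEC is rejected.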

[1]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/

Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org # v6.3+
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/pid_namespace.h              | 16 ++++------------
 mm/memfd.c                                 | 30 +++++++++++-------------------
 tools/testing/selftests/memfd/memfd_test.c | 22 +++++++++++++++++-----
 3 files changed, 32 insertions(+), 36 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c758809d5bcf..53974d79d98e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -17,18 +17,10 @@
 struct fs_pin;
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-/*
- * sysctl for vm.memfd_noexec
- * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- *	acts like MFD_EXEC was set.
- * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
- *	acts like MFD_NOEXEC_SEAL was set.
- * 2: memfd_create() without MFD_NOEXEC_SEAL will be
- *	rejected.
- */
-#define MEMFD_NOEXEC_SCOPE_EXEC			0
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL		1
-#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED	2
+/* modes for vm.memfd_noexec sysctl */
+#define MEMFD_NOEXEC_SCOPE_EXEC			0 /* MFD_EXEC implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL		1 /* MFD_NOEXEC_SEAL implied if unset */
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED	2 /* same as 1, except MFD_EXEC rejected */
 #endif
 
 struct pid_namespace {
diff --git a/mm/memfd.c b/mm/memfd.c
index 0bdbd2335af7..d65485c762de 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -271,30 +271,22 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
 #ifdef CONFIG_SYSCTL
-	char comm[TASK_COMM_LEN];
-	int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
-	struct pid_namespace *ns;
-
-	ns = task_active_pid_ns(current);
-	if (ns)
-		sysctl = ns->memfd_noexec_scope;
+	int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
 
 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
-		if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
+		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
 			*flags |= MFD_NOEXEC_SEAL;
 		else
 			*flags |= MFD_EXEC;
 	}
 
-	if (*flags & MFD_EXEC && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
-		pr_warn_once(
-			"memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n",
-			task_pid_nr(current), get_task_comm(comm, current));
-
+	if (!(*flags & MFD_NOEXEC_SEAL) && sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
+		pr_err_ratelimited(
+			"%s[%d]: memfd_create() requires MFD_NOEXEC_SEAL with vm.memfd_noexec=%d\n",
+			current->comm, task_pid_nr(current), sysctl);
 		return -EACCES;
 	}
 #endif
-
 	return 0;
 }
 
@@ -302,7 +294,6 @@ SYSCALL_DEFINE2(memfd_create,
 		const char __user *, uname,
 		unsigned int, flags)
 {
-	char comm[TASK_COMM_LEN];
 	unsigned int *file_seals;
 	struct file *file;
 	int fd, error;
@@ -325,12 +316,13 @@ SYSCALL_DEFINE2(memfd_create,
 
 	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
 		pr_warn_once(
-			"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n",
-			task_pid_nr(current), get_task_comm(comm, current));
+			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
+			current->comm, task_pid_nr(current));
 	}
 
-	if (check_sysctl_memfd_noexec(&flags) < 0)
-		return -EACCES;
+	error = check_sysctl_memfd_noexec(&flags);
+	if (error < 0)
+		return error;
 
 	/* length includes terminating zero */
 	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 8eb49204f9ea..8b7390ad81d1 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1145,11 +1145,23 @@ static void test_sysctl_child(void)
 
 	printf("%s sysctl 2\n", memfd_str);
 	sysctl_assert_write("2");
-	mfd_fail_new("kern_memfd_sysctl_2",
-		MFD_CLOEXEC | MFD_ALLOW_SEALING);
-	mfd_fail_new("kern_memfd_sysctl_2_MFD_EXEC",
-		MFD_CLOEXEC | MFD_EXEC);
-	fd = mfd_assert_new("", 0, MFD_NOEXEC_SEAL);
+	mfd_fail_new("kern_memfd_sysctl_2_exec",
+		     MFD_EXEC | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+	fd = mfd_assert_new("kern_memfd_sysctl_2_dfl",
+			    mfd_def_size,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_mode(fd, 0666);
+	mfd_assert_has_seals(fd, F_SEAL_EXEC);
+	mfd_fail_chmod(fd, 0777);
+	close(fd);
+
+	fd = mfd_assert_new("kern_memfd_sysctl_2_noexec_seal",
+			    mfd_def_size,
+			    MFD_NOEXEC_SEAL | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_mode(fd, 0666);
+	mfd_assert_has_seals(fd, F_SEAL_EXEC);
+	mfd_fail_chmod(fd, 0777);
 	close(fd);
 
 	sysctl_fail_write("0");

-- 
2.41.0



* [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-08-14  8:40 [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
  2023-08-14  8:40 ` [PATCH v2 1/5] selftests: memfd: error out test process when child test fails Aleksa Sarai
  2023-08-14  8:40 ` [PATCH v2 2/5] memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2 Aleksa Sarai
@ 2023-08-14  8:40 ` Aleksa Sarai
  2023-08-22  9:10   ` Christian Brauner
  2023-09-01  5:13   ` Damian Tometzki
  2023-08-14  8:41 ` [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy Aleksa Sarai
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-14  8:40 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp
  Cc: Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest, Aleksa Sarai

In order to incentivise userspace to switch to passing MFD_EXEC and
MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call
memfd_create() without the new flags. pr_warn_once() is not useful
because on most systems the one warning is burned up during the boot
process (on my system, systemd does this within the first second of
boot) and thus userspace will in practice never see the warnings to push
them to switch to the new flags.

The original patchset[1] used pr_warn_ratelimited(), however there were
concerns about the degree of spam in the kernel log[2,3]. The resulting
inability to detect every case was flagged as an issue at the time[4].

While we could come up with an alternative rate-limiting scheme such as
only outputting the message if vm.memfd_noexec has been modified, or
only outputting the message once for a given task, these alternatives
have downsides that don't make sense given how low-stakes a single
kernel warning message is. Switching to pr_info_ratelimited() instead
should be fine -- it's possible some monitoring tool will be unhappy
with a stream of warning-level messages but there's already plenty of
info-level message spam in dmesg.
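
To make the difference between "once" and "rate-limited" concrete, here is
a small userspace model. It is only a sketch: the struct and function names
are hypothetical, time is passed in explicitly as seconds, and the window
and burst values used by callers are illustrative rather than the kernel's
actual settings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_ratelimit {
	long interval;	/* window length, in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages emitted in the current window */
	bool started;	/* false until the first message is seen */
};

/* Allow up to 'burst' messages per 'interval'; later windows start fresh. */
static bool sketch_ratelimit_allow(struct sketch_ratelimit *rs, long now)
{
	assert(rs != NULL);

	if (!rs->started || now - rs->begin > rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

/* "once" semantics: the first caller wins, everyone after is silent. */
static bool sketch_once_allow(bool *done)
{
	assert(done != NULL);

	if (*done)
		return false;
	*done = true;
	return true;
}
```

Under the "once" model a message consumed during early boot is gone for the
lifetime of the system, whereas the rate-limited model lets a program
started hours later still produce a message.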

[1]: https://lore.kernel.org/20221215001205.51969-4-jeffxu@google.com/
[2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/

Cc: stable@vger.kernel.org # v6.3+
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 mm/memfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memfd.c b/mm/memfd.c
index d65485c762de..aa46521057ab 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -315,7 +315,7 @@ SYSCALL_DEFINE2(memfd_create,
 		return -EINVAL;
 
 	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
-		pr_warn_once(
+		pr_info_ratelimited(
 			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
 			current->comm, task_pid_nr(current));
 	}

-- 
2.41.0



* [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
  2023-08-14  8:40 [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
                   ` (2 preceding siblings ...)
  2023-08-14  8:40 ` [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags Aleksa Sarai
@ 2023-08-14  8:41 ` Aleksa Sarai
  2023-08-16  5:13   ` Jeff Xu
  2023-08-14  8:41 ` [PATCH v2 5/5] selftests: improve vm.memfd_noexec sysctl tests Aleksa Sarai
  2023-08-16  5:08 ` [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Jeff Xu
  5 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-14  8:41 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp
  Cc: Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest, Aleksa Sarai

This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you
were to set this sysctl to a more restrictive option in the host pidns
you would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and
thus it should not be possible to disable. Aside from the fact that we
have plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

  % cat mount-memfd.c
  #define _GNU_SOURCE	/* for asprintf() */
  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mount.h>	/* fsopen(), fsconfig(), fsmount() (glibc >= 2.36) */

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
  	/* Create a new tmpfs instance that is never attached to the mount tree. */
  	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
  	assert(fsfd >= 0);
  	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));

  	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
  	assert(dfd >= 0);

  	/* Write an executable script into the private tmpfs. */
  	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
  	assert(execfd >= 0);
  	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
  	assert(!close(execfd));

  	/* Re-open it as an O_PATH fd and execute it through /proc/self/fd. */
  	char *execpath = NULL;
  	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
  	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
  	assert(execfd >= 0);
  	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
  	assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host
filesystem (not to mention the many other things a CAP_SYS_ADMIN process
would be able to do that would be equivalent or worse), it seems strange
to cause a fair amount of headache to admins when there doesn't appear
to be an actual security benefit to blocking this. There appear to be
concerns about confused-deputy-esque attacks[2] but a confused deputy that
can write to arbitrary sysctls is a bigger security issue than
executable memfds.

/* New API */

The primary requirement from the original author appears to be more
based on the need to be able to restrict an entire system in a
hierarchical manner[3], such that child namespaces cannot re-enable
executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The lowest value you can now set
vm.memfd_noexec to is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).
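
The rules described in this section can be modelled with a plain linked
structure. This is a userspace sketch only: struct sketch_pidns and the three
functions are hypothetical names, and locking, capabilities and the sysctl
plumbing are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_pidns {
	struct sketch_pidns *parent;	/* NULL for the initial namespace */
	int scope;			/* this namespace's own setting, 0..2 */
};

/* Most restrictive value found between ns and the root (0 if ns is NULL). */
static int effective_scope(const struct sketch_pidns *ns)
{
	int scope = 0;

	for (; ns; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/* A write may not go below the parent's effective value, nor above 2. */
static int set_scope(struct sketch_pidns *ns, int value)
{
	assert(ns != NULL);

	if (value < effective_scope(ns->parent) || value > 2)
		return -EINVAL;
	ns->scope = value;
	return 0;
}

/* A new namespace starts with a copy of its parent's effective value. */
static void init_child(struct sketch_pidns *child, struct sketch_pidns *parent)
{
	assert(child != NULL);

	child->parent = parent;
	child->scope = effective_scope(parent);
}
```

In this model, raising the value on a parent is visible in every descendant
immediately, while lowering it on the parent leaves a descendant's own copy
in place until that descendant lowers it too.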

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However, it should be noted that the setting is now completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed
it wouldn't propagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace issues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org # v6.3+
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/pid_namespace.h | 23 ++++++++++++++++++++++-
 kernel/pid.c                  |  3 +++
 kernel/pid_namespace.c        |  6 +++---
 kernel/pid_sysctl.h           | 28 ++++++++++++----------------
 mm/memfd.c                    |  3 ++-
 5 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 53974d79d98e..f9f9931e02d6 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -39,7 +39,6 @@ struct pid_namespace {
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-	/* sysctl for vm.memfd_noexec */
 	int memfd_noexec_scope;
 #endif
 } __randomize_layout;
@@ -56,6 +55,23 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	int scope = MEMFD_NOEXEC_SCOPE_EXEC;
+
+	for (; ns; ns = ns->parent)
+		scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));
+
+	return scope;
+}
+#else
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+#endif
+
 extern struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
@@ -70,6 +86,11 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+
 static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index 6a1d23a11026..fee14a4486a3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -83,6 +83,9 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
+#endif
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0bf44afe04dd..619972c78774 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -110,9 +110,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->user_ns = get_user_ns(user_ns);
 	ns->ucounts = ucounts;
 	ns->pid_allocated = PIDNS_ADDING;
-
-	initialize_memfd_noexec_scope(ns);
-
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
 	return ns;
 
 out_free_idr:
diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h
index b26e027fc9cd..2ee41a3a1dfd 100644
--- a/kernel/pid_sysctl.h
+++ b/kernel/pid_sysctl.h
@@ -5,33 +5,30 @@
 #include <linux/pid_namespace.h>
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns)
-{
-	ns->memfd_noexec_scope =
-		task_active_pid_ns(current)->memfd_noexec_scope;
-}
-
 static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
 	int write, void *buf, size_t *lenp, loff_t *ppos)
 {
 	struct pid_namespace *ns = task_active_pid_ns(current);
 	struct ctl_table table_copy;
+	int err, scope, parent_scope;
 
 	if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	table_copy = *table;
-	if (ns != &init_pid_ns)
-		table_copy.data = &ns->memfd_noexec_scope;
 
-	/*
-	 * set minimum to current value, the effect is only bigger
-	 * value is accepted.
-	 */
-	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
-		table_copy.extra1 = table_copy.data;
+	/* You cannot set a lower enforcement value than your parent. */
+	parent_scope = pidns_memfd_noexec_scope(ns->parent);
+	/* Equivalent to pidns_memfd_noexec_scope(ns). */
+	scope = max(READ_ONCE(ns->memfd_noexec_scope), parent_scope);
+
+	table_copy.data = &scope;
+	table_copy.extra1 = &parent_scope;
 
-	return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	err = proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	if (!err && write)
+		WRITE_ONCE(ns->memfd_noexec_scope, scope);
+	return err;
 }
 
 static struct ctl_table pid_ns_ctl_table_vm[] = {
@@ -51,7 +48,6 @@ static inline void register_pid_ns_sysctl_table_vm(void)
 	register_sysctl("vm", pid_ns_ctl_table_vm);
 }
 #else
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {}
 static inline void register_pid_ns_sysctl_table_vm(void) {}
 #endif
 
diff --git a/mm/memfd.c b/mm/memfd.c
index aa46521057ab..1cad1904fc26 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -271,7 +271,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
 #ifdef CONFIG_SYSCTL
-	int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	int sysctl = pidns_memfd_noexec_scope(ns);
 
 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
 		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)

-- 
2.41.0



* [PATCH v2 5/5] selftests: improve vm.memfd_noexec sysctl tests
  2023-08-14  8:40 [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
                   ` (3 preceding siblings ...)
  2023-08-14  8:41 ` [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy Aleksa Sarai
@ 2023-08-14  8:41 ` Aleksa Sarai
  2023-08-16  5:08 ` [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Jeff Xu
  5 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-14  8:41 UTC (permalink / raw)
  To: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp
  Cc: Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest, Aleksa Sarai

This adds proper tests for the nesting functionality of vm.memfd_noexec
as well as some minor cleanups to spawn_*_thread().
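
The new sysctl_assert_equal() helper compares the sysctl's textual value
after cutting off the trailing newline. A standalone sketch of that trimming
step, written so that it also stops at the NUL terminator;
trim_at_whitespace() is a hypothetical name and not part of the patch:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: cut buf at the first whitespace byte, or at NUL. */
static void trim_at_whitespace(char *buf)
{
	char *p = buf;

	assert(buf != NULL);

	while (*p != '\0' && !isspace((unsigned char)*p))
		p++;
	*p = '\0';
}
```

The cast to unsigned char keeps isspace() well-defined for bytes outside the
ASCII range.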

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/memfd/memfd_test.c | 339 +++++++++++++++++++++--------
 1 file changed, 254 insertions(+), 85 deletions(-)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 8b7390ad81d1..3df008677239 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -18,6 +18,7 @@
 #include <sys/syscall.h>
 #include <sys/wait.h>
 #include <unistd.h>
+#include <ctype.h>
 
 #include "common.h"
 
@@ -43,7 +44,6 @@
  */
 static size_t mfd_def_size = MFD_DEF_SIZE;
 static const char *memfd_str = MEMFD_STR;
-static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *));
 static int newpid_thread_fn2(void *arg);
 static void join_newpid_thread(pid_t pid);
 
@@ -96,12 +96,12 @@ static void sysctl_assert_write(const char *val)
 	int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC);
 
 	if (fd < 0) {
-		printf("open sysctl failed\n");
+		printf("open sysctl failed: %m\n");
 		abort();
 	}
 
 	if (write(fd, val, strlen(val)) < 0) {
-		printf("write sysctl failed\n");
+		printf("write sysctl %s failed: %m\n", val);
 		abort();
 	}
 }
@@ -111,7 +111,7 @@ static void sysctl_fail_write(const char *val)
 	int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC);
 
 	if (fd < 0) {
-		printf("open sysctl failed\n");
+		printf("open sysctl failed: %m\n");
 		abort();
 	}
 
@@ -122,6 +122,33 @@ static void sysctl_fail_write(const char *val)
 	}
 }
 
+static void sysctl_assert_equal(const char *val)
+{
+	char *p, buf[128] = {};
+	int fd = open("/proc/sys/vm/memfd_noexec", O_RDONLY | O_CLOEXEC);
+
+	if (fd < 0) {
+		printf("open sysctl failed: %m\n");
+		abort();
+	}
+
+	if (read(fd, buf, sizeof(buf)) < 0) {
+		printf("read sysctl failed: %m\n");
+		abort();
+	}
+
+	/* Strip trailing whitespace. */
+	p = buf;
+	while (!isspace(*p))
+		p++;
+	*p = '\0';
+
+	if (strcmp(buf, val) != 0) {
+		printf("unexpected sysctl value: expected %s, got %s\n", val, buf);
+		abort();
+	}
+}
+
 static int mfd_assert_reopen_fd(int fd_in)
 {
 	int fd;
@@ -736,7 +763,7 @@ static int idle_thread_fn(void *arg)
 	return 0;
 }
 
-static pid_t spawn_idle_thread(unsigned int flags)
+static pid_t spawn_thread(unsigned int flags, int (*fn)(void *), void *arg)
 {
 	uint8_t *stack;
 	pid_t pid;
@@ -747,10 +774,7 @@ static pid_t spawn_idle_thread(unsigned int flags)
 		abort();
 	}
 
-	pid = clone(idle_thread_fn,
-		    stack + STACK_SIZE,
-		    SIGCHLD | flags,
-		    NULL);
+	pid = clone(fn, stack + STACK_SIZE, SIGCHLD | flags, arg);
 	if (pid < 0) {
 		printf("clone() failed: %m\n");
 		abort();
@@ -759,6 +783,33 @@ static pid_t spawn_idle_thread(unsigned int flags)
 	return pid;
 }
 
+static void join_thread(pid_t pid)
+{
+	int wstatus;
+
+	if (waitpid(pid, &wstatus, 0) < 0) {
+		printf("newpid thread: waitpid() failed: %m\n");
+		abort();
+	}
+
+	if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) {
+		printf("newpid thread: exited with non-zero error code %d\n",
+		       WEXITSTATUS(wstatus));
+		abort();
+	}
+
+	if (WIFSIGNALED(wstatus)) {
+		printf("newpid thread: killed by signal %d\n",
+		       WTERMSIG(wstatus));
+		abort();
+	}
+}
+
+static pid_t spawn_idle_thread(unsigned int flags)
+{
+	return spawn_thread(flags, idle_thread_fn, NULL);
+}
+
 static void join_idle_thread(pid_t pid)
 {
 	kill(pid, SIGTERM);
@@ -1111,42 +1162,69 @@ static void test_noexec_seal(void)
 	close(fd);
 }
 
-static void test_sysctl_child(void)
+static void test_sysctl_sysctl0(void)
 {
 	int fd;
-	int pid;
 
-	printf("%s sysctl 0\n", memfd_str);
-	sysctl_assert_write("0");
-	fd = mfd_assert_new("kern_memfd_sysctl_0",
+	sysctl_assert_equal("0");
+
+	fd = mfd_assert_new("kern_memfd_sysctl_0_dfl",
 			    mfd_def_size,
 			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
-
 	mfd_assert_mode(fd, 0777);
 	mfd_assert_has_seals(fd, 0);
 	mfd_assert_chmod(fd, 0644);
 	close(fd);
+}
 
-	printf("%s sysctl 1\n", memfd_str);
-	sysctl_assert_write("1");
-	fd = mfd_assert_new("kern_memfd_sysctl_1",
+static void test_sysctl_set_sysctl0(void)
+{
+	sysctl_assert_write("0");
+	test_sysctl_sysctl0();
+}
+
+static void test_sysctl_sysctl1(void)
+{
+	int fd;
+
+	sysctl_assert_equal("1");
+
+	fd = mfd_assert_new("kern_memfd_sysctl_1_dfl",
 			    mfd_def_size,
 			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_mode(fd, 0666);
+	mfd_assert_has_seals(fd, F_SEAL_EXEC);
+	mfd_fail_chmod(fd, 0777);
+	close(fd);
 
-	printf("%s child ns\n", memfd_str);
-	pid = spawn_newpid_thread(CLONE_NEWPID, newpid_thread_fn2);
-	join_newpid_thread(pid);
+	fd = mfd_assert_new("kern_memfd_sysctl_1_exec",
+			    mfd_def_size,
+			    MFD_CLOEXEC | MFD_EXEC | MFD_ALLOW_SEALING);
+	mfd_assert_mode(fd, 0777);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_chmod(fd, 0644);
+	close(fd);
 
+	fd = mfd_assert_new("kern_memfd_sysctl_1_noexec",
+			    mfd_def_size,
+			    MFD_CLOEXEC | MFD_NOEXEC_SEAL | MFD_ALLOW_SEALING);
 	mfd_assert_mode(fd, 0666);
 	mfd_assert_has_seals(fd, F_SEAL_EXEC);
 	mfd_fail_chmod(fd, 0777);
-	sysctl_fail_write("0");
 	close(fd);
+}
 
-	printf("%s sysctl 2\n", memfd_str);
-	sysctl_assert_write("2");
-	mfd_fail_new("kern_memfd_sysctl_2_exec",
-		     MFD_EXEC | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+static void test_sysctl_set_sysctl1(void)
+{
+	sysctl_assert_write("1");
+	test_sysctl_sysctl1();
+}
+
+static void test_sysctl_sysctl2(void)
+{
+	int fd;
+
+	sysctl_assert_equal("2");
 
 	fd = mfd_assert_new("kern_memfd_sysctl_2_dfl",
 			    mfd_def_size,
@@ -1156,98 +1234,188 @@ static void test_sysctl_child(void)
 	mfd_fail_chmod(fd, 0777);
 	close(fd);
 
-	fd = mfd_assert_new("kern_memfd_sysctl_2_noexec_seal",
+	mfd_fail_new("kern_memfd_sysctl_2_exec",
+		     MFD_CLOEXEC | MFD_EXEC | MFD_ALLOW_SEALING);
+
+	fd = mfd_assert_new("kern_memfd_sysctl_2_noexec",
 			    mfd_def_size,
-			    MFD_NOEXEC_SEAL | MFD_CLOEXEC | MFD_ALLOW_SEALING);
+			    MFD_CLOEXEC | MFD_NOEXEC_SEAL | MFD_ALLOW_SEALING);
 	mfd_assert_mode(fd, 0666);
 	mfd_assert_has_seals(fd, F_SEAL_EXEC);
 	mfd_fail_chmod(fd, 0777);
 	close(fd);
-
-	sysctl_fail_write("0");
-	sysctl_fail_write("1");
 }
 
-static int newpid_thread_fn(void *arg)
+static void test_sysctl_set_sysctl2(void)
 {
-	test_sysctl_child();
-	return 0;
+	sysctl_assert_write("2");
+	test_sysctl_sysctl2();
 }
 
-static void test_sysctl_child2(void)
+static int sysctl_simple_child(void *arg)
 {
 	int fd;
+	int pid;
 
-	sysctl_fail_write("0");
-	fd = mfd_assert_new("kern_memfd_sysctl_1",
-			    mfd_def_size,
-			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	printf("%s sysctl 0\n", memfd_str);
+	test_sysctl_set_sysctl0();
 
-	mfd_assert_mode(fd, 0666);
-	mfd_assert_has_seals(fd, F_SEAL_EXEC);
-	mfd_fail_chmod(fd, 0777);
-	close(fd);
+	printf("%s sysctl 1\n", memfd_str);
+	test_sysctl_set_sysctl1();
+
+	printf("%s sysctl 0\n", memfd_str);
+	test_sysctl_set_sysctl0();
+
+	printf("%s sysctl 2\n", memfd_str);
+	test_sysctl_set_sysctl2();
+
+	printf("%s sysctl 1\n", memfd_str);
+	test_sysctl_set_sysctl1();
+
+	printf("%s sysctl 0\n", memfd_str);
+	test_sysctl_set_sysctl0();
+
+	return 0;
+}
+
+/*
+ * Test sysctl
+ * A very basic test to make sure the core sysctl semantics work.
+ */
+static void test_sysctl_simple(void)
+{
+	int pid = spawn_thread(CLONE_NEWPID, sysctl_simple_child, NULL);
+
+	join_thread(pid);
 }
 
-static int newpid_thread_fn2(void *arg)
+static int sysctl_nested(void *arg)
 {
-	test_sysctl_child2();
+	void (*fn)(void) = arg;
+
+	fn();
 	return 0;
 }
-static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *))
+
+static int sysctl_nested_wait(void *arg)
 {
-	uint8_t *stack;
-	pid_t pid;
+	/* Wait for a SIGCONT. */
+	kill(getpid(), SIGSTOP);
+	return sysctl_nested(arg);
+}
 
-	stack = malloc(STACK_SIZE);
-	if (!stack) {
-		printf("malloc(STACK_SIZE) failed: %m\n");
-		abort();
-	}
+static void test_sysctl_sysctl1_failset(void)
+{
+	sysctl_fail_write("0");
+	test_sysctl_sysctl1();
+}
 
-	pid = clone(fn,
-		    stack + STACK_SIZE,
-		    SIGCHLD | flags,
-		    NULL);
-	if (pid < 0) {
-		printf("clone() failed: %m\n");
-		abort();
-	}
+static void test_sysctl_sysctl2_failset(void)
+{
+	sysctl_fail_write("1");
+	test_sysctl_sysctl2();
 
-	return pid;
+	sysctl_fail_write("0");
+	test_sysctl_sysctl2();
 }
 
-static void join_newpid_thread(pid_t pid)
+static int sysctl_nested_child(void *arg)
 {
-	int wstatus;
+	int fd;
+	int pid;
 
-	if (waitpid(pid, &wstatus, 0) < 0) {
-		printf("newpid thread: waitpid() failed: %m\n");
-		abort();
-	}
+	printf("%s nested sysctl 0\n", memfd_str);
+	sysctl_assert_write("0");
+	/* A further nested pidns works the same. */
+	pid = spawn_thread(CLONE_NEWPID, sysctl_simple_child, NULL);
+	join_thread(pid);
 
-	if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) != 0) {
-		printf("newpid thread: exited with non-zero error code %d\n",
-		       WEXITSTATUS(wstatus));
-		abort();
-	}
+	printf("%s nested sysctl 1\n", memfd_str);
+	sysctl_assert_write("1");
+	/* Child inherits our setting. */
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested, test_sysctl_sysctl1);
+	join_thread(pid);
+	/* Child cannot relax the setting (write a lower value). */
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested,
+			   test_sysctl_sysctl1_failset);
+	join_thread(pid);
+	/* Child can tighten the setting (write a higher value). */
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested,
+			   test_sysctl_set_sysctl2);
+	join_thread(pid);
+	/* Child tightening its setting has no effect on our setting. */
+	test_sysctl_sysctl1();
+
+	printf("%s nested sysctl 2\n", memfd_str);
+	sysctl_assert_write("2");
+	/* Child inherits our setting. */
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested, test_sysctl_sysctl2);
+	join_thread(pid);
+	/* Child cannot relax the setting (write a lower value). */
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested,
+			   test_sysctl_sysctl2_failset);
+	join_thread(pid);
+
+	/* Verify that the rules are actually inherited after fork. */
+	printf("%s nested sysctl 0 -> 1 after fork\n", memfd_str);
+	sysctl_assert_write("0");
 
-	if (WIFSIGNALED(wstatus)) {
-		printf("newpid thread: killed by signal %d\n",
-		       WTERMSIG(wstatus));
-		abort();
-	}
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait,
+			   test_sysctl_sysctl1_failset);
+	sysctl_assert_write("1");
+	kill(pid, SIGCONT);
+	join_thread(pid);
+
+	printf("%s nested sysctl 0 -> 2 after fork\n", memfd_str);
+	sysctl_assert_write("0");
+
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait,
+			   test_sysctl_sysctl2_failset);
+	sysctl_assert_write("2");
+	kill(pid, SIGCONT);
+	join_thread(pid);
+
+	/*
+	 * Verify that the current effective setting is saved on fork, meaning
+	 * that the parent lowering the sysctl doesn't affect already-forked
+	 * children.
+	 */
+	printf("%s nested sysctl 2 -> 1 after fork\n", memfd_str);
+	sysctl_assert_write("2");
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait,
+			   test_sysctl_sysctl2);
+	sysctl_assert_write("1");
+	kill(pid, SIGCONT);
+	join_thread(pid);
+
+	printf("%s nested sysctl 2 -> 0 after fork\n", memfd_str);
+	sysctl_assert_write("2");
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait,
+			   test_sysctl_sysctl2);
+	sysctl_assert_write("0");
+	kill(pid, SIGCONT);
+	join_thread(pid);
+
+	printf("%s nested sysctl 1 -> 0 after fork\n", memfd_str);
+	sysctl_assert_write("1");
+	pid = spawn_thread(CLONE_NEWPID, sysctl_nested_wait,
+			   test_sysctl_sysctl1);
+	sysctl_assert_write("0");
+	kill(pid, SIGCONT);
+	join_thread(pid);
+
+	return 0;
 }
 
 /*
- * Test sysctl
- * A very basic sealing test to see whether setting/retrieving seals works.
+ * Test sysctl with nested pid namespaces
+ * Make sure that the sysctl nesting semantics work correctly.
  */
-static void test_sysctl(void)
+static void test_sysctl_nested(void)
 {
-	int pid = spawn_newpid_thread(CLONE_NEWPID, newpid_thread_fn);
+	int pid = spawn_thread(CLONE_NEWPID, sysctl_nested_child, NULL);
 
-	join_newpid_thread(pid);
+	join_thread(pid);
 }
 
 /*
@@ -1433,6 +1601,9 @@ int main(int argc, char **argv)
 	test_seal_grow();
 	test_seal_resize();
 
+	test_sysctl_simple();
+	test_sysctl_nested();
+
 	test_share_dup("SHARE-DUP", "");
 	test_share_mmap("SHARE-MMAP", "");
 	test_share_open("SHARE-OPEN", "");
@@ -1447,8 +1618,6 @@ int main(int argc, char **argv)
 	test_share_fork("SHARE-FORK", SHARED_FT_STR);
 	join_idle_thread(pid);
 
-	test_sysctl();
-
 	printf("memfd: DONE\n");
 
 	return 0;

-- 
2.41.0
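
The nesting rules that the selftests in this patch exercise can be modelled
in plain userspace C. What follows is only a sketch of those rules as the
test comments describe them, not the kernel implementation: struct
model_pidns and the model_* functions are hypothetical names, and the
failure return value is a placeholder rather than the errno the sysctl
handler actually reports.

```c
#include <stddef.h>

/* Values mirror the vm.memfd_noexec sysctl: 0, 1 or 2. */
enum { MODEL_SCOPE_MIN = 0, MODEL_SCOPE_MAX = 2 };

/* Hypothetical userspace stand-in for a pid namespace. */
struct model_pidns {
	struct model_pidns *parent;	/* NULL for the initial namespace */
	int scope;			/* value last stored in this namespace */
};

/* Effective setting: the most restrictive value along the ancestor chain. */
static int model_effective_scope(const struct model_pidns *ns)
{
	int scope = MODEL_SCOPE_MIN;

	for (; ns; ns = ns->parent)
		if (ns->scope > scope)
			scope = ns->scope;
	return scope;
}

/*
 * A write is accepted only if it is in range and not below the parent's
 * effective setting. Returns 0 on success, -1 (placeholder) on refusal.
 */
static int model_write_scope(struct model_pidns *ns, int val)
{
	if (val < MODEL_SCOPE_MIN || val > MODEL_SCOPE_MAX)
		return -1;
	if (val < model_effective_scope(ns->parent))
		return -1;
	ns->scope = val;
	return 0;
}

/* A new namespace saves its parent's effective setting at creation time. */
static void model_new_pidns(struct model_pidns *child,
			    struct model_pidns *parent)
{
	child->parent = parent;
	child->scope = model_effective_scope(parent);
}
```

With a parent at 1, a child created by model_new_pidns() starts at 1, is
refused when it writes 0, may write 2, and the parent keeps reading 1. That
is the same sequence as the "nested sysctl 1" block in sysctl_nested_child().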


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec
  2023-08-14  8:40 [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Aleksa Sarai
                   ` (4 preceding siblings ...)
  2023-08-14  8:41 ` [PATCH v2 5/5] selftests: improve vm.memfd_noexec sysctl tests Aleksa Sarai
@ 2023-08-16  5:08 ` Jeff Xu
  2023-08-19  2:50   ` Aleksa Sarai
  5 siblings, 1 reply; 18+ messages in thread
From: Jeff Xu @ 2023-08-16  5:08 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> The most critical issue with vm.memfd_noexec=2 (the fact that passing
> MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
> tree[2], but there are still some outstanding issues that need to be
> addressed:
>
>  * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
>    because it will make it far too difficult to ever migrate. Instead it
>    should imply MFD_EXEC.
>
>  * The dmesg warnings are pr_warn_once(), which on most systems means
>    that they will be used up by systemd or some other boot process and
>    userspace developers will never see it.
>
>    - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
>      rate-limited message to the kernel log is necessary to tell
>      userspace that they should add the new flags.
>
>      Arguably the most ideal way to deal with the spam concern[3,4]
>      while still prompting userspace to switch to the new flags would be
>      to only log the warning once per task or something similar.
>      However, adding something to task_struct for tracking this would be
>      needless bloat for a single pr_warn_ratelimited().
>
>      So just switch to pr_info_ratelimited() to avoid spamming the log
>      with something that isn't a real warning. There's lots of
>      info-level stuff in dmesg, it seems really unlikely that this
>      should be an actual problem. Most programs are already switching to
>      the new flags anyway.
>
>    - For the vm.memfd_noexec=2 case, we need to log a warning for every
>      failure because otherwise userspace will have no idea why their
>      previously working program started returning -EACCES (previously
>      -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.
>
>  * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
>    unappealing for most users to enable the sysctl because enabling it
>    on &init_pid_ns means you need a system reboot to unset it. Given the
>    actual security threat being protected against, CAP_SYS_ADMIN users
>    being restricted in this way makes little sense.
>
>    The argument for this ratcheting by the original author was that it
>    allows you to have a hierarchical setting that cannot be unset by
>    child pidnses, but this is not accurate -- changing the parent
>    pidns's vm.memfd_noexec setting to be more restrictive didn't affect
>    children.
>
That is not exactly what I said though.
From ChromeOS's position, allowing a downgrade is less secure. This
setting was designed from the very beginning to be set at startup/reboot
time, e.g. from the kernel command line or as part of the container
runtime environment (which passes it on to the sandboxed container).
I understand your viewpoint: from another distribution's point of view,
the original design might be too restrictive, so if the kernel wants to
give more weight to ease of administration, I'm OK with your approach.
It is less secure for ChromeOS, though, since we try to prevent
arbitrary code execution as much as possible, even for CAP_SYS_ADMIN.
With this change, that is one more possibility for us to consider.

>    Instead, switch the vm.memfd_noexec sysctl to be properly
>    hierarchical and allow CAP_SYS_ADMIN users (in the pidns's owning
>    userns) to lower the setting as long as it is not lower than the
>    parent's effective setting. This change also makes it so that
>    changing a parent pidns's vm.memfd_noexec will affect all
>    descendants, providing a properly hierarchical setting. The
>    performance impact of this is incredibly minimal since the maximum
>    depth of pidns is 32 and it is only checked during memfd_create(2)
>    and unshare(CLONE_NEWPID).
>
>  * The memfd selftests would not exit with a non-zero error code when
>    certain tests that ran in a forked process (specifically the ones
>    related to MFD_EXEC and MFD_NOEXEC_SEAL) failed.
>
> [1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/
> [2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/
> [3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
> [4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/
>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
> Changes in v2:
> - Make vm.memfd_noexec restrictions properly hierarchical.
> - Allow vm.memfd_noexec setting to be lowered by CAP_SYS_ADMIN as long
>   as it is not lower than the parent's effective setting.
> - Fix the logging behaviour related to the new flags and
>   vm.memfd_noexec=2.
> - Add more thorough tests for vm.memfd_noexec in selftests.
> - v1: <https://lore.kernel.org/r/20230713143406.14342-1-cyphar@cyphar.com>
>
> ---
> Aleksa Sarai (5):
>       selftests: memfd: error out test process when child test fails
>       memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
>       memfd: improve userspace warnings for missing exec-related flags
>       memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
>       selftests: improve vm.memfd_noexec sysctl tests
>
>  include/linux/pid_namespace.h              |  39 ++--
>  kernel/pid.c                               |   3 +
>  kernel/pid_namespace.c                     |   6 +-
>  kernel/pid_sysctl.h                        |  28 ++-
>  mm/memfd.c                                 |  33 ++-
>  tools/testing/selftests/memfd/memfd_test.c | 332 +++++++++++++++++++++++------
>  6 files changed, 322 insertions(+), 119 deletions(-)
> ---
> base-commit: 3ff995246e801ea4de0a30860a1d8da4aeb538e7
> change-id: 20230803-memfd-vm-noexec-uapi-fixes-ace725c67b0f
>
> Best regards,
> --
> Aleksa Sarai <cyphar@cyphar.com>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
  2023-08-14  8:41 ` [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy Aleksa Sarai
@ 2023-08-16  5:13   ` Jeff Xu
  2023-08-16  5:44     ` Dominique Martinet
  0 siblings, 1 reply; 18+ messages in thread
From: Jeff Xu @ 2023-08-16  5:13 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> This sysctl has the very unusual behaviour of not allowing any user (even
> CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you
> were to set this sysctl to a more restrictive option in the host pidns
> you would need to reboot your machine in order to reset it.
>
> The justification given in [1] is that this is a security feature and
> thus it should not be possible to disable. Aside from the fact that we
> have plenty of security-related sysctls that can be disabled after being
> enabled (fs.protected_symlinks for instance), the protection provided by
> the sysctl is to stop users from being able to create a binary and then
> execute it. A user with CAP_SYS_ADMIN can trivially do this without
> memfd_create(2):
>
>   % cat mount-memfd.c
>   #define _GNU_SOURCE  /* asprintf() */
>   #include <assert.h>
>   #include <fcntl.h>
>   #include <string.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <unistd.h>
>   #include <sys/mount.h>  /* fsopen() and friends, assuming glibc >= 2.36 */
>
>   #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
>
>   int main(void)
>   {
>         int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
>         assert(fsfd >= 0);
>         /* FSCONFIG_CMD_CREATE takes no key, value or aux argument. */
>         assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
>
>         int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
>         assert(dfd >= 0);
>
>         int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0700);
>         assert(execfd >= 0);
>         assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
>         assert(!close(execfd));
>
>         char *execpath = NULL;
>         char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
>         execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
>         assert(execfd >= 0);
>         assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
>         assert(!execve(execpath, argv, envp));
>   }
>   % ./mount-memfd
>   this file was executed from this totally private tmpfs: /proc/self/fd/5
>   %
>
> Given that it is possible for CAP_SYS_ADMIN users to create executable
> binaries without memfd_create(2) and without touching the host
> filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> would be able to do that would be equivalent or worse), it seems strange
> to cause a fair amount of headache to admins when there doesn't appear
> to be an actual security benefit to blocking this. There appear to be
> concerns about confused-deputy-esque attacks[2] but a confused deputy that
> can write to arbitrary sysctls is a bigger security issue than
> executable memfds.
>
Something to point out: the demo code might be enough to prove your
case on other distributions, but on ChromeOS you can't run this
code. The executables on ChromeOS all come from known sources and are
verified at boot.
If an attacker could run this code on ChromeOS, the attacker would
already have acquired arbitrary code execution some other way. At that
point they no longer need to create or find an executable memfd; they
already have the vehicle. You can't use an example of an attacker
already running arbitrary code to prove that disallowing downgrades is
useless.
I agree that an attacker being able to modify a sysctl is already a big
problem. But suppose this happens by controlling the arguments passed
to a sysctl write: at that point the attacker might not have full
arbitrary code execution yet. That is the reason the original design is
so restrictive.

Best regards,
-Jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
  2023-08-16  5:13   ` Jeff Xu
@ 2023-08-16  5:44     ` Dominique Martinet
  2023-08-16 22:46       ` Jeff Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Dominique Martinet @ 2023-08-16  5:44 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Aleksa Sarai, Andrew Morton, Shuah Khan, Kees Cook,
	Daniel Verkamp, Christian Brauner, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:
> > Given that it is possible for CAP_SYS_ADMIN users to create executable
> > binaries without memfd_create(2) and without touching the host
> > filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> > would be able to do that would be equivalent or worse), it seems strange
> > to cause a fair amount of headache to admins when there doesn't appear
> > to be an actual security benefit to blocking this. There appear to be
> > concerns about confused-deputy-esque attacks[2] but a confused deputy that
> > can write to arbitrary sysctls is a bigger security issue than
> > executable memfds.
> >
> Something to point out: the demo code might be enough to prove your
> case on other distributions, but on ChromeOS you can't run this
> code. The executables on ChromeOS all come from known sources and are
> verified at boot.
> If an attacker could run this code on ChromeOS, the attacker would
> already have acquired arbitrary code execution some other way. At that
> point they no longer need to create or find an executable memfd; they
> already have the vehicle. You can't use an example of an attacker
> already running arbitrary code to prove that disallowing downgrades is
> useless.
> I agree that an attacker being able to modify a sysctl is already a big
> problem. But suppose this happens by controlling the arguments passed
> to a sysctl write: at that point the attacker might not have full
> arbitrary code execution yet. That is the reason the original design is
> so restrictive.

I don't understand how you can say an attacker cannot run arbitrary code
within a process here, yet assert that they'd somehow run memfd_create +
execveat on it if this sysctl is lowered -- the two look equivalent to
me?

CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
soon as you can run arbitrary code (just have a look at the various
container escape examples when the capability is given); I see little
point in trying to harden just this here.
It'd make more sense to limit all sysctl modifications in the context
you're thinking of through e.g. selinux or another LSM.

(in the context of users making their own containers, my suggestion is
always to never use CAP_SYS_ADMIN, or if they must give it to a separate
minimal container where they can limit user interaction)


FWIW, I also think the proposed =2 behaviour makes more sense, but this
is something we already discussed last month so I won't come back to it
as I'm not really involved here.

-- 
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
  2023-08-16  5:44     ` Dominique Martinet
@ 2023-08-16 22:46       ` Jeff Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Xu @ 2023-08-16 22:46 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Aleksa Sarai, Andrew Morton, Shuah Khan, Kees Cook,
	Daniel Verkamp, Christian Brauner, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On Tue, Aug 15, 2023 at 10:44 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:
> > > Given that it is possible for CAP_SYS_ADMIN users to create executable
> > > binaries without memfd_create(2) and without touching the host
> > > filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> > > would be able to do that would be equivalent or worse), it seems strange
> > > to cause a fair amount of headache to admins when there doesn't appear
> > > to be an actual security benefit to blocking this. There appear to be
> > > concerns about confused-deputy-esque attacks[2] but a confused deputy that
> > > can write to arbitrary sysctls is a bigger security issue than
> > > executable memfds.
> > >
> > Something to point out: the demo code might be enough to prove your
> > case on other distributions, but on ChromeOS you can't run this
> > code. The executables on ChromeOS all come from known sources and are
> > verified at boot.
> > If an attacker could run this code on ChromeOS, the attacker would
> > already have acquired arbitrary code execution some other way. At that
> > point they no longer need to create or find an executable memfd; they
> > already have the vehicle. You can't use an example of an attacker
> > already running arbitrary code to prove that disallowing downgrades is
> > useless.
> > I agree that an attacker being able to modify a sysctl is already a big
> > problem. But suppose this happens by controlling the arguments passed
> > to a sysctl write: at that point the attacker might not have full
> > arbitrary code execution yet. That is the reason the original design is
> > so restrictive.
>
> I don't understand how you can say an attacker cannot run arbitrary code
> within a process here, yet assert that they'd somehow run memfd_create +
> execveat on it if this sysctl is lowered -- the two look equivalent to
> me?
>
This attack might require multiple steps. One possible scenario:
1> Control a write primitive in a CAP_SYS_ADMIN process's memory, change
the arguments of a sysctl call, and downgrade the memfd setting, e.g. set
it to 0 to revert to the old behavior (creating executable memfds by
default).
2> Control a non-privileged process that creates and writes to a
memfd, and fill it with the binary that the attacker wants. This
process only needs a non-executable memfd, but hasn't been updated yet.
3> Confuse a non-privileged process into executing the memfd that the
attacker wrote in step 2.

On ChromeOS, because all executables come from verified sources,
attackers typically can't easily use step 3 alone (without step 2),
and memfd was a hole that allowed an unverified executable.

In the original design, downgrading is not allowed, so the attack chain
of steps 2 and 3 is completely blocked. With this new approach,
attackers will try to find an additional step (step 1) to make the old
attack (steps 2 and 3) work again. It is difficult, but I can't say it
is impossible.
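
For the process in step 2 that only ever needs a non-executable memfd,
"being updated" would roughly mean asking for MFD_NOEXEC_SEAL explicitly
instead of relying on the sysctl. A minimal sketch, assuming the raw
syscall is used and that kernels predating the flag reject it with EINVAL;
create_noexec_memfd() is a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/memfd.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U	/* for build hosts with older UAPI headers */
#endif

/*
 * Hypothetical helper: create a memfd that can never be made executable.
 * Returns the new fd, or -1 with errno set.
 */
static int create_noexec_memfd(const char *name)
{
	long fd = syscall(__NR_memfd_create, name,
			  MFD_CLOEXEC | MFD_NOEXEC_SEAL);

	/*
	 * Kernels that do not know the flag fail with EINVAL. Falling back
	 * keeps the program working there, but the memfd is then executable
	 * as before.
	 */
	if (fd < 0 && errno == EINVAL)
		fd = syscall(__NR_memfd_create, name, MFD_CLOEXEC);
	return (int)fd;
}
```

A caller would use the returned fd exactly like any other memfd. On a
kernel that honours the flag, the selftests above expect such a file to
have mode 0666 and F_SEAL_EXEC set.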

> CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
> soon as you can run arbitrary code (just have a look at the various
> container escape examples when the capability is given); I see little
> point in trying to harden just this here.

I'm not an expert in containers; if the industry is giving up on
privileged containers, then the reasoning makes sense.
From ChromeOS's point of view, we don't currently use runc, so I think
it makes more sense for runc users to drive these features. The
original design had runc in mind, and even privileged containers
can't downgrade their own setting.

> It'd make more sense to limit all sysctl modifications in the context
> you're thinking of through e.g. selinux or another LSM.
>
I agree, now that I think more about this.
Security features fit an LSM better: an LSM can apply additional
allow/deny decisions to behavior that is otherwise allowed from user
space code. Based on that, "disallow downgrading" fits an LSM better.
By the same reasoning, I have second thoughts on "=2". Originally
MFD_EXEC was left out of it on the thinking that if user code
explicitly sets MFD_EXEC, the sysctl should not block it; that is the
job of an LSM. However, "=2" has evolved to block MFD_EXEC completely.
Alas, it might be too late to revert this; if this is what devs want,
it can be that way.

Thanks
Best regards,
-Jeff

> (in the context of users making their own containers, my suggestion is
> always to never use CAP_SYS_ADMIN, or if they must give it to a separate
> minimal container where they can limit user interaction)
>
>
> FWIW, I also think the proposed =2 behaviour makes more sense, but this
> is something we already discussed last month so I won't come back to it
> as I'm not really involved here.
>
> --
> Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec
  2023-08-16  5:08 ` [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec Jeff Xu
@ 2023-08-19  2:50   ` Aleksa Sarai
  2023-08-21 19:04     ` Jeff Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2023-08-19  2:50 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On 2023-08-15, Jeff Xu <jeffxu@google.com> wrote:
> On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > The most critical issue with vm.memfd_noexec=2 (the fact that passing
> > MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
> > tree[2], but there are still some outstanding issues that need to be
> > addressed:
> >
> >  * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> >    because it will make it far too difficult to ever migrate. Instead it
> >    should imply MFD_EXEC.
> >
> >  * The dmesg warnings are pr_warn_once(), which on most systems means
> >    that they will be used up by systemd or some other boot process and
> >    userspace developers will never see it.
> >
> >    - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
> >      rate-limited message to the kernel log is necessary to tell
> >      userspace that they should add the new flags.
> >
> >      Arguably the most ideal way to deal with the spam concern[3,4]
> >      while still prompting userspace to switch to the new flags would be
> >      to only log the warning once per task or something similar.
> >      However, adding something to task_struct for tracking this would be
> >      needless bloat for a single pr_warn_ratelimited().
> >
> >      So just switch to pr_info_ratelimited() to avoid spamming the log
> >      with something that isn't a real warning. There's lots of
> >      info-level stuff in dmesg, it seems really unlikely that this
> >      should be an actual problem. Most programs are already switching to
> >      the new flags anyway.
> >
> >    - For the vm.memfd_noexec=2 case, we need to log a warning for every
> >      failure because otherwise userspace will have no idea why their
> >      previously working program started returning -EACCES (previously
> >      -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.
> >
> >  * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
> >    unappealing for most users to enable the sysctl because enabling it
> >    on &init_pid_ns means you need a system reboot to unset it. Given the
> >    actual security threat being protected against, CAP_SYS_ADMIN users
> >    being restricted in this way makes little sense.
> >
> >    The argument for this ratcheting by the original author was that it
> >    allows you to have a hierarchical setting that cannot be unset by
> >    child pidnses, but this is not accurate -- changing the parent
> >    pidns's vm.memfd_noexec setting to be more restrictive didn't affect
> >    children.
> >
> That is not exactly what I said though.

Sorry, I probably should've phrased this as "one of the main arguments".
In the last discussion thread we had in the v1 of this patch, it was my
impression that this was the primary sticking point.

> From ChromeOS's position, allowing a downgrade is less secure. This
> setting was designed from the very beginning to be set at startup/reboot
> time, e.g. from the kernel command line or as part of the container
> runtime environment (which passes it on to the sandboxed container).

If this had been implemented as a cmdline flag, it would be completely
reasonable that you need to reboot to change it. However, it was
implemented as a sysctl and the behaviour of sysctls is that admins can
(generally) change them after they've been set -- even for
security-related sysctls such as the fs.protected_* sysctls. The only
counter-example I know of is the YAMA one, and if I'm being honest I think
that behaviour is also weird.

> I understand your viewpoint. From another distribution's point of view,
> the original design might be too restrictive, so if the kernel wants
> to give more weight to ease of administration, I'm OK with your approach.
> It is less secure for ChromeOS though, i.e. we try to prevent
> arbitrary code execution as much as possible, even for CAP_SYS_ADMIN.
> With this change, there is one more possibility for us to consider.

FWIW I still don't think it makes sense, for the vast majority of
systems (if any), to have memfd_create(MFD_EXEC) restrictions protect
against a threat model where a CAP_SYS_ADMIN process privileged in
&init_user_ns can be tricked into writing a sysctl.

If ChromeOS really wants the old vm.memfd_noexec=2 behaviour to be
enforced, this can be done with a very simple seccomp filter. If applied
to pid1, this would also not be possible to unset without a reboot.
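
As a rough illustration of the seccomp approach (a sketch only, not code
from this series): the filter below fails any memfd_create(2) call that
does not pass MFD_NOEXEC_SEAL with EACCES and allows everything else.
The names memfd_noexec_filter and install_memfd_noexec_filter() are made
up for this example, the flags check assumes a little-endian 64-bit ABI,
and only x86-64 and arm64 are handled.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only handles x86-64 and arm64"
#endif

/*
 * Hypothetical filter: memfd_create() without MFD_NOEXEC_SEAL -> EACCES.
 * Caveat: on x86-64 a real filter must also deal with x32 syscall
 * numbers (__X32_SYSCALL_BIT), which this sketch ignores.
 */
struct sock_filter memfd_noexec_filter[] = {
	/* 0-2: kill the process if the syscall ABI is not the native one. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* 3-4: anything other than memfd_create() jumps to the final allow. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_memfd_create, 0, 3),
	/* 5-7: load the low 32 bits of the flags argument (little-endian). */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
	BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MFD_NOEXEC_SEAL, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
	/* 8: default action. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter for the calling thread; returns 0 on success. */
int install_memfd_noexec_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(memfd_noexec_filter) / sizeof(memfd_noexec_filter[0]),
		.filter = memfd_noexec_filter,
	};

	/* Needed unless the caller has CAP_SYS_ADMIN in its user namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Filters are inherited across fork() and execve() and cannot be removed
once installed, which is what would make this sticky until reboot when
applied early in pid1.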

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH v2 0/5] memfd: cleanups for vm.memfd_noexec
  2023-08-19  2:50   ` Aleksa Sarai
@ 2023-08-21 19:04     ` Jeff Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Jeff Xu @ 2023-08-21 19:04 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Morton, Shuah Khan, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On Fri, Aug 18, 2023 at 7:50 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2023-08-15, Jeff Xu <jeffxu@google.com> wrote:
> > On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > The most critical issue with vm.memfd_noexec=2 (the fact that passing
> > > MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
> > > tree[2], but there are still some outstanding issues that need to be
> > > addressed:
> > >
> > >  * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > >    because it will make it far too difficult to ever migrate. Instead it
> > >    should imply MFD_EXEC.
> > >
> > >  * The dmesg warnings are pr_warn_once(), which on most systems means
> > >    that they will be used up by systemd or some other boot process and
> > >    userspace developers will never see it.
> > >
> > >    - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
> > >      rate-limited message to the kernel log is necessary to tell
> > >      userspace that they should add the new flags.
> > >
> > >      Arguably the most ideal way to deal with the spam concern[3,4]
> > >      while still prompting userspace to switch to the new flags would be
> > >      to only log the warning once per task or something similar.
> > >      However, adding something to task_struct for tracking this would be
> > >      needless bloat for a single pr_warn_ratelimited().
> > >
> > >      So just switch to pr_info_ratelimited() to avoid spamming the log
> > >      with something that isn't a real warning. There's lots of
> > >      info-level stuff in dmesg, it seems really unlikely that this
> > >      should be an actual problem. Most programs are already switching to
> > >      the new flags anyway.
> > >
> > >    - For the vm.memfd_noexec=2 case, we need to log a warning for every
> > >      failure because otherwise userspace will have no idea why their
> > >      previously working program started returning -EACCES (previously
> > >      -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.
> > >
> > >  * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
> > >    unappealing for most users to enable the sysctl because enabling it
> > >    on &init_pid_ns means you need a system reboot to unset it. Given the
> > >    actual security threat being protected against, CAP_SYS_ADMIN users
> > >    being restricted in this way makes little sense.
> > >
> > >    The argument for this ratcheting by the original author was that it
> > >    allows you to have a hierarchical setting that cannot be unset by
> > >    child pidnses, but this is not accurate -- changing the parent
> > >    pidns's vm.memfd_noexec setting to be more restrictive didn't affect
> > >    children.
> > >
> > That is not exactly what I said though.
>
> Sorry, I probably should've phrased this as "one of the main arguments".
> In the last discussion thread we had in the v1 of this patch, it was my
> impression that this was the primary sticking point.
>
> > From ChromeOS's position, allowing a downgrade is less secure, and this
> > setting was designed to be set at startup/reboot time from the very
> > beginning, such as on the kernel command line or as part of the
> > container runtime environment (passed to the sandboxed container).
>
> If this had been implemented as a cmdline flag, it would be completely
> reasonable that you need to reboot to change it. However, it was

You might already know that sysctls can be set on the kernel command
line, thanks to Vlastimil Babka from SUSE. [1]
[1] https://lore.kernel.org/lkml/20200325120345.12946-1-vbabka@suse.cz/
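
For reference, the boot parameter takes the form
sysctl.vm.memfd_noexec=2. A small helper like the one below could check
whether a given command line (for example the contents of /proc/cmdline)
sets it. cmdline_sysctl_value() is a hypothetical name, and the parsing
is deliberately simplistic (first match, space-separated, dotted name
only), so treat it as a sketch:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns the integer value of "sysctl.<name>=<n>" found in cmdline,
 * or -1 if the parameter is not present.
 */
int cmdline_sysctl_value(const char *cmdline, const char *name)
{
	char key[128];
	const char *p = cmdline;
	int n = snprintf(key, sizeof(key), "sysctl.%s=", name);

	if (n < 0 || (size_t)n >= sizeof(key))
		return -1;

	while ((p = strstr(p, key)) != NULL) {
		/* Only accept matches at the start of a parameter. */
		if (p == cmdline || p[-1] == ' ')
			return atoi(p + n);
		p += n;
	}
	return -1;
}
```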

> implemented as a sysctl and the behaviour of sysctls is that admins can
> (generally) change them after they've been set -- even for
> security-related sysctls such as the fs.protected_* sysctls. The only
> counter-example I know of is the YAMA one, and if I'm being honest I think
> that behaviour is also weird.
>

> > I understand your viewpoint. From another distribution's point of view,
> > the original design might be too restrictive, so if the kernel wants
> > to give more weight to ease of administration, I'm OK with your approach.
> > It is less secure for ChromeOS though, i.e. we try to prevent
> > arbitrary code execution as much as possible, even for CAP_SYS_ADMIN.
> > With this change, there is one more possibility for us to consider.
>
> FWIW I still think the threat model where a &init_user_ns-privileged
> CAP_SYS_ADMIN process can be tricked into writing a sysctl should be
> protected against by memfd_create(MFD_EXEC) doesn't really make sense
> for the vast majority of systems (if any).
>
I agree that other distributions might not care much about CAP_SYS_ADMIN
running arbitrary code on the host, similar to traditional Unix in this
respect. ChromeOS has some unique security features.

> If ChromeOS really wants the old vm.memfd_noexec=2 behaviour to be
> enforced, this can be done with a very simple seccomp filter. If applied
> to pid1, this would also not be possible to unset without a reboot.
>
In practice, the host and individual processes can have different values
for vm.memfd_noexec, so this can't easily be implemented through seccomp.
Seccomp also requires no_new_privs to be set, and there are implications
if we set it on pid 1 and it applies to all of its children.


> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>

Thanks
Best regards,
-Jeff


* Re: [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-08-14  8:40 ` [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags Aleksa Sarai
@ 2023-08-22  9:10   ` Christian Brauner
  2023-09-01  5:13   ` Damian Tometzki
  1 sibling, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2023-08-22  9:10 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp,
	Dominique Martinet, stable, linux-api, linux-kernel, linux-mm,
	linux-kselftest

On Mon, Aug 14, 2023 at 06:40:59PM +1000, Aleksa Sarai wrote:
> In order to incentivise userspace to switch to passing MFD_EXEC and
> MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call
> memfd_create() without the new flags. pr_warn_once() is not useful
> because on most systems the one warning is burned up during the boot
> process (on my system, systemd does this within the first second of
> boot) and thus userspace will in practice never see the warnings to push
> them to switch to the new flags.
> 
> The original patchset[1] used pr_warn_ratelimited(), however there were
> concerns about the degree of spam in the kernel log[2,3]. The resulting
> inability to detect every case was flagged as an issue at the time[4].
> 
> While we could come up with an alternative rate-limiting scheme such as
> only outputting the message if vm.memfd_noexec has been modified, or
> only outputting the message once for a given task, these alternatives
> have downsides that don't make sense given how low-stakes a single
> kernel warning message is. Switching to pr_info_ratelimited() instead
> should be fine -- it's possible some monitoring tool will be unhappy
> with a stream of warning-level messages but there's already plenty of
> info-level message spam in dmesg.
> 
> [1]: https://lore.kernel.org/20221215001205.51969-4-jeffxu@google.com/
> [2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
> [3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
> [4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/
> 
> Cc: stable@vger.kernel.org # v6.3+
> Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>


* Re: [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-08-14  8:40 ` [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags Aleksa Sarai
  2023-08-22  9:10   ` Christian Brauner
@ 2023-09-01  5:13   ` Damian Tometzki
  2023-09-02 22:58     ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Damian Tometzki @ 2023-09-01  5:13 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Andrew Morton, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On Mon, 14. Aug 18:40, Aleksa Sarai wrote:
> In order to incentivise userspace to switch to passing MFD_EXEC and
> MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call
> memfd_create() without the new flags. pr_warn_once() is not useful
> because on most systems the one warning is burned up during the boot
> process (on my system, systemd does this within the first second of
> boot) and thus userspace will in practice never see the warnings to push
> them to switch to the new flags.
> 
> The original patchset[1] used pr_warn_ratelimited(), however there were
> concerns about the degree of spam in the kernel log[2,3]. The resulting
> inability to detect every case was flagged as an issue at the time[4].
> 
> While we could come up with an alternative rate-limiting scheme such as
> only outputting the message if vm.memfd_noexec has been modified, or
> only outputting the message once for a given task, these alternatives
> have downsides that don't make sense given how low-stakes a single
> kernel warning message is. Switching to pr_info_ratelimited() instead
> should be fine -- it's possible some monitoring tool will be unhappy
> with a stream of warning-level messages but there's already plenty of
> info-level message spam in dmesg.
> 
> [1]: https://lore.kernel.org/20221215001205.51969-4-jeffxu@google.com/
> [2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
> [3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
> [4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/
> 
> Cc: stable@vger.kernel.org # v6.3+
> Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  mm/memfd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memfd.c b/mm/memfd.c
> index d65485c762de..aa46521057ab 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -315,7 +315,7 @@ SYSCALL_DEFINE2(memfd_create,
>  		return -EINVAL;
>  
>  	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> -		pr_warn_once(
> +		pr_info_ratelimited(
>  			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
>  			current->comm, task_pid_nr(current));
>  	}
> 
> -- 
> 2.41.0
>
Hello Sarai,

I get a lot of messages in dmesg with this patch. dmesg is unusable
because of it:
[ 1390.349462] __do_sys_memfd_create: 5 callbacks suppressed
[ 1390.349468] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.350106] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.350366] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.359390] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.359453] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.848813] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.849425] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.849673] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.857629] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1390.857674] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1404.819637] __do_sys_memfd_create: 105 callbacks suppressed
[ 1404.819641] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1404.819950] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1404.820054] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1404.824240] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1404.824279] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.373186] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.373906] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.374131] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.382397] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.382485] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.499581] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.500077] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.500265] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.512772] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1430.512840] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.388519] __do_sys_memfd_create: 60 callbacks suppressed
[ 1444.388525] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.389061] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.389335] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.397909] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.397965] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.503514] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.503658] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.503726] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.507841] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1444.507870] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 1449.707966] __do_sys_memfd_create: 25 callbacks suppressed

Best regards
Damian
 


* Re: [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-09-01  5:13   ` Damian Tometzki
@ 2023-09-02 22:58     ` Andrew Morton
  2023-09-04  7:09       ` Aleksa Sarai
  2023-09-05 16:20       ` Florian Weimer
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2023-09-02 22:58 UTC (permalink / raw)
  To: Damian Tometzki
  Cc: Aleksa Sarai, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest

On Fri, 1 Sep 2023 07:13:45 +0200 Damian Tometzki <dtometzki@fedoraproject.org> wrote:

> >  	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> > -		pr_warn_once(
> > +		pr_info_ratelimited(
> >  			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
> >  			current->comm, task_pid_nr(current));
> >  	}
> > 
> > -- 
> > 2.41.0
> >
> Hello Sarai,
> 
> I get a lot of messages in dmesg with this patch. dmesg is unusable
> because of it:
> [ 1390.349462] __do_sys_memfd_create: 5 callbacks suppressed
> [ 1390.349468] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
> [ 1390.350106] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set

OK, thanks, I'll revert this.  Spamming everyone even harder isn't a
good way to get developers to fix their stuff.



* Re: [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-09-02 22:58     ` Andrew Morton
@ 2023-09-04  7:09       ` Aleksa Sarai
  2023-09-05 16:20       ` Florian Weimer
  1 sibling, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-09-04  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Damian Tometzki, Shuah Khan, Jeff Xu, Kees Cook, Daniel Verkamp,
	Christian Brauner, Dominique Martinet, stable, linux-api,
	linux-kernel, linux-mm, linux-kselftest


On 2023-09-02, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 1 Sep 2023 07:13:45 +0200 Damian Tometzki <dtometzki@fedoraproject.org> wrote:
> 
> > >  	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> > > -		pr_warn_once(
> > > +		pr_info_ratelimited(
> > >  			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
> > >  			current->comm, task_pid_nr(current));
> > >  	}
> > > 
> > > -- 
> > > 2.41.0
> > >
> > Hello Sarai,
> > 
> > I get a lot of messages in dmesg with this patch. dmesg is unusable
> > because of it:
> > [ 1390.349462] __do_sys_memfd_create: 5 callbacks suppressed
> > [ 1390.349468] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
> > [ 1390.350106] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
> 
> OK, thanks, I'll revert this.  Spamming everyone even harder isn't a
> good way to get developers to fix their stuff.

Sorry, I'm on vacation. I will send a follow-up patch to remove this
logging entirely -- if we can't do rate-limited logging then logging a
single message effectively at boot time makes no sense. I had hoped that
this wouldn't be too much (given there is a fair amount of INFO-level
spam in the kernel log) but I guess the default ratelimit (a burst of 10
messages per 5 * HZ interval) is too liberal.
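
For context, the ratelimit state behind pr_info_ratelimited() is an
interval/burst window rather than a fixed frequency. The following
userspace sketch mirrors that logic as I understand it (up to `burst`
messages per `interval`, with a suppressed count reported when the next
window opens). struct ratelimit and ratelimit_allow() are names made up
for this example, and time is passed in explicitly to keep it testable:

```c
#include <stdbool.h>

struct ratelimit {
	long interval;	/* window length, in the caller's time unit */
	int burst;	/* messages allowed per window */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed since the last report */
	long begin;	/* start of the current window */
	bool started;	/* whether `begin` has been initialised */
};

/*
 * Returns true if a message may be emitted at time `now`. When a new
 * window opens, *suppressed receives the number of messages dropped
 * since the last report (the "N callbacks suppressed" line in the log
 * above); otherwise it is set to 0.
 */
bool ratelimit_allow(struct ratelimit *rl, long now, int *suppressed)
{
	*suppressed = 0;

	if (!rl->started) {
		rl->begin = now;
		rl->started = true;
	}

	/* The window has expired: report drops and start a new one. */
	if (now > rl->begin + rl->interval) {
		*suppressed = rl->missed;
		rl->missed = 0;
		rl->begin = now;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

With interval = 5 and burst = 10, a caller that loops on memfd_create()
still gets ten lines through every window, which matches the pattern in
the dmesg output reported earlier in the thread.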

Perhaps we can re-consider adding some logging in the future, when more
programs have migrated. The only other "reasonable" way to reduce the
logging would be to add something to task_struct so we only log once per
task, but obviously that's massively overkill.

(FWIW, I don't think the logging was ever necessary. There's nothing
wrong with running an older program that doesn't pass the flags.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


* Re: [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-09-02 22:58     ` Andrew Morton
  2023-09-04  7:09       ` Aleksa Sarai
@ 2023-09-05 16:20       ` Florian Weimer
  2023-09-06  6:58         ` Aleksa Sarai
  1 sibling, 1 reply; 18+ messages in thread
From: Florian Weimer @ 2023-09-05 16:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Damian Tometzki, Aleksa Sarai, Shuah Khan, Jeff Xu, Kees Cook,
	Daniel Verkamp, Christian Brauner, Dominique Martinet, stable,
	linux-api, linux-kernel, linux-mm, linux-kselftest

* Andrew Morton:

> OK, thanks, I'll revert this.  Spamming everyone even harder isn't a
> good way to get developers to fix their stuff.

Is this really buggy userspace?  Are future kernels going to require
some of these flags?

That's going to break lots of applications which use memfd_create to
enable run-time code generation on locked-down systems because it looked
like a stable interface (“don't break userspace” and all that).

Thanks,
Florian



* Re: [PATCH v2 3/5] memfd: improve userspace warnings for missing exec-related flags
  2023-09-05 16:20       ` Florian Weimer
@ 2023-09-06  6:58         ` Aleksa Sarai
  0 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2023-09-06  6:58 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andrew Morton, Damian Tometzki, Shuah Khan, Jeff Xu, Kees Cook,
	Daniel Verkamp, Christian Brauner, Dominique Martinet, stable,
	linux-api, linux-kernel, linux-mm, linux-kselftest


On 2023-09-05, Florian Weimer <fweimer@redhat.com> wrote:
> * Andrew Morton:
> 
> > OK, thanks, I'll revert this.  Spamming everyone even harder isn't a
> > good way to get developers to fix their stuff.
> 
> Is this really buggy userspace?  Are future kernels going to require
> some of these flags?
> 
> That's going to break lots of applications which use memfd_create to
> enable run-time code generation on locked-down systems because it looked
> like a stable interface (“don't break userspace” and all that).

There is no userspace breakage with the current behaviour and obviously
actually requiring these flags to be passed by default would be a pretty
clear userspace breakage and would never be merged.

The original intention (as far as I can tell -- the logging behaviour
came from the original patchset) was to try to incentivise userspace to
start passing the flags so that if distributions decide to set
vm.memfd_noexec=1 as a default setting you won't end up with programs
that _need_ executable memfds (such as container runtimes) crashing
unexpectedly. I also suspect there was an aspect of "well, userspace
*should* be passing these flags after we've introduced them".
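
For programs that do want to pass the flags, the usual pattern is to
request the intended behaviour explicitly and retry without the new flag
if the kernel rejects it with EINVAL (kernels older than v6.3 do not know
MFD_EXEC or MFD_NOEXEC_SEAL). A minimal sketch, where
memfd_create_compat() is a hypothetical helper and not part of any
library:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/memfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback definitions for older kernel headers. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/*
 * Creates a memfd, stating explicitly whether it must be executable.
 * Uses the raw syscall so the sketch also builds against C libraries
 * that lack a memfd_create() wrapper. Returns the fd, or -1 with errno
 * set.
 */
int memfd_create_compat(const char *name, unsigned int flags, int want_exec)
{
	unsigned int exec_flag = want_exec ? MFD_EXEC : MFD_NOEXEC_SEAL;
	int fd = (int)syscall(SYS_memfd_create, name, flags | exec_flag);

	/* Older kernels reject unknown flags: retry the old-style call. */
	if (fd < 0 && errno == EINVAL)
		fd = (int)syscall(SYS_memfd_create, name, flags);
	return fd;
}
```

A container runtime that needs an executable memfd would call
memfd_create_compat("name", MFD_CLOEXEC, 1), while most other users
would pass 0 for want_exec.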

I'm sending a patch to just remove this part of the logging because I
don't think it makes sense if you can't rate-limit it sanely, and
there's probably an argument to be made that it doesn't make sense at
all (at least for the default vm.memfd_noexec=0 setting).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

