linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
@ 2024-04-26 13:33 Stas Sergeev
  2024-04-26 13:33 ` [PATCH v5 1/3] fs: reorganize path_openat() Stas Sergeev
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Stas Sergeev @ 2024-04-26 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche

This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
It is needed to perform an open operation with the creds that were in
effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
flag. This allows the process to pre-open some dirs and switch eUID
(and other UIDs/GIDs) to the less-privileged user, while still retaining
the possibility to open/create files within the pre-opened directory set.

The sand-boxing is security-oriented: symlinks leading outside of a
sand-box are rejected. /proc magic links are rejected. fds opened with
O_CRED_ALLOW are always closed on exec() and cannot be passed via unix
socket.
The more detailed description (including security considerations)
is available in the log messages of individual patches.

Changes in v5:
- rename OA2_INHERIT_CRED to OA2_CRED_INHERIT
- add an "opt-in" flag O_CRED_ALLOW as was suggested by many reviewers
- stop using 64bit types, as suggested by
  Christian Brauner <brauner@kernel.org>
- add BUILD_BUG_ON() for VALID_OPENAT2_FLAGS, based on Christian Brauner's
  comments
- fixed problems reported by patch-testing bot
- made O_CRED_ALLOW fds not passable via unix sockets and exec(),
  based on Christian Brauner's comments

Changes in v4:
- add optimizations suggested by David Laight <David.Laight@ACULAB.COM>
- move security checks to build_open_flags()
- force RESOLVE_NO_MAGICLINKS as suggested by Andy Lutomirski <luto@kernel.org>

Changes in v3:
- partially revert v2 changes to avoid overriding capabilities.
  Only the bare minimum is overridden: fsuid, fsgid and group_info.
  Document the fact the full cred override is unwanted, as it may
  represent an unneeded security risk.

Changes in v2:
- capture full struct cred instead of just fsuid/fsgid.
  Suggested by Stefan Metzmacher <metze@samba.org>

CC: Stefan Metzmacher <metze@samba.org>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Andy Lutomirski <luto@kernel.org>
CC: Christian Brauner <brauner@kernel.org>
CC: Jan Kara <jack@suse.cz>
CC: Jeff Layton <jlayton@kernel.org>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Alexander Aring <alex.aring@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Christian Göttsche <cgzones@googlemail.com>

-- 
2.44.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 1/3] fs: reorganize path_openat()
  2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
@ 2024-04-26 13:33 ` Stas Sergeev
  2024-04-26 13:33 ` [PATCH v5 2/3] open: add O_CRED_ALLOW flag Stas Sergeev
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Stas Sergeev @ 2024-04-26 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche

This patch moves the call to alloc_empty_file() down the if branches.
That changes is needed for the next patch, which adds a cred override
for alloc_empty_file(). The cred override is only needed in one branch,
i.e. it is not needed for O_PATH and O_TMPFILE..

No functional changes are intended by that patch.

Signed-off-by: Stas Sergeev <stsp2@yandex.ru>

CC: Eric Biederman <ebiederm@xmission.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Christian Brauner <brauner@kernel.org>
CC: Jan Kara <jack@suse.cz>
CC: Andy Lutomirski <luto@kernel.org>
CC: David Laight <David.Laight@ACULAB.COM>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
---
 fs/namei.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c5b2a25be7d0..dd50345f7260 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3781,17 +3781,24 @@ static struct file *path_openat(struct nameidata *nd,
 {
 	struct file *file;
 	int error;
+	int open_flags = op->open_flag;
 
-	file = alloc_empty_file(op->open_flag, current_cred());
-	if (IS_ERR(file))
-		return file;
-
-	if (unlikely(file->f_flags & __O_TMPFILE)) {
-		error = do_tmpfile(nd, flags, op, file);
-	} else if (unlikely(file->f_flags & O_PATH)) {
-		error = do_o_path(nd, flags, file);
+	if (unlikely(open_flags & (__O_TMPFILE | O_PATH))) {
+		file = alloc_empty_file(open_flags, current_cred());
+		if (IS_ERR(file))
+			return file;
+		if (open_flags & __O_TMPFILE)
+			error = do_tmpfile(nd, flags, op, file);
+		else
+			error = do_o_path(nd, flags, file);
 	} else {
-		const char *s = path_init(nd, flags);
+		const char *s;
+
+		file = alloc_empty_file(open_flags, current_cred());
+		if (IS_ERR(file))
+			return file;
+
+		s = path_init(nd, flags);
 		while (!(error = link_path_walk(s, nd)) &&
 		       (s = open_last_lookups(nd, file, op)) != NULL)
 			;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 2/3] open: add O_CRED_ALLOW flag
  2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
  2024-04-26 13:33 ` [PATCH v5 1/3] fs: reorganize path_openat() Stas Sergeev
@ 2024-04-26 13:33 ` Stas Sergeev
  2024-04-27  2:12   ` kernel test robot
  2024-04-26 13:33 ` [PATCH v5 3/3] openat2: add OA2_CRED_INHERIT flag Stas Sergeev
  2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
  3 siblings, 1 reply; 22+ messages in thread
From: Stas Sergeev @ 2024-04-26 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche, Arnd Bergmann,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jens Axboe, Kuniyuki Iwashima, Pavel Begunkov, linux-arch,
	netdev

This flag prevents an fd from being passed via unix socket, and
makes it to be always closed on exec().
It is needed for the subsequent OA2_CRED_INHERIT addition, to work
as an "opt-in" for the new cred-inherit functionality. Without using
O_CRED_ALLOW when opening dir fd, it won't be possible to use
OA2_CRED_INHERIT on that dir fd.

Signed-off-by: Stas Sergeev <stsp2@yandex.ru>

CC: Eric Biederman <ebiederm@xmission.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Christian Brauner <brauner@kernel.org>
CC: Jan Kara <jack@suse.cz>
CC: Andy Lutomirski <luto@kernel.org>
CC: David Laight <David.Laight@ACULAB.COM>
CC: Arnd Bergmann <arnd@arndb.de>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Kuniyuki Iwashima <kuniyu@amazon.com>
CC: Pavel Begunkov <asml.silence@gmail.com>
CC: linux-arch@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
---
 fs/fcntl.c                       |  2 +-
 fs/file.c                        | 15 ++++++++-------
 include/linux/fcntl.h            |  2 +-
 include/uapi/asm-generic/fcntl.h |  4 ++++
 net/core/scm.c                   |  5 +++++
 5 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 54cc85d3338e..78c96b1293c2 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1039,7 +1039,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index 3b683b9101d8..2a09d5276676 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -827,22 +827,23 @@ void do_close_on_exec(struct files_struct *files)
 	/* exec unshares first */
 	spin_lock(&files->file_lock);
 	for (i = 0; ; i++) {
+		int j;
 		unsigned long set;
 		unsigned fd = i * BITS_PER_LONG;
 		fdt = files_fdtable(files);
 		if (fd >= fdt->max_fds)
 			break;
 		set = fdt->close_on_exec[i];
-		if (!set)
-			continue;
 		fdt->close_on_exec[i] = 0;
-		for ( ; set ; fd++, set >>= 1) {
-			struct file *file;
-			if (!(set & 1))
-				continue;
-			file = fdt->fd[fd];
+		for (j = 0; j < BITS_PER_LONG; j++, fd++, set >>= 1) {
+			struct file *file = fdt->fd[fd];
 			if (!file)
 				continue;
+			/* Close all cred-allow files. */
+			if (file->f_flags & O_CRED_ALLOW)
+				set |= 1;
+			if (!(set & 1))
+				continue;
 			rcu_assign_pointer(fdt->fd[fd], NULL);
 			__put_unused_fd(files, fd);
 			spin_unlock(&files->file_lock);
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index a332e79b3207..e074ee9c1e36 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CRED_ALLOW)
 
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 80f37a0d40d7..ee8c2267c516 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CRED_ALLOW
+#define O_CRED_ALLOW		040000000
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 
diff --git a/net/core/scm.c b/net/core/scm.c
index 9cd4b0a01cd6..f54fb0ee9727 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -111,6 +111,11 @@ static int scm_fp_copy(struct cmsghdr *cmsg, struct scm_fp_list **fplp)
 			fput(file);
 			return -EINVAL;
 		}
+		/* don't allow files with creds */
+		if (file->f_flags & O_CRED_ALLOW) {
+			fput(file);
+			return -EPERM;
+		}
 		if (unix_get_socket(file))
 			fpl->count_unix++;
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 3/3] openat2: add OA2_CRED_INHERIT flag
  2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
  2024-04-26 13:33 ` [PATCH v5 1/3] fs: reorganize path_openat() Stas Sergeev
  2024-04-26 13:33 ` [PATCH v5 2/3] open: add O_CRED_ALLOW flag Stas Sergeev
@ 2024-04-26 13:33 ` Stas Sergeev
  2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
  3 siblings, 0 replies; 22+ messages in thread
From: Stas Sergeev @ 2024-04-26 13:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: Stas Sergeev, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche

This flag performs the open operation with the fs credentials
(fsuid, fsgid, group_info) that were in effect when dir_fd was opened.
dir_fd must be opened with O_CRED_ALLOW flag for this to work.
This allows the process to pre-open some directories and then
change eUID (and all other UIDs/GIDs) to a less-privileged user,
retaining the ability to open/create files within these directories.

Design goal:
The idea is to provide a very light-weight sandboxing, where the
process, without the use of any heavy-weight techniques like chroot
within namespaces, can restrict the access to the set of pre-opened
directories.
This patch is just a first step to such sandboxing. If things go
well, in the future the same extension can be added to more syscalls.
These should include at least unlinkat(), renameat2() and the
not-yet-upstreamed setxattrat().

Security considerations:
- Only the bare minimal set of credentials is overridden:
  fsuid, fsgid and group_info. The rest, for example capabilities,
  are not overridden to avoid unneeded security risks.
- To avoid sandboxing escape, this patch makes sure the restricted
  lookup modes are used. Namely, RESOLVE_BENEATH or RESOLVE_IN_ROOT.
- Magic /proc symlinks are discarded, as suggested by
  Andy Lutomirski <luto@kernel.org>
- O_CRED_ALLOW fds cannot be passed via unix socket and are always
  closed on exec() to prevent "unsuspecting userspace" from not being
  able to fully drop privs.

Use cases:
Virtual machines that deal with untrusted code, can use that
instead of a more heavy-weighted approaches.
Currently the approach is being tested on a dosemu2 VM.

Signed-off-by: Stas Sergeev <stsp2@yandex.ru>

CC: Stefan Metzmacher <metze@samba.org>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Andy Lutomirski <luto@kernel.org>
CC: Christian Brauner <brauner@kernel.org>
CC: Jan Kara <jack@suse.cz>
CC: Jeff Layton <jlayton@kernel.org>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Alexander Aring <alex.aring@gmail.com>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Christian Göttsche <cgzones@googlemail.com>
---
 fs/fcntl.c                   |  2 ++
 fs/namei.c                   | 56 ++++++++++++++++++++++++++++++++++--
 fs/open.c                    | 10 ++++++-
 include/linux/fcntl.h        |  2 ++
 include/uapi/linux/openat2.h |  2 ++
 5 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 78c96b1293c2..283c2e65fc2c 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1043,6 +1043,8 @@ static int __init fcntl_init(void)
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
+	BUILD_BUG_ON(HWEIGHT32(VALID_OPENAT2_FLAGS) !=
+			HWEIGHT32(VALID_OPEN_FLAGS) + 1);
 
 	fasync_cache = kmem_cache_create("fasync_cache",
 					 sizeof(struct fasync_struct), 0,
diff --git a/fs/namei.c b/fs/namei.c
index dd50345f7260..aa5dcf57851b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3776,6 +3776,43 @@ static int do_o_path(struct nameidata *nd, unsigned flags, struct file *file)
 	return error;
 }
 
+static const struct cred *openat2_init_creds(int dfd)
+{
+	struct cred *cred;
+	struct fd f;
+
+	if (dfd == AT_FDCWD)
+		return ERR_PTR(-EINVAL);
+
+	f = fdget_raw(dfd);
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+
+	cred = ERR_PTR(-EPERM);
+	if (!(f.file->f_flags & O_CRED_ALLOW))
+		goto done;
+
+	cred = prepare_creds();
+	if (!cred) {
+		cred = ERR_PTR(-ENOMEM);
+		goto done;
+	}
+
+	cred->fsuid = f.file->f_cred->fsuid;
+	cred->fsgid = f.file->f_cred->fsgid;
+	cred->group_info = get_group_info(f.file->f_cred->group_info);
+
+done:
+	fdput(f);
+	return cred;
+}
+
+static void openat2_done_creds(const struct cred *cred)
+{
+	put_group_info(cred->group_info);
+	put_cred(cred);
+}
+
 static struct file *path_openat(struct nameidata *nd,
 			const struct open_flags *op, unsigned flags)
 {
@@ -3793,18 +3830,33 @@ static struct file *path_openat(struct nameidata *nd,
 			error = do_o_path(nd, flags, file);
 	} else {
 		const char *s;
+		const struct cred *old_cred = NULL, *cred = NULL;
 
-		file = alloc_empty_file(open_flags, current_cred());
-		if (IS_ERR(file))
+		if (open_flags & OA2_CRED_INHERIT) {
+			cred = openat2_init_creds(nd->dfd);
+			if (IS_ERR(cred))
+				return ERR_CAST(cred);
+		}
+		file = alloc_empty_file(open_flags, cred ?: current_cred());
+		if (IS_ERR(file)) {
+			if (cred)
+				openat2_done_creds(cred);
 			return file;
+		}
 
 		s = path_init(nd, flags);
+		if (cred)
+			old_cred = override_creds(cred);
 		while (!(error = link_path_walk(s, nd)) &&
 		       (s = open_last_lookups(nd, file, op)) != NULL)
 			;
 		if (!error)
 			error = do_open(nd, file, op);
+		if (old_cred)
+			revert_creds(old_cred);
 		terminate_walk(nd);
+		if (cred)
+			openat2_done_creds(cred);
 	}
 	if (likely(!error)) {
 		if (likely(file->f_mode & FMODE_OPENED))
diff --git a/fs/open.c b/fs/open.c
index ee8460c83c77..dd4fab536135 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1225,7 +1225,7 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 	 * values before calling build_open_flags(), but openat2(2) checks all
 	 * of its arguments.
 	 */
-	if (flags & ~VALID_OPEN_FLAGS)
+	if (flags & ~VALID_OPENAT2_FLAGS)
 		return -EINVAL;
 	if (how->resolve & ~VALID_RESOLVE_FLAGS)
 		return -EINVAL;
@@ -1326,6 +1326,14 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		lookup_flags |= LOOKUP_CACHED;
 	}
 
+	if (flags & OA2_CRED_INHERIT) {
+		/* Inherit creds only with scoped look-up modes. */
+		if (!(lookup_flags & LOOKUP_IS_SCOPED))
+			return -EPERM;
+		/* Reject /proc "magic" links if inheriting creds. */
+		lookup_flags |= LOOKUP_NO_MAGICLINKS;
+	}
+
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index e074ee9c1e36..33b9c7ad056b 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -12,6 +12,8 @@
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
 	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CRED_ALLOW)
 
+#define VALID_OPENAT2_FLAGS (VALID_OPEN_FLAGS | OA2_CRED_INHERIT)
+
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index a5feb7604948..f803558ad62f 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -40,4 +40,6 @@ struct open_how {
 					return -EAGAIN if that's not
 					possible. */
 
+#define OA2_CRED_INHERIT		(1UL << 28)
+
 #endif /* _UAPI_LINUX_OPENAT2_H */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 2/3] open: add O_CRED_ALLOW flag
  2024-04-26 13:33 ` [PATCH v5 2/3] open: add O_CRED_ALLOW flag Stas Sergeev
@ 2024-04-27  2:12   ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2024-04-27  2:12 UTC (permalink / raw)
  To: Stas Sergeev, linux-kernel
  Cc: oe-kbuild-all, Stas Sergeev, Stefan Metzmacher, Eric Biederman,
	Alexander Viro, Andy Lutomirski, Christian Brauner, Jan Kara,
	Jeff Layton, Chuck Lever, Alexander Aring, David Laight,
	linux-fsdevel, linux-api, Paolo Bonzini, Christian Göttsche,
	Arnd Bergmann, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jens Axboe, Kuniyuki Iwashima, Pavel Begunkov, linux-arch,
	netdev

Hi Stas,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.9-rc5 next-20240426]
[cannot apply to arnd-asm-generic/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Stas-Sergeev/fs-reorganize-path_openat/20240426-214030
base:   linus/master
patch link:    https://lore.kernel.org/r/20240426133310.1159976-3-stsp2%40yandex.ru
patch subject: [PATCH v5 2/3] open: add O_CRED_ALLOW flag
config: parisc-allnoconfig (https://download.01.org/0day-ci/archive/20240427/202404270923.bAeBIJt1-lkp@intel.com/config)
compiler: hppa-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240427/202404270923.bAeBIJt1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404270923.bAeBIJt1-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   fs/fcntl.c: In function 'fcntl_init':
>> include/linux/compiler_types.h:449:45: error: call to '__compiletime_assert_297' declared with attribute error: BUILD_BUG_ON failed: 22 - 1 != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)
     449 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:430:25: note: in definition of macro '__compiletime_assert'
     430 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:449:9: note: in expansion of macro '_compiletime_assert'
     449 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   fs/fcntl.c:1042:9: note: in expansion of macro 'BUILD_BUG_ON'
    1042 |         BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
         |         ^~~~~~~~~~~~


vim +/__compiletime_assert_297 +449 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  435  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  436  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  437  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  438  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  439  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  440   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  441   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  442   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  443   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  444   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  445   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  446   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  447   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  448  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @449  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  450  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
                   ` (2 preceding siblings ...)
  2024-04-26 13:33 ` [PATCH v5 3/3] openat2: add OA2_CRED_INHERIT flag Stas Sergeev
@ 2024-04-28 16:41 ` Andy Lutomirski
  2024-04-28 17:39   ` stsp
                     ` (2 more replies)
  3 siblings, 3 replies; 22+ messages in thread
From: Andy Lutomirski @ 2024-04-28 16:41 UTC (permalink / raw)
  To: Stas Sergeev, Aleksa Sarai, Serge E. Hallyn
  Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche

> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> It is needed to perform an open operation with the creds that were in
> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> flag. This allows the process to pre-open some dirs and switch eUID
> (and other UIDs/GIDs) to the less-privileged user, while still retaining
> the possibility to open/create files within the pre-opened directory set.
>

I’ve been contemplating this, and I want to propose a different solution.

First, the problem Stas is solving is quite narrow and doesn’t
actually need kernel support: if I want to write a user program that
sandboxes itself, I have at least three solutions already.  I can make
a userns and a mountns; I can use landlock; and I can have a separate
process that brokers filesystem access using SCM_RIGHTS.

But what if I want to run a container, where the container can access
a specific host directory, and the contained application is not aware
of the exact technology being used?  I recently started using
containers in anger in a production setting, and “anger” was
definitely the right word: binding part of a filesystem in is
*miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
actually gave up on making one of my use cases work on a Fedora
system.

Here’s what I wanted to do, logically, in production: pick a host
directory, pick a host *principal* (UID, GID, label, etc), and have
the *entire container* access the directory as that principal. This is
what happens automatically if I run the whole container as a userns
with only a single UID mapped, but I don’t really want to do that for
a whole variety and of reasons.

So maybe reimagining Stas’ feature a bit can actually solve this
problem.  Instead of a special dirfd, what if there was a special
subtree (in the sense of open_tree) that captures a set of creds and
does all opens inside the subtree using those creds?

This isn’t a fully formed proposal, but I *think* it should be
generally fairly safe for even an unprivileged user to clone a subtree
with a specific flag set to do this. Maybe a capability would be
needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
this to a daemon if a privilege is needed, and getting the API right
might be a bit tricky.

Then two different things could be done:

1. The subtree could be used unmounted or via /proc magic links. This
would be for programs that are aware of this interface.

2. The subtree could be mounted, and accessed through the mount would
use the captured creds.

(Hmm. What would a new open_tree() pointing at this special subtree do?)


With all this done, if userspace wired it up, a container user could
do something like:

—bind-capture-creds source=dest

And the contained program would access source *as the user who started
the container*, and this would just work without relabeling or
fiddling with owner uids or gids or ACLs, and it would continue to
work even if the container has multiple dynamically allocated subuids
mapped (e.g. one for “root” and one for the actual application).

Bonus points for the ability to revoke the creds in an already opened
subtree. Or even for the creds to automatically revoke themselves when
the opener exits (or maybe when a specific cred-pinning fd goes away).

(This should work for single files as well as for directories.)

New LSM hooks or extensions of existing hooks might be needed to make
LSMs comfortable with this.

What do you all think?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
@ 2024-04-28 17:39   ` stsp
  2024-04-28 19:15     ` stsp
  2024-04-28 20:19     ` Andy Lutomirski
  2024-04-29  9:12   ` Christian Brauner
  2024-05-06  7:13   ` Aleksa Sarai
  2 siblings, 2 replies; 22+ messages in thread
From: stsp @ 2024-04-28 17:39 UTC (permalink / raw)
  To: Andy Lutomirski, Aleksa Sarai, Serge E. Hallyn
  Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche

28.04.2024 19:41, Andy Lutomirski пишет:
>> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
>> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
>> It is needed to perform an open operation with the creds that were in
>> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
>> flag. This allows the process to pre-open some dirs and switch eUID
>> (and other UIDs/GIDs) to the less-privileged user, while still retaining
>> the possibility to open/create files within the pre-opened directory set.
>>
> Then two different things could be done:
>
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
>
> 2. The subtree could be mounted, and accessed through the mount would
> use the captured creds.
Doesn't this have the same problem
that was pointed to me? Namely (explaining
my impl first), that if someone puts the cred
fd to an unaware process's fd table, such
process can't fully drop its privs. He may not
want to access these dirs, but once its hacked,
the hacker will access these dirs with the
creds came from an outside.
My solution was to close such fds on
exec and disallowing SCM_RIGHTS passage.
SCM_RIGHTS can be allowed in the future,
but the receiver will need to use some
new flag to indicate that he is willing to
get such an fd. Passage via exec() can
probably never be allowed however.

If I understand your model correctly, you
put a magic sub-tree to the fs scope of some
unaware process. He may not want to access
it, but once hacked, the hacker will access
it with the creds from an outside.
And, unlike in my impl, in yours there is
probably no way to prevent that?

In short: my impl confines the hassle within
the single process. It can be extended, and
then the receiver will need to explicitly allow
adding such fds to his fd table.
But your idea seems to inherently require
2 processes, and there is probably no way
for the second process to say "ok, I allow
such sub-tree in my fs scope". And even if
he could, in my impl he can just close the
cred fd, while in yours it seems to persist.

Sorry if I misunderstood your idea.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 17:39   ` stsp
@ 2024-04-28 19:15     ` stsp
  2024-04-28 20:19     ` Andy Lutomirski
  1 sibling, 0 replies; 22+ messages in thread
From: stsp @ 2024-04-28 19:15 UTC (permalink / raw)
  To: Andy Lutomirski, Aleksa Sarai, Serge E. Hallyn
  Cc: linux-kernel, Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, David Laight, linux-fsdevel,
	linux-api, Paolo Bonzini, Christian Göttsche

28.04.2024 20:39, stsp пишет:
> In short: my impl confines the hassle within
> the single process. It can be extended, and
> then the receiver will need to explicitly allow
> adding such fds to his fd table.
> But your idea seems to inherently require
> 2 processes, and there is probably no way
> for the second process to say "ok, I allow
> such sub-tree in my fs scope".
There is probably 1 more detail I had
to make explicit: if my model is extended
to allow SCM_RIGHTS passage by the use
of a new flag on a receiver's side, the kernel
will be able to check that the receiver
currently has the same creds as the sender.
Please note that my model is needed for the
process that does setuid() to a less-privileged
user. He would need to receive the cred fd
_before_ such setuid, he would need to use
the special flag to receive it, and the kernel
will also need to check that he has the same
creds anyway, so no escalation can ever happen.
Only that sequence of events makes the
SCM_RIGHTS passage safe, AFAICT. But if
you don't allow SCM_RIGHTS, as my current
impl does, then you are sure the process
can raise the creds only to his own ones.

Now talking about your sub-tree idea, would
it be possible to check that the process had
initially enough creds to access it w/o inheriting
anything? That part looks important to me,
i.e. the process must not access anything
that he couldn't access before dropping privs.
But I am not sure if your idea even includes
the priv drop.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 17:39   ` stsp
  2024-04-28 19:15     ` stsp
@ 2024-04-28 20:19     ` Andy Lutomirski
  2024-04-28 21:14       ` stsp
  1 sibling, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2024-04-28 20:19 UTC (permalink / raw)
  To: stsp
  Cc: Aleksa Sarai, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

> On Apr 28, 2024, at 10:39 AM, stsp <stsp2@yandex.ru> wrote:
>
> 28.04.2024 19:41, Andy Lutomirski пишет:
>>>> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
>>> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
>>> It is needed to perform an open operation with the creds that were in
>>> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
>>> flag. This allows the process to pre-open some dirs and switch eUID
>>> (and other UIDs/GIDs) to the less-privileged user, while still retaining
>>> the possibility to open/create files within the pre-opened directory set.
>>>
>> Then two different things could be done:
>>
>> 1. The subtree could be used unmounted or via /proc magic links. This
>> would be for programs that are aware of this interface.
>>
>> 2. The subtree could be mounted, and accessed through the mount would
>> use the captured creds.
> Doesn't this have the same problem
> that was pointed to me? Namely (explaining
> my impl first), that if someone puts the cred
> fd to an unaware process's fd table, such
> process can't fully drop its privs. He may not
> want to access these dirs, but once its hacked,
> the hacker will access these dirs with the
> creds came from an outside.

This is not a real problem. If I have a writable fd for /etc/shadow or
an fd for /dev/mem, etc, then I need close them to fully drop privs.

The problem is that, in current kernels, directory fds don’t allow
access using their f_cred, and changing that means that existing
software that does not intend to create a capability will start to do
so.

> My solution was to close such fds on
> exec and disallowing SCM_RIGHTS passage.

I don't see what problem this solves.

> SCM_RIGHTS can be allowed in the future,
> but the receiver will need to use some
> new flag to indicate that he is willing to
> get such an fd. Passage via exec() can
> probably never be allowed however.
>
> If I understand your model correctly, you
> put a magic sub-tree to the fs scope of some
> unaware process.

Only if I want that process to have access!

> He may not want to access
> it, but once hacked, the hacker will access
> it with the creds from an outside.
> And, unlike in my impl, in yours there is
> probably no way to prevent that?

This is fundamental to the whole model. If I stick a FAT formatted USB
drive in the system and mount it, then any process that can find its
way to the mountpoint can write to it.  And if I open a dirfd, any
process with that dirfd can write it.  This is old news and isn't a
problem.


>
> In short: my impl confines the hassle within
> the single process. It can be extended, and
> then the receiver will need to explicitly allow
> adding such fds to his fd table.
> But your idea seems to inherently require
> 2 processes, and there is probably no way
> for the second process to say "ok, I allow
> such sub-tree in my fs scope".

A process could use my proposal to open_tree directory, make a whole
new mountns that is otherwise empty, switch to the mountns, mount the
directory, then change its UID and drop all privs, and still have
access to that one directory.

But even right now, a process could unshare userns and mountns, map
its uid as some nonzero number, rearrange its mountns so it only has
access to that one directory, drop all privs, and seccomp itself, and
only have access to that directory, as it still has the same UID.
Take your pick.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 20:19     ` Andy Lutomirski
@ 2024-04-28 21:14       ` stsp
  2024-04-28 21:30         ` Andy Lutomirski
  0 siblings, 1 reply; 22+ messages in thread
From: stsp @ 2024-04-28 21:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

28.04.2024 23:19, Andy Lutomirski пишет:
>> Doesn't this have the same problem
>> that was pointed to me? Namely (explaining
>> my impl first), that if someone puts the cred
>> fd to an unaware process's fd table, such
>> process can't fully drop its privs. He may not
>> want to access these dirs, but once its hacked,
>> the hacker will access these dirs with the
>> creds came from an outside.
> This is not a real problem. If I have a writable fd for /etc/shadow or
> an fd for /dev/mem, etc, then I need close them to fully drop privs.

But isn't that becoming a problem once
you are (maliciously) passed such fds via
exec() or SCM_RIGHTS? You may not know
about them (or about their creds), so you
won't close them. Or?

> The problem is that, in current kernels, directory fds don’t allow
> access using their f_cred, and changing that means that existing
> software that does not intend to create a capability will start to do
> so.

Which is what I am trying to do. :)

>> My solution was to close such fds on
>> exec and disallowing SCM_RIGHTS passage.
> I don't see what problem this solves.

That the process that received them,
doesn't know they have O_CRED_ALLOW
within. So it won't deduce to close them
in time.
Again, this is not the only possible solution.
The receiver can indicate its will to receive
them anyway, and the kernel can check if
such transaction is safe. But it was simpler
to just disallow, who needs to pass those?

>> SCM_RIGHTS can be allowed in the future,
>> but the receiver will need to use some
>> new flag to indicate that he is willing to
>> get such an fd. Passage via exec() can
>> probably never be allowed however.
>>
>> If I understand your model correctly, you
>> put a magic sub-tree to the fs scope of some
>> unaware process.
> Only if I want that process to have access!

Who is "I" in that context?
Can it be some malicious entity?

>> He may not want to access
>> it, but once hacked, the hacker will access
>> it with the creds from an outside.
>> And, unlike in my impl, in yours there is
>> probably no way to prevent that?
> This is fundamental to the whole model. If I stick a FAT formatted USB
> drive in the system and mount it, then any process that can find its
> way to the mountpoint can write to it.  And if I open a dirfd, any
> process with that dirfd can write it.  This is old news and isn't a
> problem.

But IIRC O_DIRECTORY only allows O_RDONLY.
I even re-checked now, and O_DIRECTORY|O_RDWR
gives EISDIR. So is it actually true that
whoever has dir_fd, can write to it?

>> In short: my impl confines the hassle within
>> the single process. It can be extended, and
>> then the receiver will need to explicitly allow
>> adding such fds to his fd table.
>> But your idea seems to inherently require
>> 2 processes, and there is probably no way
>> for the second process to say "ok, I allow
>> such sub-tree in my fs scope".
> A process could use my proposal to open_tree directory, make a whole
> new mountns that is otherwise empty, switch to the mountns, mount the
> directory, then change its UID and drop all privs, and still have
> access to that one directory.

Ok, if that requires actions that can't
be done after priv drop, then it can
indeed fully drop privs w/o mounting
anything, if he doesn't want such access.
Then the only security-related difference
with my approach is that mine guarantees
nothing new can be accessed, no matter
who passes what. Currently nothing can
be passed at all, but if it can - my approach
would still guarantee only the same creds
can be passed, not a new ones.
Can the same restriction be applied in
your case?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 21:14       ` stsp
@ 2024-04-28 21:30         ` Andy Lutomirski
  2024-04-28 22:12           ` stsp
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2024-04-28 21:30 UTC (permalink / raw)
  To: stsp
  Cc: Aleksa Sarai, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

On Sun, Apr 28, 2024 at 2:15 PM stsp <stsp2@yandex.ru> wrote:
>
> 28.04.2024 23:19, Andy Lutomirski пишет:
> >> Doesn't this have the same problem
> >> that was pointed to me? Namely (explaining
> >> my impl first), that if someone puts the cred
> >> fd to an unaware process's fd table, such
> >> process can't fully drop its privs. He may not
> >> want to access these dirs, but once its hacked,
> >> the hacker will access these dirs with the
> >> creds came from an outside.
> > This is not a real problem. If I have a writable fd for /etc/shadow or
> > an fd for /dev/mem, etc, then I need close them to fully drop privs.
>
> But isn't that becoming a problem once
> you are (maliciously) passed such fds via
> exec() or SCM_RIGHTS? You may not know
> about them (or about their creds), so you
> won't close them. Or?

Wait, who's the malicious party?  Anyone who can open a directory has,
at the time they do so, permission to do so.  If you send that fd to
someone via SCM_RIGHTS, all you accomplish is that they now have the
fd.

In my scenario, the malicious party attacks an *existing* program that
opens an fd for purposes that it doesn't think are dangerous.  And
then it gives the fd *to the malicious program* by whatever means
(could be as simple as dropping privs then doing dlopen).  Then the
malicious program does OA2_INHERIT_CREDS and gets privileges it
shouldn't have.

But if the *whole point* of opening the fd was to capture privileges
and preserve them across a privilege drop, and the program loads
malicious code after dropping privs, then that's a risk that's taken
intentionally.  This is like how, if you do curl
http://whatever.com/foo.sh | bash, you are granting all kinds of
permissions to unknown code.

> >> My solution was to close such fds on
> >> exec and disallowing SCM_RIGHTS passage.
> > I don't see what problem this solves.
>
> That the process that received them,
> doesn't know they have O_CRED_ALLOW
> within. So it won't deduce to close them
> in time.

Hold on -- what exactly are you talking about?  A process does
recvmsg() and doesn't trust the party at the other end.  Then it
doesn't close the received fd.  Then it does setuid(getuid()).  Then
it does dlopen or exec of an untrusted program.

Okay, so the program now has a completely unknown fd.  This is already
part of the thread model.  It could be a cred-capturing fd, it could
be a device node, it could be a socket, it could be a memfd -- it
could be just about anything.  How do any of your proposals or my
proposals cause an actual new problem here?

> > This is fundamental to the whole model. If I stick a FAT formatted USB
> > drive in the system and mount it, then any process that can find its
> > way to the mountpoint can write to it.  And if I open a dirfd, any
> > process with that dirfd can write it.  This is old news and isn't a
> > problem.
>
> But IIRC O_DIRECTORY only allows O_RDONLY.
> I even re-checked now, and O_DIRECTORY|O_RDWR
> gives EISDIR. So is it actually true that
> whoever has dir_fd, can write to it?

If the filesystem grants that UID permission to write, then it can write.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 21:30         ` Andy Lutomirski
@ 2024-04-28 22:12           ` stsp
  2024-04-29  1:12             ` stsp
  0 siblings, 1 reply; 22+ messages in thread
From: stsp @ 2024-04-28 22:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

29.04.2024 00:30, Andy Lutomirski пишет:
> On Sun, Apr 28, 2024 at 2:15 PM stsp <stsp2@yandex.ru> wrote:
>> But isn't that becoming a problem once
>> you are (maliciously) passed such fds via
>> exec() or SCM_RIGHTS? You may not know
>> about them (or about their creds), so you
>> won't close them. Or?
> Wait, who's the malicious party?

Someone who opens an fd with O_CRED_ALLOW
and passes it to an unsuspecting process. This
is at least how I understood the Christian Brauner's
point about "unsuspecting userspace".


>    Anyone who can open a directory has,
> at the time they do so, permission to do so.  If you send that fd to
> someone via SCM_RIGHTS, all you accomplish is that they now have the
> fd.

Normally yes.
But fd with O_CRED_ALLOW prevents the
receiver from fully dropping his privs, even
if he doesn't want to deal with it.

> In my scenario, the malicious party attacks an *existing* program that
> opens an fd for purposes that it doesn't think are dangerous.  And
> then it gives the fd *to the malicious program* by whatever means
> (could be as simple as dropping privs then doing dlopen).  Then the
> malicious program does OA2_INHERIT_CREDS and gets privileges it
> shouldn't have.

But what about an inverse scenario?
Malicious program passes an fd to the
"unaware" program, putting it under a
risk. That program probably never cared
about security, since it doesn't play with
privs. But suddenly it has privs, passed
out of nowhere (via exec() for example),
and someone who hacks it, takes them.

>>>> My solution was to close such fds on
>>>> exec and disallowing SCM_RIGHTS passage.
>>> I don't see what problem this solves.
>> That the process that received them,
>> doesn't know they have O_CRED_ALLOW
>> within. So it won't deduce to close them
>> in time.
> Hold on -- what exactly are you talking about?  A process does
> recvmsg() and doesn't trust the party at the other end.  Then it
> doesn't close the received fd.  Then it does setuid(getuid()).  Then
> it does dlopen or exec of an untrusted program.
>
> Okay, so the program now has a completely unknown fd.  This is already
> part of the thread model.  It could be a cred-capturing fd, it could
> be a device node, it could be a socket, it could be a memfd -- it
> could be just about anything.  How do any of your proposals or my
> proposals cause an actual new problem here?

I am not actually sure how widely
does this spread. I.e. /dev/mem is
restricted these days, but if you can
freely pass device nodes around, then
perhaps the ability to pass an r/o dir fd
that can suddenly give creds, is probably
not something new...
But I really don't like to add to this
particular set of cases. I don't think
its safe, I just think its legacy, so while
it is done that way currently, doesn't
mean I can do the same thing and
call it "secure" just because something
like this was already possible.
Or is this actually completely safe?
Does it hurt to have O_CRED_ALLOW
non-passable?

>>> This is fundamental to the whole model. If I stick a FAT formatted USB
>>> drive in the system and mount it, then any process that can find its
>>> way to the mountpoint can write to it.  And if I open a dirfd, any
>>> process with that dirfd can write it.  This is old news and isn't a
>>> problem.
>> But IIRC O_DIRECTORY only allows O_RDONLY.
>> I even re-checked now, and O_DIRECTORY|O_RDWR
>> gives EISDIR. So is it actually true that
>> whoever has dir_fd, can write to it?
> If the filesystem grants that UID permission to write, then it can write.

Which to me sounds like owning an
O_DIRECTORY fd only gives you the
ability to skip the permission checks
of the outer path components, but not
the inner ones. So passing it w/o O_CRED_ALLOW
was quite safe and didn't give you any
new abilities.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 22:12           ` stsp
@ 2024-04-29  1:12             ` stsp
  0 siblings, 0 replies; 22+ messages in thread
From: stsp @ 2024-04-29  1:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

29.04.2024 01:12, stsp пишет:
> freely pass device nodes around, then
> perhaps the ability to pass an r/o dir fd
> that can suddenly give creds, is probably
> not something new...
Actually there is probably something
new anyway. The process knows to close
all sensitive files before dropping privs.
But this may not be the case with dirs,
because dir_fd pretty much invalidates
itself when you drop privs: you'll get
EPERM from openat() if you no longer
have rights to open the file, and dir_fd
doesn't help.
Which is why someone may neglect
to close dir_fd before dropping privs,
but with O_CRED_ALLOW that would
be a mistake.

Or what about this scenario: receiver
gets this fd thinking its some useful and
harmless file fd that would be needed
even after priv drop. So he makes sure
with F_GETFL that this fd is O_RDONLY
and doesn't close it. But it appears to be
malicious O_CRED_ALLOW dir_fd, which
basically exposes many files for a write.

So I am concerned about the cases where
an fd was not closed before a priv drop,
because the process didn't realize the threat.

> But if the*whole point*  of opening the fd was to capture privileges
> and preserve them across a privilege drop, and the program loads
> malicious code after dropping privs, then that's a risk that's taken
> intentionally.
If you opened an fd yourself, then yes.
You know what have you opened, and
you care to close sensitive fds before
dropping privs, or you take the risk.
But if you requested something from
another process and got some fd as
the result, the kernel doesn't guarantee
you got an fd to what you have requested.
You can get a malicious fd, which is not
"so" possible when you open an fd yourself.
So if you want to keep such an fd open,
you will probably at least make sure its
read-only, its not a device node (with fstat)
and so on.
All checks pass, and oops, O_CRED_ALLOW...

So why my patch makes O_CRED_ALLOW
non-passable? Because the receiver has
no way to see the potential harm within
such fd. So if he intended to keep an fd
open after some basic safety checks, he
will be tricked.
So while I think its a rather bad idea to
leave the received fds open after priv drop,
I don't completely exclude the possibility
that someone did that, together with a few
safety checks which would be tricked by
O_CRED_ALLOW.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
  2024-04-28 17:39   ` stsp
@ 2024-04-29  9:12   ` Christian Brauner
  2024-05-06  7:13   ` Aleksa Sarai
  2 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2024-04-29  9:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stas Sergeev, Aleksa Sarai, Serge E. Hallyn, linux-kernel,
	Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

On Sun, Apr 28, 2024 at 09:41:20AM -0700, Andy Lutomirski wrote:
> > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > It is needed to perform an open operation with the creds that were in
> > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > flag. This allows the process to pre-open some dirs and switch eUID
> > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > the possibility to open/create files within the pre-opened directory set.
> >
> 
> I’ve been contemplating this, and I want to propose a different solution.
> 
> First, the problem Stas is solving is quite narrow and doesn’t
> actually need kernel support: if I want to write a user program that
> sandboxes itself, I have at least three solutions already.  I can make
> a userns and a mountns; I can use landlock; and I can have a separate
> process that brokers filesystem access using SCM_RIGHTS.
> 
> But what if I want to run a container, where the container can access
> a specific host directory, and the contained application is not aware
> of the exact technology being used?  I recently started using
> containers in anger in a production setting, and “anger” was
> definitely the right word: binding part of a filesystem in is
> *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.

Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
move_mount(). I rewrote the bind-mount logic in systemd based on that
and util-linux uses that as well now.
https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> actually gave up on making one of my use cases work on a Fedora
> system.
> 
> Here’s what I wanted to do, logically, in production: pick a host
> directory, pick a host *principal* (UID, GID, label, etc), and have
> the *entire container* access the directory as that principal. This is
> what happens automatically if I run the whole container as a userns
> with only a single UID mapped, but I don’t really want to do that for
> a whole variety and of reasons.

You're describing idmapped mounts for the most part which are upstream
and are used in exactly that way by a lot of userspace.

> 
> So maybe reimagining Stas’ feature a bit can actually solve this
> problem.  Instead of a special dirfd, what if there was a special
> subtree (in the sense of open_tree) that captures a set of creds and
> does all opens inside the subtree using those creds?

That would mean override creds in the VFS layer when accessing a
specific subtree which is a terrible idea imho. Not just because it will
quickly become a potential dos when you do that with a lot of subtrees
it will also have complex interactions with overlayfs.

> 
> This isn’t a fully formed proposal, but I *think* it should be
> generally fairly safe for even an unprivileged user to clone a subtree
> with a specific flag set to do this. Maybe a capability would be
> needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> this to a daemon if a privilege is needed, and getting the API right
> might be a bit tricky.
> 
> Then two different things could be done:
> 
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
> 
> 2. The subtree could be mounted, and accessed through the mount would
> use the captured creds.
> 
> (Hmm. What would a new open_tree() pointing at this special subtree do?)
> 
> 
> With all this done, if userspace wired it up, a container user could
> do something like:
> 
> —bind-capture-creds source=dest
> 
> And the contained program would access source *as the user who started
> the container*, and this would just work without relabeling or
> fiddling with owner uids or gids or ACLs, and it would continue to
> work even if the container has multiple dynamically allocated subuids
> mapped (e.g. one for “root” and one for the actual application).
> 
> Bonus points for the ability to revoke the creds in an already opened
> subtree. Or even for the creds to automatically revoke themselves when
> the opener exits (or maybe when a specific cred-pinning fd goes away).
> 
> (This should work for single files as well as for directories.)
> 
> New LSM hooks or extensions of existing hooks might be needed to make
> LSMs comfortable with this.
> 
> What do you all think?

I think the problem you're describing is already mostly solved.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
  2024-04-28 17:39   ` stsp
  2024-04-29  9:12   ` Christian Brauner
@ 2024-05-06  7:13   ` Aleksa Sarai
  2024-05-06 17:29     ` Andy Lutomirski
  2 siblings, 1 reply; 22+ messages in thread
From: Aleksa Sarai @ 2024-05-06  7:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stas Sergeev, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

[-- Attachment #1: Type: text/plain, Size: 5369 bytes --]

On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > It is needed to perform an open operation with the creds that were in
> > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > flag. This allows the process to pre-open some dirs and switch eUID
> > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > the possibility to open/create files within the pre-opened directory set.
> >
> 
> I’ve been contemplating this, and I want to propose a different solution.
> 
> First, the problem Stas is solving is quite narrow and doesn’t
> actually need kernel support: if I want to write a user program that
> sandboxes itself, I have at least three solutions already.  I can make
> a userns and a mountns; I can use landlock; and I can have a separate
> process that brokers filesystem access using SCM_RIGHTS.
> 
> But what if I want to run a container, where the container can access
> a specific host directory, and the contained application is not aware
> of the exact technology being used?  I recently started using
> containers in anger in a production setting, and “anger” was
> definitely the right word: binding part of a filesystem in is
> *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> actually gave up on making one of my use cases work on a Fedora
> system.
> 
> Here’s what I wanted to do, logically, in production: pick a host
> directory, pick a host *principal* (UID, GID, label, etc), and have
> the *entire container* access the directory as that principal. This is
> what happens automatically if I run the whole container as a userns
> with only a single UID mapped, but I don’t really want to do that for
> a whole variety and of reasons.
> 
> So maybe reimagining Stas’ feature a bit can actually solve this
> problem.  Instead of a special dirfd, what if there was a special
> subtree (in the sense of open_tree) that captures a set of creds and
> does all opens inside the subtree using those creds?
> 
> This isn’t a fully formed proposal, but I *think* it should be
> generally fairly safe for even an unprivileged user to clone a subtree
> with a specific flag set to do this. Maybe a capability would be
> needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> this to a daemon if a privilege is needed, and getting the API right
> might be a bit tricky.

Tying this to an actual mount rather than a file handle sounds like a
more plausible proposal than OA2_CRED_INHERIT, but it just seems that
this is going to re-create all of the work that went into id-mapped
mounts but with the extra-special step of making the generic VFS
permissions no longer work normally (unless the idea is that everything
would pretend to be owned by current_fsuid()?).

IMHO it also isn't enough to just make open work, you need to make all
operations work (which leads to a non-trivial amount of
filesystem-specific handling), which is just idmapped mounts. A lot of
work was put into making sure that is safe, and collapsing owners seems
like it will cause a lot of headaches.

I also find it somewhat amusing that this proposal is to basically give
up on multi-user permissions for this one directory tree because it's
too annoying to deal with. In that case, isn't chmod 777 a simpler
solution? (I'm being a bit flippant, of course there is a difference,
but the net result is that all users in the container would have the
same permissions with all of the fun issues that implies.)

In short, AFAICS idmapped mounts pretty much solve this problem (minus
the ability to collapse users, which I suspect is not a good idea in
general)?

> Then two different things could be done:
> 
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
> 
> 2. The subtree could be mounted, and accessed through the mount would
> use the captured creds.
> 
> (Hmm. What would a new open_tree() pointing at this special subtree do?)
> 
> 
> With all this done, if userspace wired it up, a container user could
> do something like:
> 
> —bind-capture-creds source=dest
> 
> And the contained program would access source *as the user who started
> the container*, and this would just work without relabeling or
> fiddling with owner uids or gids or ACLs, and it would continue to
> work even if the container has multiple dynamically allocated subuids
> mapped (e.g. one for “root” and one for the actual application).
> 
> Bonus points for the ability to revoke the creds in an already opened
> subtree. Or even for the creds to automatically revoke themselves when
> the opener exits (or maybe when a specific cred-pinning fd goes away).
> 
> (This should work for single files as well as for directories.)
> 
> New LSM hooks or extensions of existing hooks might be needed to make
> LSMs comfortable with this.
> 
> What do you all think?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-06  7:13   ` Aleksa Sarai
@ 2024-05-06 17:29     ` Andy Lutomirski
  2024-05-06 17:34       ` Andy Lutomirski
                         ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Andy Lutomirski @ 2024-05-06 17:29 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Stas Sergeev, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote:
> > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > > It is needed to perform an open operation with the creds that were in
> > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > > flag. This allows the process to pre-open some dirs and switch eUID
> > > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > > the possibility to open/create files within the pre-opened directory set.
> > >
> >
> > I’ve been contemplating this, and I want to propose a different solution.
> >
> > First, the problem Stas is solving is quite narrow and doesn’t
> > actually need kernel support: if I want to write a user program that
> > sandboxes itself, I have at least three solutions already.  I can make
> > a userns and a mountns; I can use landlock; and I can have a separate
> > process that brokers filesystem access using SCM_RIGHTS.
> >
> > But what if I want to run a container, where the container can access
> > a specific host directory, and the contained application is not aware
> > of the exact technology being used?  I recently started using
> > containers in anger in a production setting, and “anger” was
> > definitely the right word: binding part of a filesystem in is
> > *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
> > Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> > actually gave up on making one of my use cases work on a Fedora
> > system.
> >
> > Here’s what I wanted to do, logically, in production: pick a host
> > directory, pick a host *principal* (UID, GID, label, etc), and have
> > the *entire container* access the directory as that principal. This is
> > what happens automatically if I run the whole container as a userns
> > with only a single UID mapped, but I don’t really want to do that for
> > a whole variety and of reasons.
> >
> > So maybe reimagining Stas’ feature a bit can actually solve this
> > problem.  Instead of a special dirfd, what if there was a special
> > subtree (in the sense of open_tree) that captures a set of creds and
> > does all opens inside the subtree using those creds?
> >
> > This isn’t a fully formed proposal, but I *think* it should be
> > generally fairly safe for even an unprivileged user to clone a subtree
> > with a specific flag set to do this. Maybe a capability would be
> > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> > this to a daemon if a privilege is needed, and getting the API right
> > might be a bit tricky.
>
> Tying this to an actual mount rather than a file handle sounds like a
> more plausible proposal than OA2_CRED_INHERIT, but it just seems that
> this is going to re-create all of the work that went into id-mapped
> mounts but with the extra-special step of making the generic VFS
> permissions no longer work normally (unless the idea is that everything
> would pretend to be owned by current_fsuid()?).

I was assuming that the owner uid and gid would be show to stat, etc
as usual.  But the permission checks would be done against the
captured creds.

>
> IMHO it also isn't enough to just make open work, you need to make all
> operations work (which leads to a non-trivial amount of
> filesystem-specific handling), which is just idmapped mounts. A lot of
> work was put into making sure that is safe, and collapsing owners seems
> like it will cause a lot of headaches.
>
> I also find it somewhat amusing that this proposal is to basically give
> up on multi-user permissions for this one directory tree because it's
> too annoying to deal with. In that case, isn't chmod 777 a simpler
> solution? (I'm being a bit flippant, of course there is a difference,
> but the net result is that all users in the container would have the
> same permissions with all of the fun issues that implies.)
>
> In short, AFAICS idmapped mounts pretty much solve this problem (minus
> the ability to collapse users, which I suspect is not a good idea in
> general)?
>

With my kernel hat on, maybe I agree.  But with my *user* hat on, I
think I pretty strongly disagree.  Look, idmapis lousy for
unprivileged use:

$ install -m 0700 -d test_directory
$ echo 'hi there' >test_directory/file
$ podman run -it --rm
--mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
# cat /tmp/file
hi there

<-- Hey, look, this kind of works!

# setpriv --reuid=1 ls /tmp
ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!


Obviously this is a made up example.  But it's quite analogous to a
real example.  Suppose I want to make a directory that will contain
some MySQL data.  I don't want to share this directory with anyone
else, so I set its mode to 0700.  Then I want to fire up an
unprivileged MySQL container, so I build or download it, and then I
run it and bind my directory to /var/lib/mysql and I run it.  I don't
need to think about UIDs or anything because it's 2024 and containers
just work.  Okay, I need to setenforce 0 because I'm on Fedora and
SELinux makes absolutely no sense in a container world, but I can live
with that.

Except that it doesn't work!  Because unless I want to manually futz
with the idmaps to get mysql to have access to the directory inside
the container, only *root* gets to get in.  But I bet that even
futzing with the idmap doesn't work, because software like mysql often
expects that root *and* a user can access data.  And some software
even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory.
Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
could learn a way to squash multiple ids down to one.  Or maybe
something like my silly credential-capturing mount proposal could
work.  But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any
opinion as to whether squashing UID is practical or whether there's
any sensible way to specify it in the configuration.

> On Apr 29, 2024, at 2:12 AM, Christian Brauner <brauner@kernel.org> wrote:
>
> Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
> move_mount(). I rewrote the bind-mount logic in systemd based on that
> and util-linux uses that as well now.
> https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
>

Yep, I remember that.

>> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
>> actually gave up on making one of my use cases work on a Fedora
>> system.
>>
>> Here’s what I wanted to do, logically, in production: pick a host
>> directory, pick a host *principal* (UID, GID, label, etc), and have
>> the *entire container* access the directory as that principal. This is
>> what happens automatically if I run the whole container as a userns
>> with only a single UID mapped, but I don’t really want to do that for
>> a whole variety and of reasons.
>
> You're describing idmapped mounts for the most part which are upstream
> and are used in exactly that way by a lot of userspace.
>

See above...

>>
>> So maybe reimagining Stas’ feature a bit can actually solve this
>> problem.  Instead of a special dirfd, what if there was a special
>> subtree (in the sense of open_tree) that captures a set of creds and
>> does all opens inside the subtree using those creds?
>
> That would mean override creds in the VFS layer when accessing a
> specific subtree which is a terrible idea imho. Not just because it will
> quickly become a potential dos when you do that with a lot of subtrees
> it will also have complex interactions with overlayfs.

I was deliberately talking about semantics, not implementation. This
may well be impossible to implement straightforwardly.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-06 17:29     ` Andy Lutomirski
@ 2024-05-06 17:34       ` Andy Lutomirski
  2024-05-06 19:34       ` David Laight
  2024-05-07  7:42       ` Christian Brauner
  2 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2024-05-06 17:34 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Stas Sergeev, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

On Mon, May 6, 2024 at 10:29 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> Replying to a couple emails at once...
>
> On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote:

> > I also find it somewhat amusing that this proposal is to basically give
> > up on multi-user permissions for this one directory tree because it's
> > too annoying to deal with. In that case, isn't chmod 777 a simpler
> > solution? (I'm being a bit flippant, of course there is a difference,
> > but the net result is that all users in the container would have the
> > same permissions with all of the fun issues that implies.)
> >
> > In short, AFAICS idmapped mounts pretty much solve this problem (minus
> > the ability to collapse users, which I suspect is not a good idea in
> > general)?
> >
>
> With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> think I pretty strongly disagree.  Look, idmapis lousy for
> unprivileged use:
>
> $ install -m 0700 -d test_directory
> $ echo 'hi there' >test_directory/file
> $ podman run -it --rm
> --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> # cat /tmp/file
> hi there
>
> <-- Hey, look, this kind of works!
>
> # setpriv --reuid=1 ls /tmp
> ls: cannot open directory '/tmp': Permission denied
>
> <-- Gee, thanks, Linux!

I should add: this is lousy even for privileged use.  On a normal
non-containerized system:

$ ls -ld /var/lib/mysql
drwxr-xr-x. 3 mysql mysql 4096 Sep 20  2023 /var/lib/mysql

This makes perfect sense.

But if I want to run mysql in a container in a sane way, my only real
choice is either to trust the container manager quite strongly (so it
only maps this directory into the correct container) or, if I want to
separate out management of this container into its own UID (which is a
good practice), then I'm forced to do some kind of fragile hack like
making a directory only accessible to the correct UID and then
creating a 0777 directory inside it and bind-mounting *that*.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* RE: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-06 17:29     ` Andy Lutomirski
  2024-05-06 17:34       ` Andy Lutomirski
@ 2024-05-06 19:34       ` David Laight
  2024-05-06 21:53         ` Andy Lutomirski
  2024-05-07  7:42       ` Christian Brauner
  2 siblings, 1 reply; 22+ messages in thread
From: David Laight @ 2024-05-06 19:34 UTC (permalink / raw)
  To: 'Andy Lutomirski', Aleksa Sarai
  Cc: Stas Sergeev, Serge E. Hallyn, linux-kernel, Stefan Metzmacher,
	Eric Biederman, Alexander Viro, Andy Lutomirski,
	Christian Brauner, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, linux-fsdevel, linux-api, Paolo Bonzini,
	Christian Göttsche

...
> So I want a way to give *an entire container* access to a directory.
> Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
> could learn a way to squash multiple ids down to one.  Or maybe
> something like my silly credential-capturing mount proposal could
> work.  But the status quo is not actually amazing IMO.

Isn't that what gids are for :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-06 19:34       ` David Laight
@ 2024-05-06 21:53         ` Andy Lutomirski
  0 siblings, 0 replies; 22+ messages in thread
From: Andy Lutomirski @ 2024-05-06 21:53 UTC (permalink / raw)
  To: David Laight
  Cc: Aleksa Sarai, Stas Sergeev, Serge E. Hallyn, linux-kernel,
	Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Christian Brauner, Jan Kara, Jeff Layton,
	Chuck Lever, Alexander Aring, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

On Mon, May 6, 2024 at 12:35 PM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > So I want a way to give *an entire container* access to a directory.
> > Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
> > could learn a way to squash multiple ids down to one.  Or maybe
> > something like my silly credential-capturing mount proposal could
> > work.  But the status quo is not actually amazing IMO.
>
> Isn't that what gids are for :-)

I dunno.  How, exactly, is a regular non-root user of a Linux computer
supposed to configure gids in their home directory so that a container
(which uses subgids, possibly dynamically allocated) gets access to
the correct thing?  And why should that poor user need to think about
this at all?

--Andy

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-06 17:29     ` Andy Lutomirski
  2024-05-06 17:34       ` Andy Lutomirski
  2024-05-06 19:34       ` David Laight
@ 2024-05-07  7:42       ` Christian Brauner
  2024-05-07 20:38         ` Andy Lutomirski
  2 siblings, 1 reply; 22+ messages in thread
From: Christian Brauner @ 2024-05-07  7:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Stas Sergeev, Serge E. Hallyn, linux-kernel,
	Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

> With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> think I pretty strongly disagree.  Look, idmapis lousy for
> unprivileged use:
> 
> $ install -m 0700 -d test_directory
> $ echo 'hi there' >test_directory/file
> $ podman run -it --rm
> --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]

$ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]

as an unprivileged user doesn't use idmapped mounts at all. So I'm not
sure what this is showing. I suppose you're talking about idmaps in
general.

> # cat /tmp/file
> hi there
> 
> <-- Hey, look, this kind of works!
> 
> # setpriv --reuid=1 ls /tmp
> ls: cannot open directory '/tmp': Permission denied
> 
> <-- Gee, thanks, Linux!
> 
> 
> Obviously this is a made up example.  But it's quite analogous to a
> real example.  Suppose I want to make a directory that will contain
> some MySQL data.  I don't want to share this directory with anyone
> else, so I set its mode to 0700.  Then I want to fire up an
> unprivileged MySQL container, so I build or download it, and then I
> run it and bind my directory to /var/lib/mysql and I run it.  I don't
> need to think about UIDs or anything because it's 2024 and containers
> just work.  Okay, I need to setenforce 0 because I'm on Fedora and
> SELinux makes absolutely no sense in a container world, but I can live
> with that.
> 
> Except that it doesn't work!  Because unless I want to manually futz
> with the idmaps to get mysql to have access to the directory inside
> the container, only *root* gets to get in.  But I bet that even
> futzing with the idmap doesn't work, because software like mysql often
> expects that root *and* a user can access data.  And some software
> even does privilege separation and uses more than one UID.

If the directory is 700 and it's owned by say root:root on the host and
you want to share that with arbitrary container users then this isn't
something you can do today (ignoring group permissions and ACLs for the
sake of your argument) even on the host so that's not a limitation of
userns or idmapped mounts. That means many to one mappings of uids/gids.

> So I want a way to give *an entire container* access to a directory.
> Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
> could learn a way to squash multiple ids down to one.  Or maybe

Many idmappings to one is in principle possible and I've noted that idea
down as a possible extension at
https://github.com/uapi-group/kernel-features quite a while (2 years?) ago.

> I haven't looked at the idmap implementation nearly enough to have any
> opinion as to whether squashing UID is practical or whether there's

It's doable. The interesting bit to me was that if we want to allow
writes we'd need a way to determine what the uid/gid would be to write
down. Imho, that's not super difficult to solve though. The most obvious
one is that userspace can just determine it when creating the idmapped
mount.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-07  7:42       ` Christian Brauner
@ 2024-05-07 20:38         ` Andy Lutomirski
  2024-05-08  7:32           ` Christian Brauner
  0 siblings, 1 reply; 22+ messages in thread
From: Andy Lutomirski @ 2024-05-07 20:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Aleksa Sarai, Stas Sergeev, Serge E. Hallyn, linux-kernel,
	Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> > think I pretty strongly disagree.  Look, idmapis lousy for
> > unprivileged use:
> >
> > $ install -m 0700 -d test_directory
> > $ echo 'hi there' >test_directory/file
> > $ podman run -it --rm
> > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
>
> $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
>
> as an unprivileged user doesn't use idmapped mounts at all. So I'm not
> sure what this is showing. I suppose you're talking about idmaps in
> general.

Meh, fair enough.  But I don't think this would have worked any better
with privilege.

Can idmaps be programmed by an otherwise unprivileged owner of a
userns and a mountns inside?

> Many idmappings to one is in principle possible and I've noted that idea
> down as a possible extension at
> https://github.com/uapi-group/kernel-features quite a while (2 years?) ago.
>
> > I haven't looked at the idmap implementation nearly enough to have any
> > opinion as to whether squashing UID is practical or whether there's
>
> It's doable. The interesting bit to me was that if we want to allow
> writes we'd need a way to determine what the uid/gid would be to write
> down. Imho, that's not super difficult to solve though. The most obvious
> one is that userspace can just determine it when creating the idmapped
> mount.

Seems reasonable to me.  If this is set up by someone unprivileged, it
would need to be that uid/gid.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
  2024-05-07 20:38         ` Andy Lutomirski
@ 2024-05-08  7:32           ` Christian Brauner
  0 siblings, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2024-05-08  7:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Stas Sergeev, Serge E. Hallyn, linux-kernel,
	Stefan Metzmacher, Eric Biederman, Alexander Viro,
	Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever,
	Alexander Aring, David Laight, linux-fsdevel, linux-api,
	Paolo Bonzini, Christian Göttsche

On Tue, May 07, 2024 at 01:38:42PM -0700, Andy Lutomirski wrote:
> On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > > With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> > > think I pretty strongly disagree.  Look, idmapis lousy for
> > > unprivileged use:
> > >
> > > $ install -m 0700 -d test_directory
> > > $ echo 'hi there' >test_directory/file
> > > $ podman run -it --rm
> > > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> >
> > $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> >
> > as an unprivileged user doesn't use idmapped mounts at all. So I'm not
> > sure what this is showing. I suppose you're talking about idmaps in
> > general.
> 
> Meh, fair enough.  But I don't think this would have worked any better
> with privilege.
> 
> Can idmaps be programmed by an otherwise unprivileged owner of a
> userns and a mountns inside?

Yes, but only for userns mountable filesystems that support idmapped
mounts. IOW, you need privilege over the superblock and the idmapping
you're trying to use.

> 
> > Many idmappings to one is in principle possible and I've noted that idea
> > down as a possible extension at
> > https://github.com/uapi-group/kernel-features quite a while (2 years?) ago.
> >
> > > I haven't looked at the idmap implementation nearly enough to have any
> > > opinion as to whether squashing UID is practical or whether there's
> >
> > It's doable. The interesting bit to me was that if we want to allow
> > writes we'd need a way to determine what the uid/gid would be to write
> > down. Imho, that's not super difficult to solve though. The most obvious
> > one is that userspace can just determine it when creating the idmapped
> > mount.
> 
> Seems reasonable to me.  If this is set up by someone unprivileged, it
> would need to be that uid/gid.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-05-08  7:32 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-26 13:33 [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Stas Sergeev
2024-04-26 13:33 ` [PATCH v5 1/3] fs: reorganize path_openat() Stas Sergeev
2024-04-26 13:33 ` [PATCH v5 2/3] open: add O_CRED_ALLOW flag Stas Sergeev
2024-04-27  2:12   ` kernel test robot
2024-04-26 13:33 ` [PATCH v5 3/3] openat2: add OA2_CRED_INHERIT flag Stas Sergeev
2024-04-28 16:41 ` [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2() Andy Lutomirski
2024-04-28 17:39   ` stsp
2024-04-28 19:15     ` stsp
2024-04-28 20:19     ` Andy Lutomirski
2024-04-28 21:14       ` stsp
2024-04-28 21:30         ` Andy Lutomirski
2024-04-28 22:12           ` stsp
2024-04-29  1:12             ` stsp
2024-04-29  9:12   ` Christian Brauner
2024-05-06  7:13   ` Aleksa Sarai
2024-05-06 17:29     ` Andy Lutomirski
2024-05-06 17:34       ` Andy Lutomirski
2024-05-06 19:34       ` David Laight
2024-05-06 21:53         ` Andy Lutomirski
2024-05-07  7:42       ` Christian Brauner
2024-05-07 20:38         ` Andy Lutomirski
2024-05-08  7:32           ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).