All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/25] user_namespace: introduce fsid mappings
@ 2020-02-18 14:33 Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure Christian Brauner
                   ` (28 more replies)
  0 siblings, 29 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Hey everyone,

This is v3 after (off- and online) discussions with Jann the following
changes were made:
- To handle nested user namespaces cleanly, efficiently, and with full
  backwards compatibility for non fsid-mapping aware workloads we only
  allow writing fsid mappings as long as the corresponding id mapping
  type has not been written.
- Split the patch which adds the internal ability in
  kernel/user_namespace to verify and write fsid mappings into tree
  patches:
  1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
     patch to implement core helpers for fsid translations (i.e.
     make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
     k*id_to_kfs*id()
  2. [PATCH v3 05/25] user_namespace: refactor map_write()
     patch to refactor map_write() in order to prepare for actual fsid
     mappings changes in the following patch. (This should make it
     easier to review.)
  3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
     patch to implement actual fsid mappings support in mape_write()
- Let the keyctl infrastructure only operate on kfsid which are always
  mapped/looked up in the id mappings similar to what we do for
  filesystems that have the same superblock visible in multiple user
  namespaces.

This version also comes with minimal tests which I intend to expand in
the future.

From pings and off-list questions and discussions at Google Container
Security Summit there seems to be quite a lot of interest in this
patchset with use-cases ranging from layer sharing for app containers
and k8s, as well as data sharing between containers with different id
mappings. I haven't Cced all people because I don't have all the email
adresses at hand but I've at least added Phil now. :)

This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the container's microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
now also Jann.
Back then we all were quite busy with other work and couldn't really sit
down and implement it. But I took a few days last week to do this work,
including demos and performance testing.
This implementation does not require us to touch the VFS substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.

For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings

The main use case for shiftfs for us is in allowing shared writable
storage to multiple containers using non-overlapping id mappings.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].

People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will lookup fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.

With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.
The whole feature set is placed under a config option that defaults to
false.

In the usual case of running an unprivileged container we will have
setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works, because whenever the kernel needs to do a filesystem
access it will lookup the corresponding uid and gid in the idmapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to lookup a mapping for 300000 in the id mapping
tables of the user namespace but since there is none the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.

With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now lookup the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.

A note on proc (and sys), the proc filesystem is special in sofar as it
only has a single superblock that is (currently but might be about to
change) visible in all user namespaces (same goes for sys). This means
it has special semantics in many ways, including how file ownership and
access works. The fsid mapping implementation does not alter how proc
(and sys) ownership works. proc and sys will both continue to lookup
filesystem access in id mapping tables.

When Writing fsid mappings the same rules apply as when writing id
mappings so I won't reiterate them here. The limit of fs id mappings is
the same as for id mappings, i.e. 340 lines.

# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files to fstat()ing them and calculated the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
 numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n and one
with fsid mappings patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:

             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
	     |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
	     |   	    |                     | and unset for all  | and set for all    |
	     |   	    |    		  | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
 0  mappings |       367 ns |              365 ns |             365 ns |             N/A    |
 1  mappings |       362 ns |              367 ns |             363 ns |             363 ns |
 2  mappings |       361 ns |              369 ns |             363 ns |             364 ns |
 3  mappings |       361 ns |              368 ns |             366 ns |             365 ns |
 5  mappings |       365 ns |              368 ns |             363 ns |             365 ns |
 10 mappings |       391 ns |              388 ns |             387 ns |             389 ns |
 50 mappings |       395 ns |              398 ns |             401 ns |             397 ns |
100 mappings |       400 ns |              405 ns |             399 ns |             399 ns |
200 mappings |       404 ns |              407 ns |             430 ns |             404 ns |
300 mappings |       492 ns |              494 ns |             432 ns |             413 ns |
340 mappings |       495 ns |              497 ns |             500 ns |             484 ns |

# Demos
[1]: Create a container with different id and fsid mappings.
     https://asciinema.org/a/300233 
[2]: Create a container with id mappings but without fsid mappings.
     https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
     mappings.
     https://asciinema.org/a/300235

Thanks!
Christian

Christian Brauner (25):
  user_namespace: introduce fsid mappings infrastructure
  proc: add /proc/<pid>/fsuid_map
  proc: add /proc/<pid>/fsgid_map
  fsuidgid: add fsid mapping helpers
  user_namespace: refactor map_write()
  user_namespace: make map_write() support fsid mappings
  proc: task_state(): use from_kfs{g,u}id_munged
  cred: add kfs{g,u}id
  fs: add is_userns_visible() helper
  namei: may_{o_}create(): handle fsid mappings
  inode: inode_owner_or_capable(): handle fsid mappings
  capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  stat: handle fsid mappings
  open: handle fsid mappings
  posix_acl: handle fsid mappings
  attr: notify_change(): handle fsid mappings
  commoncap: cap_bprm_set_creds(): handle fsid mappings
  commoncap: cap_task_fix_setuid(): handle fsid mappings
  commoncap: handle fsid mappings with vfs caps
  exec: bprm_fill_uid(): handle fsid mappings
  ptrace: adapt ptrace_may_access() to always uses unmapped fsids
  devpts: handle fsid mappings
  keys: handle fsid mappings
  sys: handle fsid mappings in set*id() calls
  selftests: add simple fsid mapping selftests

 fs/attr.c                                     |  23 +-
 fs/devpts/inode.c                             |   7 +-
 fs/exec.c                                     |  25 +-
 fs/inode.c                                    |   7 +-
 fs/namei.c                                    |  36 +-
 fs/open.c                                     |  16 +-
 fs/posix_acl.c                                |  17 +-
 fs/proc/array.c                               |   5 +-
 fs/proc/base.c                                |  34 ++
 fs/stat.c                                     |  48 +-
 include/linux/cred.h                          |   4 +
 include/linux/fs.h                            |   5 +
 include/linux/fsuidgid.h                      | 122 +++++
 include/linux/stat.h                          |   1 +
 include/linux/user_namespace.h                |  10 +
 init/Kconfig                                  |  11 +
 kernel/capability.c                           |  10 +-
 kernel/ptrace.c                               |   4 +-
 kernel/sys.c                                  | 106 +++-
 kernel/user.c                                 |  22 +
 kernel/user_namespace.c                       | 517 ++++++++++++++++--
 security/commoncap.c                          |  35 +-
 security/keys/key.c                           |   2 +-
 security/keys/permission.c                    |   4 +-
 security/keys/process_keys.c                  |   6 +-
 security/keys/request_key.c                   |  10 +-
 security/keys/request_key_auth.c              |   2 +-
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/user_namespace/Makefile |  11 +
 .../selftests/user_namespace/test_fsid_map.c  | 511 +++++++++++++++++
 30 files changed, 1461 insertions(+), 151 deletions(-)
 create mode 100644 include/linux/fsuidgid.h
 create mode 100644 tools/testing/selftests/user_namespace/Makefile
 create mode 100644 tools/testing/selftests/user_namespace/test_fsid_map.c


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
-- 
2.25.0


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19  2:33   ` Serge E. Hallyn
  2020-02-18 14:33 ` [PATCH v3 02/25] proc: add /proc/<pid>/fsuid_map Christian Brauner
                   ` (27 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

This introduces the infrastructure to setup fsid mappings which will be used in
later patches.
All new code depends on CONFIG_USER_NS_FSID=y. It currently defaults to "N".
If CONFIG_USER_NS_FSID is not set, no new code is added.

In this patch fsuid_m_show() and fsgid_m_show() are introduced. They are
identical to uid_m_show() and gid_m_show() until we introduce from_kfsuid() and
from_kfsgid() in a follow-up patch.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Randy Dunlap <rdunlap@infradead.org>:
  - Fix typo in USER_NS_FSID kconfig documentation.

/* v3 */
unchanged
---
 include/linux/user_namespace.h |  10 +++
 init/Kconfig                   |  11 +++
 kernel/user.c                  |  22 ++++++
 kernel/user_namespace.c        | 122 +++++++++++++++++++++++++++++++++
 4 files changed, 165 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6ef1c7109fc4..e44742b0cf8a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -56,6 +56,10 @@ enum ucount_type {
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
+#ifdef CONFIG_USER_NS_FSID
+	struct uid_gid_map	fsuid_map;
+	struct uid_gid_map	fsgid_map;
+#endif
 	struct uid_gid_map	projid_map;
 	atomic_t		count;
 	struct user_namespace	*parent;
@@ -127,6 +131,12 @@ struct seq_operations;
 extern const struct seq_operations proc_uid_seq_operations;
 extern const struct seq_operations proc_gid_seq_operations;
 extern const struct seq_operations proc_projid_seq_operations;
+#ifdef CONFIG_USER_NS_FSID
+extern const struct seq_operations proc_fsuid_seq_operations;
+extern const struct seq_operations proc_fsgid_seq_operations;
+extern ssize_t proc_fsuid_map_write(struct file *, const char __user *, size_t, loff_t *);
+extern ssize_t proc_fsgid_map_write(struct file *, const char __user *, size_t, loff_t *);
+#endif
 extern ssize_t proc_uid_map_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t proc_gid_map_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t, loff_t *);
diff --git a/init/Kconfig b/init/Kconfig
index cfee56c151f1..d4d0beeba48f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1111,6 +1111,17 @@ config USER_NS
 
 	  If unsure, say N.
 
+config USER_NS_FSID
+	bool "User namespace fsid mappings"
+	depends on USER_NS
+	default n
+	help
+	  This allows containers to alter their filesystem id mappings.
+	  With this containers with different id mappings can still share
+	  the same filesystem.
+
+	  If unsure, say N.
+
 config PID_NS
 	bool "PID Namespaces"
 	default y
diff --git a/kernel/user.c b/kernel/user.c
index 5235d7f49982..2ccaea9b810b 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -55,6 +55,28 @@ struct user_namespace init_user_ns = {
 			},
 		},
 	},
+#ifdef CONFIG_USER_NS_FSID
+	.fsuid_map = {
+		.nr_extents = 1,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
+		},
+	},
+	.fsgid_map = {
+		.nr_extents = 1,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
+		},
+	},
+#endif
 	.count = ATOMIC_INIT(3),
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 8eadadc478f9..cbdf456f95f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -191,6 +191,16 @@ static void free_user_ns(struct work_struct *work)
 			kfree(ns->projid_map.forward);
 			kfree(ns->projid_map.reverse);
 		}
+#ifdef CONFIG_USER_NS_FSID
+		if (ns->fsgid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->fsgid_map.forward);
+			kfree(ns->fsgid_map.reverse);
+		}
+		if (ns->fsuid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->fsuid_map.forward);
+			kfree(ns->fsuid_map.reverse);
+		}
+#endif
 		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
@@ -637,6 +647,50 @@ static int projid_m_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_USER_NS_FSID
+static int fsuid_m_show(struct seq_file *seq, void *v)
+{
+	struct user_namespace *ns = seq->private;
+	struct uid_gid_extent *extent = v;
+	struct user_namespace *lower_ns;
+	uid_t lower;
+
+	lower_ns = seq_user_ns(seq);
+	if ((lower_ns == ns) && lower_ns->parent)
+		lower_ns = lower_ns->parent;
+
+	lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
+
+	seq_printf(seq, "%10u %10u %10u\n",
+		extent->first,
+		lower,
+		extent->count);
+
+	return 0;
+}
+
+static int fsgid_m_show(struct seq_file *seq, void *v)
+{
+	struct user_namespace *ns = seq->private;
+	struct uid_gid_extent *extent = v;
+	struct user_namespace *lower_ns;
+	gid_t lower;
+
+	lower_ns = seq_user_ns(seq);
+	if ((lower_ns == ns) && lower_ns->parent)
+		lower_ns = lower_ns->parent;
+
+	lower = from_kgid(lower_ns, KGIDT_INIT(extent->lower_first));
+
+	seq_printf(seq, "%10u %10u %10u\n",
+		extent->first,
+		lower,
+		extent->count);
+
+	return 0;
+}
+#endif
+
 static void *m_start(struct seq_file *seq, loff_t *ppos,
 		     struct uid_gid_map *map)
 {
@@ -674,6 +728,22 @@ static void *projid_m_start(struct seq_file *seq, loff_t *ppos)
 	return m_start(seq, ppos, &ns->projid_map);
 }
 
+#ifdef CONFIG_USER_NS_FSID
+static void *fsuid_m_start(struct seq_file *seq, loff_t *ppos)
+{
+	struct user_namespace *ns = seq->private;
+
+	return m_start(seq, ppos, &ns->fsuid_map);
+}
+
+static void *fsgid_m_start(struct seq_file *seq, loff_t *ppos)
+{
+	struct user_namespace *ns = seq->private;
+
+	return m_start(seq, ppos, &ns->fsgid_map);
+}
+#endif
+
 static void *m_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	(*pos)++;
@@ -706,6 +776,22 @@ const struct seq_operations proc_projid_seq_operations = {
 	.show = projid_m_show,
 };
 
+#ifdef CONFIG_USER_NS_FSID
+const struct seq_operations proc_fsuid_seq_operations = {
+	.start = fsuid_m_start,
+	.stop = m_stop,
+	.next = m_next,
+	.show = fsuid_m_show,
+};
+
+const struct seq_operations proc_fsgid_seq_operations = {
+	.start = fsgid_m_start,
+	.stop = m_stop,
+	.next = m_next,
+	.show = fsgid_m_show,
+};
+#endif
+
 static bool mappings_overlap(struct uid_gid_map *new_map,
 			     struct uid_gid_extent *extent)
 {
@@ -1081,6 +1167,42 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
 			 &ns->projid_map, &ns->parent->projid_map);
 }
 
+#ifdef CONFIG_USER_NS_FSID
+ssize_t proc_fsuid_map_write(struct file *file, const char __user *buf,
+			     size_t size, loff_t *ppos)
+{
+	struct seq_file *seq = file->private_data;
+	struct user_namespace *ns = seq->private;
+	struct user_namespace *seq_ns = seq_user_ns(seq);
+
+	if (!ns->parent)
+		return -EPERM;
+
+	if ((seq_ns != ns) && (seq_ns != ns->parent))
+		return -EPERM;
+
+	return map_write(file, buf, size, ppos, CAP_SETUID, &ns->fsuid_map,
+			 &ns->parent->fsuid_map);
+}
+
+ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
+			     size_t size, loff_t *ppos)
+{
+	struct seq_file *seq = file->private_data;
+	struct user_namespace *ns = seq->private;
+	struct user_namespace *seq_ns = seq_user_ns(seq);
+
+	if (!ns->parent)
+		return -EPERM;
+
+	if ((seq_ns != ns) && (seq_ns != ns->parent))
+		return -EPERM;
+
+	return map_write(file, buf, size, ppos, CAP_SETGID, &ns->fsgid_map,
+			 &ns->parent->fsgid_map);
+}
+#endif
+
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
 				struct uid_gid_map *new_map)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 02/25] proc: add /proc/<pid>/fsuid_map
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19  2:33   ` Serge E. Hallyn
  2020-02-18 14:33 ` [PATCH v3 03/25] proc: add /proc/<pid>/fsgid_map Christian Brauner
                   ` (26 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

The /proc/<pid>/fsuid_map file can be written once to setup an fsuid mapping
for a user namespace. Writing to this file has the same restrictions as writing
to /proc/<pid>/fsuid_map:

root@e1-vm:/# cat /proc/13023/fsuid_map
         0     300000     100000

Fsid mappings have always been around. They are currently always identical to
the id mappings for a user namespace. This means, currently whenever an fsid
needs to be looked up the kernel will use the id mapping of the user namespace.
With the introduction of fsid mappings the kernel will now lookup fsids in the
fsid mappings of the user namespace. If no fsid mapping exists the kernel will
continue looking up fsids in the id mappings of the user namespace. Hence, if a
system supports fsid mappings through /proc/<pid>/fs*id_map and a container
runtime is not aware of fsid mappings it or does not use them it will it will
continue to work just as before.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Fix grammar in commit message.
---
 fs/proc/base.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..5fb28004663e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2970,6 +2970,13 @@ static int proc_projid_map_open(struct inode *inode, struct file *file)
 	return proc_id_map_open(inode, file, &proc_projid_seq_operations);
 }
 
+#ifdef CONFIG_USER_NS_FSID
+static int proc_fsuid_map_open(struct inode *inode, struct file *file)
+{
+	return proc_id_map_open(inode, file, &proc_fsuid_seq_operations);
+}
+#endif
+
 static const struct file_operations proc_uid_map_operations = {
 	.open		= proc_uid_map_open,
 	.write		= proc_uid_map_write,
@@ -2994,6 +3001,16 @@ static const struct file_operations proc_projid_map_operations = {
 	.release	= proc_id_map_release,
 };
 
+#ifdef CONFIG_USER_NS_FSID
+static const struct file_operations proc_fsuid_map_operations = {
+	.open		= proc_fsuid_map_open,
+	.write		= proc_fsuid_map_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_id_map_release,
+};
+#endif
+
 static int proc_setgroups_open(struct inode *inode, struct file *file)
 {
 	struct user_namespace *ns = NULL;
@@ -3176,6 +3193,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
 #endif
 #ifdef CONFIG_USER_NS
+#ifdef CONFIG_USER_NS_FSID
+	REG("fsuid_map",  S_IRUGO|S_IWUSR, proc_fsuid_map_operations),
+#endif
 	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 03/25] proc: add /proc/<pid>/fsgid_map
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 02/25] proc: add /proc/<pid>/fsuid_map Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19  2:33   ` Serge E. Hallyn
  2020-02-18 14:33 ` [PATCH v3 04/25] fsuidgid: add fsid mapping helpers Christian Brauner
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

The /proc/<pid>/fsgid_map file can be written once to setup an fsgid mapping
for a user namespace. Writing to this file has the same restrictions as writing
to /proc/<pid>/fsgid_map.

root@e1-vm:/# cat /proc/13023/fsgid_map
         0     300000     100000

Fsid mappings have always been around. They are currently always identical to
the id mappings for a user namespace. This means, currently whenever an fsid
needs to be looked up the kernel will use the id mapping of the user namespace.
With the introduction of fsid mappings the kernel will now lookup fsids in the
fsid mappings of the user namespace. If no fsid mapping exists the kernel will
continue looking up fsids in the id mappings of the user namespace. Hence, if a
system supports fsid mappings through /proc/<pid>/fs*id_map and a container
runtime is not aware of fsid mappings it or does not use them it will it will
continue to work just as before.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Fix grammar in commit message.
---
 fs/proc/base.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5fb28004663e..1303cdd2e617 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2975,6 +2975,11 @@ static int proc_fsuid_map_open(struct inode *inode, struct file *file)
 {
 	return proc_id_map_open(inode, file, &proc_fsuid_seq_operations);
 }
+
+static int proc_fsgid_map_open(struct inode *inode, struct file *file)
+{
+	return proc_id_map_open(inode, file, &proc_fsgid_seq_operations);
+}
 #endif
 
 static const struct file_operations proc_uid_map_operations = {
@@ -3009,6 +3014,14 @@ static const struct file_operations proc_fsuid_map_operations = {
 	.llseek		= seq_lseek,
 	.release	= proc_id_map_release,
 };
+
+static const struct file_operations proc_fsgid_map_operations = {
+	.open		= proc_fsgid_map_open,
+	.write		= proc_fsgid_map_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_id_map_release,
+};
 #endif
 
 static int proc_setgroups_open(struct inode *inode, struct file *file)
@@ -3195,6 +3208,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_USER_NS
 #ifdef CONFIG_USER_NS_FSID
 	REG("fsuid_map",  S_IRUGO|S_IWUSR, proc_fsuid_map_operations),
+	REG("fsgid_map",  S_IRUGO|S_IWUSR, proc_fsgid_map_operations),
 #endif
 	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (2 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 03/25] proc: add /proc/<pid>/fsgid_map Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 05/25] user_namespace: refactor map_write() Christian Brauner
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

This adds a set of helpers to translate between kfsuid/kfsgid and their
userspace fsuid/fsgid counter parts relative to a given user namespace.

- kuid_t make_kfsuid(struct user_namespace *from, uid_t fsuid)
  Maps a user-namespace fsuid pair into a kfsuid.
  If no fsuid mappings have been written it behaves identical to calling
  make_kuid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- kgid_t make_kfsgid(struct user_namespace *from, gid_t fsgid)
  Maps a user-namespace fsgid pair into a kfsgid.
  If no fsgid mappings have been written it behaves identical to calling
  make_kgid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- uid_t from_kfsuid(struct user_namespace *to, kuid_t fsuid)
  Creates a fsuid from a kfsuid user-namespace pair if possible.
  If no fsuid mappings have been written it behaves identical to calling
  from_kuid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- gid_t from_kfsgid(struct user_namespace *to, kgid_t fsgid)
  Creates a fsgid from a kfsgid user-namespace pair if possible.
  If no fsgid mappings have been written it behaves identical to calling
  make_kgid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- uid_t from_kfsuid_munged(struct user_namespace *to, kuid_t fsuid)
  Always creates a fsuid from a kfsuid user-namespace pair.
  If no fsuid mappings have been written it behaves identical to calling
  from_kuid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- gid_t from_kfsgid_munged(struct user_namespace *to, kgid_t fsgid)
  Always creates a fsgid from a kfsgid user-namespace pair if possible.
  If no fsgid mappings have been written it behaves identical to calling
  make_kgid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- bool kfsuid_has_mapping(struct user_namespace *ns, kuid_t uid)
  Check whether this kfsuid has a mapping in the provided user namespace.
  If no fsuid mappings have been written it behaves identical to calling
  from_kuid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- bool kfsgid_has_mapping(struct user_namespace *ns, kgid_t gid)
  Check whether this kfsgid has a mapping in the provided user namespace.
  If no fsgid mappings have been written it behaves identical to calling
  make_kgid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- kuid_t kfsuid_to_kuid(struct user_namespace *to, kuid_t kfsuid)
  Translate from a kfsuid into a kuid.

- kgid_t kfsgid_to_kgid(struct user_namespace *to, kgid_t kfsgid)
  Translate from a kfsgid into a kgid.

- kuid_t kuid_to_kfsuid(struct user_namespace *to, kuid_t kuid)
  Translate from a kuid into a kfsuid.

- kgid_t kgid_to_kfsuid(struct user_namespace *to, kgid_t kgid)
  Translate from a kgid into a kfsgid.

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
- Jann Horn <jannh@google.com>:
  - Split changes to map_write() to implement fsid mappings into three separate
    patches: basic fsid helpers, preparatory changes to map_write(), actual
    fsid mapping support in map_write().
---
 include/linux/fsuidgid.h | 122 +++++++++++++++++++++++++++++++++
 kernel/user_namespace.c  | 141 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 261 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/fsuidgid.h

diff --git a/include/linux/fsuidgid.h b/include/linux/fsuidgid.h
new file mode 100644
index 000000000000..46763591f4e6
--- /dev/null
+++ b/include/linux/fsuidgid.h
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FSUIDGID_H
+#define _LINUX_FSUIDGID_H
+
+#include <linux/uidgid.h>
+
+#ifdef CONFIG_USER_NS_FSID
+
+extern kuid_t make_kfsuid(struct user_namespace *from, uid_t fsuid);
+extern kgid_t make_kfsgid(struct user_namespace *from, gid_t fsgid);
+extern uid_t from_kfsuid(struct user_namespace *to, kuid_t kfsuid);
+extern gid_t from_kfsgid(struct user_namespace *to, kgid_t kfsgid);
+extern uid_t from_kfsuid_munged(struct user_namespace *to, kuid_t kfsuid);
+extern gid_t from_kfsgid_munged(struct user_namespace *to, kgid_t kfsgid);
+
+static inline bool kfsuid_has_mapping(struct user_namespace *ns, kuid_t kfsuid)
+{
+	return from_kfsuid(ns, kfsuid) != (uid_t) -1;
+}
+
+static inline bool kfsgid_has_mapping(struct user_namespace *ns, kgid_t kfsgid)
+{
+	return from_kfsgid(ns, kfsgid) != (gid_t) -1;
+}
+
+static inline kuid_t kfsuid_to_kuid(struct user_namespace *to, kuid_t kfsuid)
+{
+	uid_t fsuid = from_kfsuid(to, kfsuid);
+	if (fsuid == (uid_t) -1)
+		return INVALID_UID;
+	return make_kuid(to, fsuid);
+}
+
+static inline kgid_t kfsgid_to_kgid(struct user_namespace *to, kgid_t kfsgid)
+{
+	gid_t fsgid = from_kfsgid(to, kfsgid);
+	if (fsgid == (gid_t) -1)
+		return INVALID_GID;
+	return make_kgid(to, fsgid);
+}
+
+static inline kuid_t kuid_to_kfsuid(struct user_namespace *to, kuid_t kuid)
+{
+	uid_t uid = from_kuid(to, kuid);
+	if (uid == (uid_t) -1)
+		return INVALID_UID;
+	return make_kfsuid(to, uid);
+}
+
+static inline kgid_t kgid_to_kfsgid(struct user_namespace *to, kgid_t kgid)
+{
+	gid_t gid = from_kgid(to, kgid);
+	if (gid == (gid_t) -1)
+		return INVALID_GID;
+	return make_kfsgid(to, gid);
+}
+
+#else
+
+static inline kuid_t make_kfsuid(struct user_namespace *from, uid_t fsuid)
+{
+	return make_kuid(from, fsuid);
+}
+
+static inline kgid_t make_kfsgid(struct user_namespace *from, gid_t fsgid)
+{
+	return make_kgid(from, fsgid);
+}
+
+static inline uid_t from_kfsuid(struct user_namespace *to, kuid_t kfsuid)
+{
+	return from_kuid(to, kfsuid);
+}
+
+static inline gid_t from_kfsgid(struct user_namespace *to, kgid_t kfsgid)
+{
+	return from_kgid(to, kfsgid);
+}
+
+static inline uid_t from_kfsuid_munged(struct user_namespace *to, kuid_t kfsuid)
+{
+	return from_kuid_munged(to, kfsuid);
+}
+
+static inline gid_t from_kfsgid_munged(struct user_namespace *to, kgid_t kfsgid)
+{
+	return from_kgid_munged(to, kfsgid);
+}
+
+static inline bool kfsuid_has_mapping(struct user_namespace *ns, kuid_t kfsuid)
+{
+	return kuid_has_mapping(ns, kfsuid);
+}
+
+static inline bool kfsgid_has_mapping(struct user_namespace *ns, kgid_t kfsgid)
+{
+	return kgid_has_mapping(ns, kfsgid);
+}
+
+static inline kuid_t kfsuid_to_kuid(struct user_namespace *to, kuid_t kfsuid)
+{
+	return kfsuid;
+}
+
+static inline kgid_t kfsgid_to_kgid(struct user_namespace *to, kgid_t kfsgid)
+{
+	return kfsgid;
+}
+
+static inline kuid_t kuid_to_kfsuid(struct user_namespace *to, kuid_t kuid)
+{
+	return kuid;
+}
+
+static inline kgid_t kgid_to_kfsgid(struct user_namespace *to, kgid_t kgid)
+{
+	return kgid;
+}
+
+#endif /* CONFIG_USER_NS_FSID */
+
+#endif /* _LINUX_FSUIDGID_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index cbdf456f95f0..2cfd1e519cc4 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -20,6 +20,7 @@
 #include <linux/fs_struct.h>
 #include <linux/bsearch.h>
 #include <linux/sort.h>
+#include <linux/fsuidgid.h>
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
 static DEFINE_MUTEX(userns_state_mutex);
@@ -583,6 +584,142 @@ projid_t from_kprojid_munged(struct user_namespace *targ, kprojid_t kprojid)
 }
 EXPORT_SYMBOL(from_kprojid_munged);
 
+#ifdef CONFIG_USER_NS_FSID
+/**
+ *	make_kfsuid - Map a user-namespace fsuid pair into a kuid.
+ *	@ns:  User namespace that the fsuid is in
+ *	@fsuid: User identifier
+ *
+ *	Maps a user-namespace fsuid pair into a kernel internal kfsuid,
+ *	and returns that kfsuid.
+ *
+ *	When there is no mapping defined for the user-namespace kfsuid
+ *	pair INVALID_UID is returned.  Callers are expected to test
+ *	for and handle INVALID_UID being returned.  INVALID_UID
+ *	may be tested for using uid_valid().
+ */
+kuid_t make_kfsuid(struct user_namespace *ns, uid_t fsuid)
+{
+	/* Map the fsuid to a global kernel fsuid */
+	return KUIDT_INIT(map_id_down(&ns->fsuid_map, fsuid));
+}
+EXPORT_SYMBOL(make_kfsuid);
+
+/**
+ *	from_kfsuid - Create a fsuid from a kfsuid user-namespace pair.
+ *	@targ: The user namespace we want a fsuid in.
+ *	@kfsuid: The kernel internal fsuid to start with.
+ *
+ *	Map @kfsuid into the user-namespace specified by @targ and
+ *	return the resulting fsuid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	If @kfsuid has no mapping in @targ (uid_t)-1 is returned.
+ */
+uid_t from_kfsuid(struct user_namespace *targ, kuid_t kfsuid)
+{
+	/* Map the fsuid from a global kernel fsuid */
+	return map_id_up(&targ->fsuid_map, __kuid_val(kfsuid));
+}
+EXPORT_SYMBOL(from_kfsuid);
+
+/**
+ *	from_kfsuid_munged - Create a fsuid from a kfsuid user-namespace pair.
+ *	@targ: The user namespace we want a fsuid in.
+ *	@kfsuid: The kernel internal fsuid to start with.
+ *
+ *	Map @kfsuid into the user-namespace specified by @targ and
+ *	return the resulting fsuid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	Unlike from_kfsuid from_kfsuid_munged never fails and always
+ *	returns a valid fsuid.  This makes from_kfsuid_munged appropriate
+ *	for use in syscalls like stat and getuid where failing the
+ *	system call and failing to provide a valid fsuid are not an
+ *	options.
+ *
+ *	If @kfsuid has no mapping in @targ overflowuid is returned.
+ */
+uid_t from_kfsuid_munged(struct user_namespace *targ, kuid_t kfsuid)
+{
+	uid_t fsuid;
+	fsuid = from_kfsuid(targ, kfsuid);
+
+	if (fsuid == (uid_t) -1)
+		fsuid = overflowuid;
+	return fsuid;
+}
+EXPORT_SYMBOL(from_kfsuid_munged);
+
+/**
+ *	make_kfsgid - Map a user-namespace fsgid pair into a kfsgid.
+ *	@ns:  User namespace that the fsgid is in
+ *	@fsgid: User identifier
+ *
+ *	Maps a user-namespace fsgid pair into a kernel internal kfsgid,
+ *	and returns that kfsgid.
+ *
+ *	When there is no mapping defined for the user-namespace fsgid
+ *	pair INVALID_GID is returned.  Callers are expected to test
+ *	for and handle INVALID_GID being returned.  INVALID_GID
+ *	may be tested for using gid_valid().
+ */
+kgid_t make_kfsgid(struct user_namespace *ns, gid_t fsgid)
+{
+	/* Map the fsgid to a global kernel fsgid */
+	return KGIDT_INIT(map_id_down(&ns->fsgid_map, fsgid));
+}
+EXPORT_SYMBOL(make_kfsgid);
+
+/**
+ *	from_kfsgid - Create a fsgid from a kfsgid user-namespace pair.
+ *	@targ: The user namespace we want a fsgid in.
+ *	@kfsgid: The kernel internal fsgid to start with.
+ *
+ *	Map @kfsgid into the user-namespace specified by @targ and
+ *	return the resulting fsgid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	If @kfsgid has no mapping in @targ (gid_t)-1 is returned.
+ */
+gid_t from_kfsgid(struct user_namespace *targ, kgid_t kfsgid)
+{
+	/* Map the fsgid from a global kernel fsgid */
+	return map_id_up(&targ->fsgid_map, __kgid_val(kfsgid));
+}
+EXPORT_SYMBOL(from_kfsgid);
+
+/**
+ *	from_kfsgid_munged - Create a fsgid from a kfsgid user-namespace pair.
+ *	@targ: The user namespace we want a fsgid in.
+ *	@kfsgid: The kernel internal fsgid to start with.
+ *
+ *	Map @kfsgid into the user-namespace specified by @targ and
+ *	return the resulting fsgid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	Unlike from_kfsgid from_kfsgid_munged never fails and always
+ *	returns a valid fsgid.  This makes from_kfsgid_munged appropriate
+ *	for use in syscalls like stat and getgid where failing the
+ *	system call and failing to provide a valid fsgid are not options.
+ *
+ *	If @kfsgid has no mapping in @targ overflowgid is returned.
+ */
+gid_t from_kfsgid_munged(struct user_namespace *targ, kgid_t kfsgid)
+{
+	gid_t fsgid;
+	fsgid = from_kfsgid(targ, kfsgid);
+
+	if (fsgid == (gid_t) -1)
+		fsgid = overflowgid;
+	return fsgid;
+}
+EXPORT_SYMBOL(from_kfsgid_munged);
+#endif /* CONFIG_USER_NS_FSID */
 
 static int uid_m_show(struct seq_file *seq, void *v)
 {
@@ -659,7 +796,7 @@ static int fsuid_m_show(struct seq_file *seq, void *v)
 	if ((lower_ns == ns) && lower_ns->parent)
 		lower_ns = lower_ns->parent;
 
-	lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
+	lower = from_kfsuid(lower_ns, KUIDT_INIT(extent->lower_first));
 
 	seq_printf(seq, "%10u %10u %10u\n",
 		extent->first,
@@ -680,7 +817,7 @@ static int fsgid_m_show(struct seq_file *seq, void *v)
 	if ((lower_ns == ns) && lower_ns->parent)
 		lower_ns = lower_ns->parent;
 
-	lower = from_kgid(lower_ns, KGIDT_INIT(extent->lower_first));
+	lower = from_kfsgid(lower_ns, KGIDT_INIT(extent->lower_first));
 
 	seq_printf(seq, "%10u %10u %10u\n",
 		extent->first,
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 05/25] user_namespace: refactor map_write()
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (3 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 04/25] fsuidgid: add fsid mapping helpers Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19  2:35   ` Serge E. Hallyn
  2020-02-18 14:33 ` [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings Christian Brauner
                   ` (23 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Refactor map_write() to prepare for adding fsid mappings support. This mainly
factors out various open-coded parts into helpers that can be reused in the
follow up patch.

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch not present

/* v3 */
patch added
- Jann Horn <jannh@google.com>:
  - Split changes to map_write() to implement fsid mappings into three separate
    patches: basic fsid helpers, preparatory changes to map_write(), actual
    fsid mapping support in map_write().
---
 kernel/user_namespace.c | 117 +++++++++++++++++++++++++---------------
 1 file changed, 74 insertions(+), 43 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 2cfd1e519cc4..e91141262bcc 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1038,10 +1038,10 @@ static int cmp_extents_reverse(const void *a, const void *b)
 }
 
 /**
- * sort_idmaps - Sorts an array of idmap entries.
+ * sort_map - Sorts an array of idmap entries.
  * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
  */
-static int sort_idmaps(struct uid_gid_map *map)
+static int sort_map(struct uid_gid_map *map)
 {
 	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
 		return 0;
@@ -1064,6 +1064,71 @@ static int sort_idmaps(struct uid_gid_map *map)
 	return 0;
 }
 
+static int sort_idmaps(struct uid_gid_map *map)
+{
+	return sort_map(map);
+}
+
+static int map_from_parent(struct uid_gid_map *new_map,
+			   struct uid_gid_map *parent_map)
+{
+	unsigned idx;
+
+	/* Map the lower ids from the parent user namespace to the
+	 * kernel global id space.
+	 */
+	for (idx = 0; idx < new_map->nr_extents; idx++) {
+		struct uid_gid_extent *e;
+		u32 lower_first;
+
+		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+			e = &new_map->extent[idx];
+		else
+			e = &new_map->forward[idx];
+
+		lower_first = map_id_range_down(parent_map, e->lower_first, e->count);
+
+		/* Fail if we can not map the specified extent to
+		 * the kernel global id space.
+		 */
+		if (lower_first == (u32)-1)
+			return -EPERM;
+
+		e->lower_first = lower_first;
+	}
+
+	return 0;
+}
+
+static int map_into_kids(struct uid_gid_map *id_map,
+			 struct uid_gid_map *parent_id_map)
+{
+	return map_from_parent(id_map, parent_id_map);
+}
+
+static void install_idmaps(struct uid_gid_map *id_map,
+			   struct uid_gid_map *new_id_map)
+{
+	if (new_id_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
+		memcpy(id_map->extent, new_id_map->extent,
+		       new_id_map->nr_extents * sizeof(new_id_map->extent[0]));
+	} else {
+		id_map->forward = new_id_map->forward;
+		id_map->reverse = new_id_map->reverse;
+	}
+}
+
+static void free_idmaps(struct uid_gid_map *new_id_map)
+{
+	if (new_id_map->nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+		kfree(new_id_map->forward);
+		kfree(new_id_map->reverse);
+		new_id_map->forward = NULL;
+		new_id_map->reverse = NULL;
+		new_id_map->nr_extents = 0;
+	}
+}
+
 static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
@@ -1073,7 +1138,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	struct seq_file *seq = file->private_data;
 	struct user_namespace *ns = seq->private;
 	struct uid_gid_map new_map;
-	unsigned idx;
 	struct uid_gid_extent extent;
 	char *kbuf = NULL, *pos, *next_line;
 	ssize_t ret;
@@ -1191,61 +1255,28 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
 		goto out;
 
-	ret = -EPERM;
-	/* Map the lower ids from the parent user namespace to the
-	 * kernel global id space.
-	 */
-	for (idx = 0; idx < new_map.nr_extents; idx++) {
-		struct uid_gid_extent *e;
-		u32 lower_first;
-
-		if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-			e = &new_map.extent[idx];
-		else
-			e = &new_map.forward[idx];
-
-		lower_first = map_id_range_down(parent_map,
-						e->lower_first,
-						e->count);
-
-		/* Fail if we can not map the specified extent to
-		 * the kernel global id space.
-		 */
-		if (lower_first == (u32) -1)
-			goto out;
-
-		e->lower_first = lower_first;
-	}
+	ret = map_into_kids(&new_map, parent_map);
+	if (ret)
+		goto out;
 
 	/*
 	 * If we want to use binary search for lookup, this clones the extent
 	 * array and sorts both copies.
 	 */
 	ret = sort_idmaps(&new_map);
-	if (ret < 0)
+	if (ret)
 		goto out;
 
 	/* Install the map */
-	if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
-		memcpy(map->extent, new_map.extent,
-		       new_map.nr_extents * sizeof(new_map.extent[0]));
-	} else {
-		map->forward = new_map.forward;
-		map->reverse = new_map.reverse;
-	}
+	install_idmaps(map, &new_map);
 	smp_wmb();
 	map->nr_extents = new_map.nr_extents;
 
 	*ppos = count;
 	ret = count;
 out:
-	if (ret < 0 && new_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
-		kfree(new_map.forward);
-		kfree(new_map.reverse);
-		map->forward = NULL;
-		map->reverse = NULL;
-		map->nr_extents = 0;
-	}
+	if (ret < 0)
+		free_idmaps(&new_map);
 
 	mutex_unlock(&userns_state_mutex);
 	kfree(kbuf);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (4 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 05/25] user_namespace: refactor map_write() Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19 16:18   ` Jann Horn
  2020-02-18 14:33 ` [PATCH v3 07/25] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Based on discussions with Jann we decided in order to cleanly handle nested
user namespaces that fsid mappings can only be written before the corresponding
id mappings have been written. Writing id mappings before writing the
corresponding fsid mappings causes fsid mappings to mirror id mappings.

Consider creating a user namespace NS1 with the initial user namespace as
parent. Assume NS1 receives id mapping 0 100000 100000 and fsid mappings 0
300000 100000. Files that root in NS1 will create will map to kfsuid=300000 and
kfsgid=300000 and will hence be owned by uid=300000 and gid 300000 on-disk in
the initial user namespace.
Now assume user namespace NS2 is created in user namespace NS1. Assume that NS2
receives id mapping 0 10000 65536 and an fsid mapping of 0 10000 65536. Files
that root in NS2 will create will map to kfsuid=10000 and kfsgid=10000 in NS1.
hence, files created by NS2 will hence be appear to be be owned by uid=10000
and gid=10000 on-disk in NS1. Looking at the initial user namespace, files
created by NS2 will map to kfsuid=310000 and kfsgid=310000 and hence will be
owned by uid=310000 and gid=310000 on-disk.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch not present

/* v3 */
patch added
- Jann Horn <jannh@google.com>:
  - Split changes to map_write() to implement fsid mappings into three separate
    patches: basic fsid helpers, preparatory changes to map_write(), actual
    fsid mapping support in map_write().
---
 kernel/user_namespace.c | 165 ++++++++++++++++++++++++++++++++++------
 1 file changed, 143 insertions(+), 22 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e91141262bcc..7905ca19dfab 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -25,9 +25,18 @@
 static struct kmem_cache *user_ns_cachep __read_mostly;
 static DEFINE_MUTEX(userns_state_mutex);
 
+enum idmap_type {
+	UID_MAP,
+	GID_MAP,
+	FSUID_MAP,
+	FSGID_MAP,
+	PROJID_MAP,
+};
+
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
-				struct uid_gid_map *map);
+				struct uid_gid_map *map,
+				enum idmap_type idmap_type);
 static void free_user_ns(struct work_struct *work);
 
 static struct ucounts *inc_user_namespaces(struct user_namespace *ns, kuid_t uid)
@@ -913,6 +922,16 @@ const struct seq_operations proc_projid_seq_operations = {
 	.show = projid_m_show,
 };
 
+static inline bool idmap_exists(const struct uid_gid_map *map)
+{
+	return map && map->nr_extents != 0;
+}
+
+static inline bool idmap_type_wants_fsidmap(enum idmap_type type)
+{
+	return type == UID_MAP || type == GID_MAP;
+}
+
 #ifdef CONFIG_USER_NS_FSID
 const struct seq_operations proc_fsuid_seq_operations = {
 	.start = fsuid_m_start,
@@ -927,6 +946,31 @@ const struct seq_operations proc_fsgid_seq_operations = {
 	.next = m_next,
 	.show = fsgid_m_show,
 };
+
+static int idmap_to_fsidmap(struct uid_gid_map *id_map,
+			    struct uid_gid_map *fsid_map,
+			    struct uid_gid_map *new_fsid_map,
+			    enum idmap_type type)
+{
+	if (!idmap_type_wants_fsidmap(type) || idmap_exists(fsid_map))
+		return 0;
+
+	/* fsid maps mirror id maps. */
+	if (id_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
+		memcpy(new_fsid_map, id_map, sizeof(struct uid_gid_map));
+		return 0;
+	}
+
+	memset(new_fsid_map, 0, sizeof(struct uid_gid_map));
+	new_fsid_map->forward = kmemdup(id_map->forward,
+			id_map->nr_extents * sizeof(struct uid_gid_extent),
+			GFP_KERNEL);
+	if (!new_fsid_map->forward)
+		return -ENOMEM;
+	new_fsid_map->nr_extents = id_map->nr_extents;
+
+	return 0;
+}
 #endif
 
 static bool mappings_overlap(struct uid_gid_map *new_map,
@@ -1064,9 +1108,17 @@ static int sort_map(struct uid_gid_map *map)
 	return 0;
 }
 
-static int sort_idmaps(struct uid_gid_map *map)
+static int sort_idmaps(struct uid_gid_map *map,
+		       struct uid_gid_map *new_fsid_map)
 {
-	return sort_map(map);
+	int ret;
+
+	ret = sort_map(map);
+	if (ret)
+		return ret;
+
+	/* Sort fsid maps in case they mirror id maps. */
+	return sort_map(new_fsid_map);
 }
 
 static int map_from_parent(struct uid_gid_map *new_map,
@@ -1101,13 +1153,31 @@ static int map_from_parent(struct uid_gid_map *new_map,
 }
 
 static int map_into_kids(struct uid_gid_map *id_map,
-			 struct uid_gid_map *parent_id_map)
+			 struct uid_gid_map *parent_id_map,
+			 struct user_namespace *ns,
+			 struct uid_gid_map *new_fsid_map, enum idmap_type type)
 {
-	return map_from_parent(id_map, parent_id_map);
+	int ret;
+
+	ret = map_from_parent(id_map, parent_id_map);
+	if (ret)
+		return ret;
+
+#ifdef CONFIG_USER_NS_FSID
+	/* fsid maps mirror id maps. */
+	if (idmap_type_wants_fsidmap(type) && idmap_exists(new_fsid_map))
+		ret = map_from_parent(new_fsid_map,
+				      type == UID_MAP ? &ns->parent->fsuid_map :
+							&ns->parent->fsgid_map);
+#endif
+	return ret;
 }
 
 static void install_idmaps(struct uid_gid_map *id_map,
-			   struct uid_gid_map *new_id_map)
+			   struct uid_gid_map *new_id_map,
+			   struct uid_gid_map *fsid_map,
+			   struct uid_gid_map *new_fsid_map,
+			   enum idmap_type type)
 {
 	if (new_id_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
 		memcpy(id_map->extent, new_id_map->extent,
@@ -1116,9 +1186,21 @@ static void install_idmaps(struct uid_gid_map *id_map,
 		id_map->forward = new_id_map->forward;
 		id_map->reverse = new_id_map->reverse;
 	}
+
+	if (idmap_type_wants_fsidmap(type) && idmap_exists(new_fsid_map)) {
+		/* fsid maps mirror id maps. */
+		if (new_fsid_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
+			memcpy(fsid_map->extent, new_fsid_map->extent,
+			       new_fsid_map->nr_extents * sizeof(new_fsid_map->extent[0]));
+		} else {
+			fsid_map->forward = new_fsid_map->forward;
+			fsid_map->reverse = new_fsid_map->reverse;
+		}
+	}
 }
 
-static void free_idmaps(struct uid_gid_map *new_id_map)
+static void free_idmaps(struct uid_gid_map *new_id_map,
+			struct uid_gid_map *new_fsid_map)
 {
 	if (new_id_map->nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
 		kfree(new_id_map->forward);
@@ -1127,17 +1209,28 @@ static void free_idmaps(struct uid_gid_map *new_id_map)
 		new_id_map->reverse = NULL;
 		new_id_map->nr_extents = 0;
 	}
+
+	/* fsid maps mirror id maps. */
+	if (new_fsid_map->nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+		kfree(new_fsid_map->forward);
+		kfree(new_fsid_map->reverse);
+		new_fsid_map->forward = NULL;
+		new_fsid_map->reverse = NULL;
+		new_fsid_map->nr_extents = 0;
+	}
 }
 
 static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
 			 struct uid_gid_map *map,
-			 struct uid_gid_map *parent_map)
+			 struct uid_gid_map *parent_map,
+			 enum idmap_type type)
 {
 	struct seq_file *seq = file->private_data;
 	struct user_namespace *ns = seq->private;
-	struct uid_gid_map new_map;
+	struct uid_gid_map *fsid_map = NULL;
+	struct uid_gid_map new_map, new_fsid_map;
 	struct uid_gid_extent extent;
 	char *kbuf = NULL, *pos, *next_line;
 	ssize_t ret;
@@ -1173,6 +1266,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	mutex_lock(&userns_state_mutex);
 
 	memset(&new_map, 0, sizeof(struct uid_gid_map));
+	new_fsid_map.nr_extents = 0;
 
 	ret = -EPERM;
 	/* Only allow one successful write to the map */
@@ -1252,10 +1346,21 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 
 	ret = -EPERM;
 	/* Validate the user is allowed to use user id's mapped to. */
-	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+	if (!new_idmap_permitted(file, ns, cap_setid, &new_map, type))
+		goto out;
+
+#ifdef CONFIG_USER_NS_FSID
+	/* Take pointer to fsid maps in case we're mirroring id maps. */
+	if (type == UID_MAP)
+		fsid_map = &ns->fsuid_map;
+	else if (type == GID_MAP)
+		fsid_map = &ns->fsgid_map;
+	ret = idmap_to_fsidmap(&new_map, fsid_map, &new_fsid_map, type);
+	if (ret)
 		goto out;
+#endif
 
-	ret = map_into_kids(&new_map, parent_map);
+	ret = map_into_kids(&new_map, parent_map, ns, &new_fsid_map, type);
 	if (ret)
 		goto out;
 
@@ -1263,20 +1368,22 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	 * If we want to use binary search for lookup, this clones the extent
 	 * array and sorts both copies.
 	 */
-	ret = sort_idmaps(&new_map);
+	ret = sort_idmaps(&new_map, &new_fsid_map);
 	if (ret)
 		goto out;
 
 	/* Install the map */
-	install_idmaps(map, &new_map);
+	install_idmaps(map, &new_map, fsid_map, &new_fsid_map, type);
 	smp_wmb();
 	map->nr_extents = new_map.nr_extents;
+	if (idmap_exists(&new_fsid_map))
+		fsid_map->nr_extents = new_fsid_map.nr_extents;
 
 	*ppos = count;
 	ret = count;
 out:
 	if (ret < 0)
-		free_idmaps(&new_map);
+		free_idmaps(&new_map, &new_fsid_map);
 
 	mutex_unlock(&userns_state_mutex);
 	kfree(kbuf);
@@ -1297,7 +1404,7 @@ ssize_t proc_uid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETUID,
-			 &ns->uid_map, &ns->parent->uid_map);
+			 &ns->uid_map, &ns->parent->uid_map, UID_MAP);
 }
 
 ssize_t proc_gid_map_write(struct file *file, const char __user *buf,
@@ -1314,7 +1421,7 @@ ssize_t proc_gid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETGID,
-			 &ns->gid_map, &ns->parent->gid_map);
+			 &ns->gid_map, &ns->parent->gid_map, GID_MAP);
 }
 
 ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
@@ -1332,7 +1439,7 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
 
 	/* Anyone can set any valid project id no capability needed */
 	return map_write(file, buf, size, ppos, -1,
-			 &ns->projid_map, &ns->parent->projid_map);
+			 &ns->projid_map, &ns->parent->projid_map, PROJID_MAP);
 }
 
 #ifdef CONFIG_USER_NS_FSID
@@ -1350,7 +1457,7 @@ ssize_t proc_fsuid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETUID, &ns->fsuid_map,
-			 &ns->parent->fsuid_map);
+			 &ns->parent->fsuid_map, FSUID_MAP);
 }
 
 ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
@@ -1367,15 +1474,25 @@ ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETGID, &ns->fsgid_map,
-			 &ns->parent->fsgid_map);
+			 &ns->parent->fsgid_map, FSGID_MAP);
 }
 #endif
 
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
-				struct uid_gid_map *new_map)
+				struct uid_gid_map *new_map,
+				enum idmap_type idmap_type)
 {
 	const struct cred *cred = file->f_cred;
+
+	/* Don't allow writing fsuid maps when uid maps have been written. */
+	if (idmap_type == FSUID_MAP && idmap_exists(&ns->uid_map))
+		return false;
+
+	/* Don't allow writing fsgid maps when gid maps have been written. */
+	if (idmap_type == FSGID_MAP && idmap_exists(&ns->gid_map))
+		return false;
+
 	/* Don't allow mappings that would allow anything that wouldn't
 	 * be allowed without the establishment of unprivileged mappings.
 	 */
@@ -1383,11 +1500,15 @@ static bool new_idmap_permitted(const struct file *file,
 	    uid_eq(ns->owner, cred->euid)) {
 		u32 id = new_map->extent[0].lower_first;
 		if (cap_setid == CAP_SETUID) {
-			kuid_t uid = make_kuid(ns->parent, id);
+			kuid_t uid = idmap_type == FSUID_MAP ?
+					     make_kfsuid(ns->parent, id) :
+					     make_kuid(ns->parent, id);
 			if (uid_eq(uid, cred->euid))
 				return true;
 		} else if (cap_setid == CAP_SETGID) {
-			kgid_t gid = make_kgid(ns->parent, id);
+			kgid_t gid = idmap_type == FSGID_MAP ?
+					     make_kfsgid(ns->parent, id) :
+					     make_kgid(ns->parent, id);
 			if (!(ns->flags & USERNS_SETGROUPS_ALLOWED) &&
 			    gid_eq(gid, cred->egid))
 				return true;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 07/25] proc: task_state(): use from_kfs{g,u}id_munged
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (5 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19  2:36   ` Serge E. Hallyn
  2020-02-18 14:33 ` [PATCH v3 08/25] cred: add kfs{g,u}id Christian Brauner
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

If fsid mappings have been written, this will cause proc to look at fsid
mappings for the user namespace. If no fsid mappings have been written the
behavior is as before.

Here is part of the output from /proc/<pid>/status from the initial user
namespace for systemd running in an unprivileged container as user namespace
root with id mapping 0 100000 100000 and fsid mapping 0 300000 100000:

Name:   systemd
Umask:  0000
State:  S (sleeping)
Tgid:   13023
Ngid:   0
Pid:    13023
PPid:   13008
TracerPid:      0
Uid:    100000  100000  100000  300000
Gid:    100000  100000  100000  300000
FDSize: 64
Groups:

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 fs/proc/array.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 5efaf3708ec6..d4a04f85a67e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -91,6 +91,7 @@
 #include <linux/string_helpers.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/fsuidgid.h>
 
 #include <asm/pgtable.h>
 #include <asm/processor.h>
@@ -193,11 +194,11 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	seq_put_decimal_ull(m, "\nUid:\t", from_kuid_munged(user_ns, cred->uid));
 	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->euid));
 	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->suid));
-	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->fsuid));
+	seq_put_decimal_ull(m, "\t", from_kfsuid_munged(user_ns, cred->fsuid));
 	seq_put_decimal_ull(m, "\nGid:\t", from_kgid_munged(user_ns, cred->gid));
 	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->egid));
 	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->sgid));
-	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->fsgid));
+	seq_put_decimal_ull(m, "\t", from_kfsgid_munged(user_ns, cred->fsgid));
 	seq_put_decimal_ull(m, "\nFDSize:\t", max_fds);
 
 	seq_puts(m, "\nGroups:\t");
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 08/25] cred: add kfs{g,u}id
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (6 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 07/25] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 09/25] fs: add is_userns_visible() helper Christian Brauner
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

After the introduction of fsid mappings we need to carefully handle
single-superblock filesystems that are visible in user namespaces. This
specifically concerns proc and sysfs. For those filesystems we want to continue
looking up fsid in the id mappings of the relevant user namespace. We can
either do this by dynamically translating between these fsids or we simply keep
them around with the other creds. The latter option is not just simpler but
also more performant since we don't need to do the translation from fsid
mappings into id mappings on the fly.

Link: https://lore.kernel.org/r/20200212145149.zohmc6d3x52bw6j6@wittgenstein
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch added

/* v3 */
unchanged
---
 include/linux/cred.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..604914d3fd51 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -125,6 +125,8 @@ struct cred {
 	kgid_t		egid;		/* effective GID of the task */
 	kuid_t		fsuid;		/* UID for VFS ops */
 	kgid_t		fsgid;		/* GID for VFS ops */
+	kuid_t		kfsuid;		/* UID for VFS ops for userns visible filesystems */
+	kgid_t		kfsgid;		/* GID for VFS ops for userns visible filesystems */
 	unsigned	securebits;	/* SUID-less security management */
 	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
 	kernel_cap_t	cap_permitted;	/* caps we're permitted */
@@ -384,6 +386,8 @@ static inline void put_cred(const struct cred *_cred)
 #define current_sgid()		(current_cred_xxx(sgid))
 #define current_fsuid() 	(current_cred_xxx(fsuid))
 #define current_fsgid() 	(current_cred_xxx(fsgid))
+#define current_kfsuid() 	(current_cred_xxx(kfsuid))
+#define current_kfsgid() 	(current_cred_xxx(kfsgid))
 #define current_cap()		(current_cred_xxx(cap_effective))
 #define current_user()		(current_cred_xxx(user))
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 09/25] fs: add is_userns_visible() helper
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (7 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 08/25] cred: add kfs{g,u}id Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-19  2:42   ` Serge E. Hallyn
  2020-02-18 14:33 ` [PATCH v3 10/25] namei: may_{o_}create(): handle fsid mappings Christian Brauner
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Introduce a helper which makes it possible to detect fileystems whose
superblock is visible in multiple user namespace. This currently only
means proc and sys. Such filesystems usually have special semantics so their
behavior will not be changed with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 include/linux/fs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..fdc8fb2d786b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3651,4 +3651,9 @@ static inline int inode_drain_writes(struct inode *inode)
 	return filemap_write_and_wait(inode->i_mapping);
 }
 
+static inline bool is_userns_visible(unsigned long iflags)
+{
+	return (iflags & SB_I_USERNS_VISIBLE);
+}
+
 #endif /* _LINUX_FS_H */
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 10/25] namei: may_{o_}create(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (8 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 09/25] fs: add is_userns_visible() helper Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 11/25] inode: inode_owner_or_capable(): " Christian Brauner
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch may_{o_}create() to lookup fsids in the fsid mappings. If no fsid
mappings are setup the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Jann Horn <jannh@google.com>:
  - Ensure that the correct fsid is used when dealing with userns visible
    filesystems like proc.

/* v3 */
unchanged
---
 fs/namei.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index db6565c99825..c5b014000f13 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -39,6 +39,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/fsuidgid.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -287,6 +288,13 @@ static int check_acl(struct inode *inode, int mask)
 	return -EAGAIN;
 }
 
+static inline kuid_t get_current_fsuid(const struct inode *inode)
+{
+	if (is_userns_visible(inode->i_sb->s_iflags))
+		return current_kfsuid();
+	return current_fsuid();
+}
+
 /*
  * This does the basic permission checking
  */
@@ -294,7 +302,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 {
 	unsigned int mode = inode->i_mode;
 
-	if (likely(uid_eq(current_fsuid(), inode->i_uid)))
+	if (likely(uid_eq(get_current_fsuid(inode), inode->i_uid)))
 		mode >>= 6;
 	else {
 		if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
@@ -980,7 +988,7 @@ static inline int may_follow_link(struct nameidata *nd)
 
 	/* Allowed if owner and follower match. */
 	inode = nd->link_inode;
-	if (uid_eq(current_cred()->fsuid, inode->i_uid))
+	if (uid_eq(get_current_fsuid(inode), inode->i_uid))
 		return 0;
 
 	/* Allowed if parent directory not sticky and world-writable. */
@@ -1097,7 +1105,7 @@ static int may_create_in_sticky(umode_t dir_mode, kuid_t dir_uid,
 	    (!sysctl_protected_regular && S_ISREG(inode->i_mode)) ||
 	    likely(!(dir_mode & S_ISVTX)) ||
 	    uid_eq(inode->i_uid, dir_uid) ||
-	    uid_eq(current_fsuid(), inode->i_uid))
+	    uid_eq(get_current_fsuid(inode), inode->i_uid))
 		return 0;
 
 	if (likely(dir_mode & 0002) ||
@@ -2832,7 +2840,7 @@ EXPORT_SYMBOL(kern_path_mountpoint);
 
 int __check_sticky(struct inode *dir, struct inode *inode)
 {
-	kuid_t fsuid = current_fsuid();
+	kuid_t fsuid = get_current_fsuid(inode);
 
 	if (uid_eq(inode->i_uid, fsuid))
 		return 0;
@@ -2902,6 +2910,20 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 	return 0;
 }
 
+static bool fsid_has_mapping(struct user_namespace *ns, struct super_block *sb)
+{
+	if (is_userns_visible(sb->s_iflags)) {
+		if (!kuid_has_mapping(ns, current_kfsuid()) ||
+		    !kgid_has_mapping(ns, current_kfsgid()))
+			return false;
+	} else if (!kfsuid_has_mapping(ns, current_fsuid()) ||
+		   !kfsgid_has_mapping(ns, current_fsgid())) {
+		return false;
+	}
+
+	return true;
+}
+
 /*	Check whether we can create an object with dentry child in directory
  *  dir.
  *  1. We can't do it if child already exists (open has special treatment for
@@ -2920,8 +2942,7 @@ static inline int may_create(struct inode *dir, struct dentry *child)
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
 	s_user_ns = dir->i_sb->s_user_ns;
-	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
-	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+	if (!fsid_has_mapping(s_user_ns, dir->i_sb))
 		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
@@ -3103,8 +3124,7 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m
 		return error;
 
 	s_user_ns = dir->dentry->d_sb->s_user_ns;
-	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
-	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+	if (!fsid_has_mapping(s_user_ns, dir->dentry->d_sb))
 		return -EOVERFLOW;
 
 	error = inode_permission(dir->dentry->d_inode, MAY_WRITE | MAY_EXEC);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 11/25] inode: inode_owner_or_capable(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (9 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 10/25] namei: may_{o_}create(): handle fsid mappings Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-18 22:25   ` Christoph Hellwig
  2020-02-18 14:33 ` [PATCH v3 12/25] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch inode_owner_or_capable() to lookup fsids in the fsid mappings. If no
fsid mappings are setup the behavior is unchanged, i.e. fsids are looked up in
the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 fs/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..81d7a30b381d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
 #include <linux/ratelimit.h>
 #include <linux/list_lru.h>
 #include <linux/iversion.h>
+#include <linux/fsuidgid.h>
 #include <trace/events/writeback.h>
 #include "internal.h"
 
@@ -2087,8 +2088,12 @@ bool inode_owner_or_capable(const struct inode *inode)
 		return true;
 
 	ns = current_user_ns();
-	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+	if (is_userns_visible(inode->i_sb->s_iflags)) {
+		if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+			return true;
+	} else if (kfsuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER)) {
 		return true;
+	}
 	return false;
 }
 EXPORT_SYMBOL(inode_owner_or_capable);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 12/25] capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (10 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 11/25] inode: inode_owner_or_capable(): " Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-18 14:33 ` [PATCH v3 13/25] stat: " Christian Brauner
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch privileged_wrt_inode_uidgid() to lookup fsids in the fsid mappings. If
no fsid mappings are setup the behavior is unchanged, i.e. fsids are looked up
in the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 kernel/capability.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..2b0c1dc992e2 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -19,6 +19,8 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/uaccess.h>
+#include <linux/fsuidgid.h>
+#include <linux/fs.h>
 
 /*
  * Leveraged for setting/resetting capabilities
@@ -486,8 +488,12 @@ EXPORT_SYMBOL(file_ns_capable);
  */
 bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	if (is_userns_visible(inode->i_sb->s_iflags))
+		return kuid_has_mapping(ns, inode->i_uid) &&
+		       kgid_has_mapping(ns, inode->i_gid);
+
+	return kfsuid_has_mapping(ns, inode->i_uid) &&
+	       kfsgid_has_mapping(ns, inode->i_gid);
 }
 
 /**
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 13/25] stat: handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (11 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 12/25] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
@ 2020-02-18 14:33 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 14/25] open: " Christian Brauner
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:33 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch attribute functions looking up fsids to them up in the fsid mappings. If
no fsid mappings are setup the behavior is unchanged, i.e. fsids are looked up
in the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
- Tycho Andersen <tycho@tycho.ws>:
  - Replace , with = when converting to uid and gid in cp_new_stat64().
---
 fs/stat.c            | 48 +++++++++++++++++++++++++++++++++++---------
 include/linux/stat.h |  1 +
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..612714ebacb4 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -10,6 +10,7 @@
 #include <linux/errno.h>
 #include <linux/file.h>
 #include <linux/highuid.h>
+#include <linux/fsuidgid.h>
 #include <linux/fs.h>
 #include <linux/namei.h>
 #include <linux/security.h>
@@ -79,6 +80,8 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	stat->userns_visible = is_userns_visible(inode->i_sb->s_iflags);
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
@@ -239,8 +242,13 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
 	tmp.st_nlink = stat->nlink;
 	if (tmp.st_nlink != stat->nlink)
 		return -EOVERFLOW;
-	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
-	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	if (stat->userns_visible) {
+		SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	} else {
+		SET_UID(tmp.st_uid, from_kfsuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kfsgid_munged(current_user_ns(), stat->gid));
+	}
 	tmp.st_rdev = old_encode_dev(stat->rdev);
 #if BITS_PER_LONG == 32
 	if (stat->size > MAX_NON_LFS)
@@ -327,8 +335,13 @@ static int cp_new_stat(struct kstat *stat, struct stat __user *statbuf)
 	tmp.st_nlink = stat->nlink;
 	if (tmp.st_nlink != stat->nlink)
 		return -EOVERFLOW;
-	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
-	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	if (stat->userns_visible) {
+		SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	} else {
+		SET_UID(tmp.st_uid, from_kfsuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kfsgid_munged(current_user_ns(), stat->gid));
+	}
 	tmp.st_rdev = encode_dev(stat->rdev);
 	tmp.st_size = stat->size;
 	tmp.st_atime = stat->atime.tv_sec;
@@ -471,8 +484,13 @@ static long cp_new_stat64(struct kstat *stat, struct stat64 __user *statbuf)
 #endif
 	tmp.st_mode = stat->mode;
 	tmp.st_nlink = stat->nlink;
-	tmp.st_uid = from_kuid_munged(current_user_ns(), stat->uid);
-	tmp.st_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	if (stat->userns_visible) {
+		tmp.st_uid = from_kuid_munged(current_user_ns(), stat->uid);
+		tmp.st_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	} else {
+		tmp.st_uid = from_kfsuid_munged(current_user_ns(), stat->uid);
+		tmp.st_gid = from_kfsgid_munged(current_user_ns(), stat->gid);
+	}
 	tmp.st_atime = stat->atime.tv_sec;
 	tmp.st_atime_nsec = stat->atime.tv_nsec;
 	tmp.st_mtime = stat->mtime.tv_sec;
@@ -544,8 +562,13 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_blksize = stat->blksize;
 	tmp.stx_attributes = stat->attributes;
 	tmp.stx_nlink = stat->nlink;
-	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
-	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	if (stat->userns_visible) {
+		tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
+		tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	} else {
+		tmp.stx_uid = from_kfsuid_munged(current_user_ns(), stat->uid);
+		tmp.stx_gid = from_kfsgid_munged(current_user_ns(), stat->gid);
+	}
 	tmp.stx_mode = stat->mode;
 	tmp.stx_ino = stat->ino;
 	tmp.stx_size = stat->size;
@@ -615,8 +638,13 @@ static int cp_compat_stat(struct kstat *stat, struct compat_stat __user *ubuf)
 	tmp.st_nlink = stat->nlink;
 	if (tmp.st_nlink != stat->nlink)
 		return -EOVERFLOW;
-	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
-	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	if (stat->userns_visible) {
+		SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	} else {
+		SET_UID(tmp.st_uid, from_kfsuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kfsgid_munged(current_user_ns(), stat->gid));
+	}
 	tmp.st_rdev = old_encode_dev(stat->rdev);
 	if ((u64) stat->size > MAX_NON_LFS)
 		return -EOVERFLOW;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 528c4baad091..e6d4ba73a970 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -47,6 +47,7 @@ struct kstat {
 	struct timespec64 ctime;
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
+	bool		userns_visible;
 };
 
 #endif
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 14/25] open: handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (12 preceding siblings ...)
  2020-02-18 14:33 ` [PATCH v3 13/25] stat: " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 15/25] posix_acl: " Christian Brauner
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Let chown_common() lookup fsids in the fsid mappings. If no fsid mappings are
setup the behavior is unchanged, i.e. fsids are looked up in the id mappings.
do_faccessat() just needs to translate from real ids into fsids.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - handle faccessat() too

/* v3 */
unchanged
---
 fs/open.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..4e092845728f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -32,6 +32,7 @@
 #include <linux/ima.h>
 #include <linux/dnotify.h>
 #include <linux/compat.h>
+#include <linux/fsuidgid.h>
 
 #include "internal.h"
 
@@ -361,8 +362,10 @@ long do_faccessat(int dfd, const char __user *filename, int mode)
 	if (!override_cred)
 		return -ENOMEM;
 
-	override_cred->fsuid = override_cred->uid;
-	override_cred->fsgid = override_cred->gid;
+	override_cred->kfsuid = override_cred->uid;
+	override_cred->kfsgid = override_cred->gid;
+	override_cred->fsuid = kuid_to_kfsuid(override_cred->user_ns, override_cred->uid);
+	override_cred->fsgid = kgid_to_kfsgid(override_cred->user_ns, override_cred->gid);
 
 	if (!issecure(SECURE_NO_SETUID_FIXUP)) {
 		/* Clear the capabilities if we switch to a non-root user */
@@ -626,8 +629,13 @@ static int chown_common(const struct path *path, uid_t user, gid_t group)
 	kuid_t uid;
 	kgid_t gid;
 
-	uid = make_kuid(current_user_ns(), user);
-	gid = make_kgid(current_user_ns(), group);
+	if (is_userns_visible(inode->i_sb->s_iflags)) {
+		uid = make_kuid(current_user_ns(), user);
+		gid = make_kgid(current_user_ns(), group);
+	} else {
+		uid = make_kfsuid(current_user_ns(), user);
+		gid = make_kfsgid(current_user_ns(), group);
+	}
 
 retry_deleg:
 	newattrs.ia_valid =  ATTR_CTIME;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 15/25] posix_acl: handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (13 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 14/25] open: " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 22:26   ` Christoph Hellwig
  2020-02-18 14:34 ` [PATCH v3 16/25] attr: notify_change(): " Christian Brauner
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch posix_acls() to lookup fsids in the fsid mappings. If no fsid
mappings are setup the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Afaict, all filesystems that share a superblock in all user namespaces
currently do not support acls so this change should be safe to do
unconditionally.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 fs/posix_acl.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 249672bf54fe..ed6112c9b804 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -22,6 +22,7 @@
 #include <linux/xattr.h>
 #include <linux/export.h>
 #include <linux/user_namespace.h>
+#include <linux/fsuidgid.h>
 
 static struct posix_acl **acl_by_type(struct inode *inode, int type)
 {
@@ -692,12 +693,12 @@ static void posix_acl_fix_xattr_userns(
 	for (end = entry + count; entry != end; entry++) {
 		switch(le16_to_cpu(entry->e_tag)) {
 		case ACL_USER:
-			uid = make_kuid(from, le32_to_cpu(entry->e_id));
-			entry->e_id = cpu_to_le32(from_kuid(to, uid));
+			uid = make_kfsuid(from, le32_to_cpu(entry->e_id));
+			entry->e_id = cpu_to_le32(from_kfsuid(to, uid));
 			break;
 		case ACL_GROUP:
-			gid = make_kgid(from, le32_to_cpu(entry->e_id));
-			entry->e_id = cpu_to_le32(from_kgid(to, gid));
+			gid = make_kfsgid(from, le32_to_cpu(entry->e_id));
+			entry->e_id = cpu_to_le32(from_kfsgid(to, gid));
 			break;
 		default:
 			break;
@@ -765,14 +766,14 @@ posix_acl_from_xattr(struct user_namespace *user_ns,
 
 			case ACL_USER:
 				acl_e->e_uid =
-					make_kuid(user_ns,
+					make_kfsuid(user_ns,
 						  le32_to_cpu(entry->e_id));
 				if (!uid_valid(acl_e->e_uid))
 					goto fail;
 				break;
 			case ACL_GROUP:
 				acl_e->e_gid =
-					make_kgid(user_ns,
+					make_kfsgid(user_ns,
 						  le32_to_cpu(entry->e_id));
 				if (!gid_valid(acl_e->e_gid))
 					goto fail;
@@ -817,11 +818,11 @@ posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
 		switch(acl_e->e_tag) {
 		case ACL_USER:
 			ext_entry->e_id =
-				cpu_to_le32(from_kuid(user_ns, acl_e->e_uid));
+				cpu_to_le32(from_kfsuid(user_ns, acl_e->e_uid));
 			break;
 		case ACL_GROUP:
 			ext_entry->e_id =
-				cpu_to_le32(from_kgid(user_ns, acl_e->e_gid));
+				cpu_to_le32(from_kfsgid(user_ns, acl_e->e_gid));
 			break;
 		default:
 			ext_entry->e_id = cpu_to_le32(ACL_UNDEFINED_ID);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 16/25] attr: notify_change(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (14 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 15/25] posix_acl: " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 17/25] commoncap: cap_bprm_set_creds(): " Christian Brauner
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch notify_change() to lookup fsids in the fsid mappings. If no fsid
mappings are setup the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 fs/attr.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..b3fe9d9582d2 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -17,6 +17,8 @@
 #include <linux/security.h>
 #include <linux/evm.h>
 #include <linux/ima.h>
+#include <linux/fsuidgid.h>
+#include <linux/fs.h>
 
 static bool chown_ok(const struct inode *inode, kuid_t uid)
 {
@@ -310,12 +312,21 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	 * Verify that uid/gid changes are valid in the target
 	 * namespace of the superblock.
 	 */
-	if (ia_valid & ATTR_UID &&
-	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
-		return -EOVERFLOW;
-	if (ia_valid & ATTR_GID &&
-	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
-		return -EOVERFLOW;
+	if (is_userns_visible(inode->i_sb->s_iflags)) {
+		if (ia_valid & ATTR_UID &&
+		    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+			return -EOVERFLOW;
+		if (ia_valid & ATTR_GID &&
+		    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+			return -EOVERFLOW;
+	} else {
+		if (ia_valid & ATTR_UID &&
+		    !kfsuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+			return -EOVERFLOW;
+		if (ia_valid & ATTR_GID &&
+		    !kfsgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+			return -EOVERFLOW;
+	}
 
 	/* Don't allow modifications of files with invalid uids or
 	 * gids unless those uids & gids are being made valid.
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 17/25] commoncap: cap_bprm_set_creds(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (15 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 16/25] attr: notify_change(): " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 18/25] commoncap: cap_task_fix_setuid(): " Christian Brauner
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

During exec the kfsids are currently reset to the effective kids. To retain the
same semantics with the introduction of fsid mappings, we lookup the userspace
effective id in the id mappings and translate the effective id into the
corresponding kfsid in the fsid mapping. This means, the behavior is unchanged
when no fsid mappings are setup and the semantics stay the same even when fsid
mappings are setup.

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Reset kfsids used for userns visible filesystems such as proc too.

/* v3 */
unchanged
---
 security/commoncap.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index f4ee0ae106b2..55e6cc24f887 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -24,6 +24,7 @@
 #include <linux/user_namespace.h>
 #include <linux/binfmts.h>
 #include <linux/personality.h>
+#include <linux/fsuidgid.h>
 
 /*
  * If a non-root user executes a setuid-root binary in
@@ -810,7 +811,10 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 	struct cred *new = bprm->cred;
 	bool effective = false, has_fcap = false, is_setid;
 	int ret;
-	kuid_t root_uid;
+	kuid_t root_uid, kfsuid;
+	kgid_t kfsgid;
+	uid_t fsuid;
+	gid_t fsgid;
 
 	if (WARN_ON(!cap_ambient_invariant_ok(old)))
 		return -EPERM;
@@ -847,8 +851,15 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 						   old->cap_permitted);
 	}
 
-	new->suid = new->fsuid = new->euid;
-	new->sgid = new->fsgid = new->egid;
+	fsuid = from_kuid_munged(new->user_ns, new->euid);
+	kfsuid = make_kfsuid(new->user_ns, fsuid);
+	new->suid = new->kfsuid = new->euid;
+	new->fsuid = kfsuid;
+
+	fsgid = from_kgid_munged(new->user_ns, new->egid);
+	kfsgid = make_kfsgid(new->user_ns, fsgid);
+	new->sgid = new->kfsgid = new->egid;
+	new->fsgid = kfsgid;
 
 	/* File caps or setid cancels ambient. */
 	if (has_fcap || is_setid)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 18/25] commoncap: cap_task_fix_setuid(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (16 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 17/25] commoncap: cap_bprm_set_creds(): " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 19/25] commoncap: handle fsid mappings with vfs caps Christian Brauner
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch cap_task_fix_setuid() to lookup fsids in the fsid mappings. If no fsid
mappings are setup the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 security/commoncap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 55e6cc24f887..0581c6aa8bdc 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -1062,7 +1062,7 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
 		 *          if not, we might be a bit too harsh here.
 		 */
 		if (!issecure(SECURE_NO_SETUID_FIXUP)) {
-			kuid_t root_uid = make_kuid(old->user_ns, 0);
+			kuid_t root_uid = make_kfsuid(old->user_ns, 0);
 			if (uid_eq(old->fsuid, root_uid) && !uid_eq(new->fsuid, root_uid))
 				new->cap_effective =
 					cap_drop_fs_set(new->cap_effective);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 19/25] commoncap: handle fsid mappings with vfs caps
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (17 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 18/25] commoncap: cap_task_fix_setuid(): " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-19 15:53   ` Jann Horn
  2020-02-18 14:34 ` [PATCH v3 20/25] exec: bprm_fill_uid(): handle fsid mappings Christian Brauner
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch vfs cap helpers to lookup fsids in the fsid mappings. If no fsid
mappings are setup the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 security/commoncap.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index 0581c6aa8bdc..d2259dc0450b 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -328,7 +328,7 @@ static bool rootid_owns_currentns(kuid_t kroot)
 		return false;
 
 	for (ns = current_user_ns(); ; ns = ns->parent) {
-		if (from_kuid(ns, kroot) == 0)
+		if (from_kfsuid(ns, kroot) == 0)
 			return true;
 		if (ns == &init_user_ns)
 			break;
@@ -411,11 +411,11 @@ int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
 
 	nscap = (struct vfs_ns_cap_data *) tmpbuf;
 	root = le32_to_cpu(nscap->rootid);
-	kroot = make_kuid(fs_ns, root);
+	kroot = make_kfsuid(fs_ns, root);
 
-	/* If the root kuid maps to a valid uid in current ns, then return
+	/* If the root kfsuid maps to a valid uid in current ns, then return
 	 * this as a nscap. */
-	mappedroot = from_kuid(current_user_ns(), kroot);
+	mappedroot = from_kfsuid(current_user_ns(), kroot);
 	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
 		if (alloc) {
 			*buffer = tmpbuf;
@@ -460,7 +460,7 @@ static kuid_t rootid_from_xattr(const void *value, size_t size,
 	if (size == XATTR_CAPS_SZ_3)
 		rootid = le32_to_cpu(nscap->rootid);
 
-	return make_kuid(task_ns, rootid);
+	return make_kfsuid(task_ns, rootid);
 }
 
 static bool validheader(size_t size, const struct vfs_cap_data *cap)
@@ -501,7 +501,7 @@ int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
 	if (!uid_valid(rootid))
 		return -EINVAL;
 
-	nsrootid = from_kuid(fs_ns, rootid);
+	nsrootid = from_kfsuid(fs_ns, rootid);
 	if (nsrootid == -1)
 		return -EINVAL;
 
@@ -600,7 +600,7 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 
 	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
 
-	rootkuid = make_kuid(fs_ns, 0);
+	rootkuid = make_kfsuid(fs_ns, 0);
 	switch (magic_etc & VFS_CAP_REVISION_MASK) {
 	case VFS_CAP_REVISION_1:
 		if (size != XATTR_CAPS_SZ_1)
@@ -616,7 +616,7 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 		if (size != XATTR_CAPS_SZ_3)
 			return -EINVAL;
 		tocopy = VFS_CAP_U32_3;
-		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
+		rootkuid = make_kfsuid(fs_ns, le32_to_cpu(nscaps->rootid));
 		break;
 
 	default:
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 20/25] exec: bprm_fill_uid(): handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (18 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 19/25] commoncap: handle fsid mappings with vfs caps Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 21/25] ptrace: adapt ptrace_may_access() to always uses unmapped fsids Christian Brauner
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Make sure that during suid/sgid binary execution we lookup the fsids in the
fsid mappings. If the kernel is compiled without fsid mappings or no fsid
mappings are setup the behavior is unchanged.

Assuming we have a binary in a given user namespace that is owned by 0:0 in the
given user namespace which appears as 300000:300000 on-disk in the initial user
namespace. Now assume we write an id mapping of 0 100000 100000 and an fsid
mapping for 0 300000 300000 in the user namespace. When we hit bprm_fill_uid()
during setid execution we will retrieve inode kuid=300000 and kgid=300000. We
first check whether there's an fsid mapping for these kids. In our scenario we
find that they map to fsuid=0 and fsgid=0 in the user namespace. Now we
translate them into kids in the id mapping. In our example they translate to
kuid=100000 and kgid=100000 which means the file will ultimately run as uid=0
and gid=0 in the user namespace and as uid=100000, gid=100000 in the initial
user namespace.
Let's alter the example and assume that there is an fsid mapping of 0 300000
300000 set up but no id mapping has been setup for the user namespace. In this
the last step of translating into a valid kid pair in the id mappings will fail
and we will behave as before and ignore the sid bits.

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch added
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Make sure that bprm_fill_uid() handles fsid mappings.

/* v3 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - Fix commit message.
---
 fs/exec.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..9e4a7e757cef 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,6 +62,7 @@
 #include <linux/oom.h>
 #include <linux/compat.h>
 #include <linux/vmalloc.h>
+#include <linux/fsuidgid.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1518,8 +1519,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 {
 	struct inode *inode;
 	unsigned int mode;
-	kuid_t uid;
-	kgid_t gid;
+	kuid_t uid, euid;
+	kgid_t gid, egid;
 
 	/*
 	 * Since this can be called multiple times (via prepare_binprm),
@@ -1551,18 +1552,30 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 	inode_unlock(inode);
 
 	/* We ignore suid/sgid if there are no mappings for them in the ns */
-	if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
-		 !kgid_has_mapping(bprm->cred->user_ns, gid))
+	if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
+		 !kfsgid_has_mapping(bprm->cred->user_ns, gid))
 		return;
 
+	if (mode & S_ISUID) {
+		euid = kfsuid_to_kuid(bprm->cred->user_ns, uid);
+		if (!uid_valid(euid))
+			return;
+	}
+
+	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
+		egid = kfsgid_to_kgid(bprm->cred->user_ns, gid);
+		if (!gid_valid(egid))
+			return;
+	}
+
 	if (mode & S_ISUID) {
 		bprm->per_clear |= PER_CLEAR_ON_SETID;
-		bprm->cred->euid = uid;
+		bprm->cred->euid = euid;
 	}
 
 	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
 		bprm->per_clear |= PER_CLEAR_ON_SETID;
-		bprm->cred->egid = gid;
+		bprm->cred->egid = egid;
 	}
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 21/25] ptrace: adapt ptrace_may_access() to always uses unmapped fsids
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (19 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 20/25] exec: bprm_fill_uid(): handle fsid mappings Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 22/25] devpts: handle fsid mappings Christian Brauner
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

ptrace_may_access() with PTRACE_MODE_FSCREDS is only used with proc and proc
wants to use the unmapped fsids.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch added

/* v3 */
unchanged
---
 kernel/ptrace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179508d6..3734713cc0dd 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -304,8 +304,8 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 		return 0;
 	rcu_read_lock();
 	if (mode & PTRACE_MODE_FSCREDS) {
-		caller_uid = cred->fsuid;
-		caller_gid = cred->fsgid;
+		caller_uid = cred->kfsuid;
+		caller_gid = cred->kfsgid;
 	} else {
 		/*
 		 * Using the euid would make more sense here, but something
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 22/25] devpts: handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (20 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 21/25] ptrace: adapt ptrace_may_access() to always uses unmapped fsids Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 23/25] keys: " Christian Brauner
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

When a uid or gid mount option is specified with devpts have it lookup the
corresponding kfsids in the fsid mappings. If no fsid mappings are setup the
behavior is unchanged, i.e. fsids are looked up in the id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
unchanged

/* v3 */
unchanged
---
 fs/devpts/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 42e5a766d33c..139958892572 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -24,6 +24,7 @@
 #include <linux/parser.h>
 #include <linux/fsnotify.h>
 #include <linux/seq_file.h>
+#include <linux/fsuidgid.h>
 
 #define DEVPTS_DEFAULT_MODE 0600
 /*
@@ -277,7 +278,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		case Opt_uid:
 			if (match_int(&args[0], &option))
 				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
+			uid = make_kfsuid(current_user_ns(), option);
 			if (!uid_valid(uid))
 				return -EINVAL;
 			opts->uid = uid;
@@ -286,7 +287,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		case Opt_gid:
 			if (match_int(&args[0], &option))
 				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
+			gid = make_kfsgid(current_user_ns(), option);
 			if (!gid_valid(gid))
 				return -EINVAL;
 			opts->gid = gid;
@@ -410,7 +411,7 @@ static int devpts_show_options(struct seq_file *seq, struct dentry *root)
 			   from_kuid_munged(&init_user_ns, opts->uid));
 	if (opts->setgid)
 		seq_printf(seq, ",gid=%u",
-			   from_kgid_munged(&init_user_ns, opts->gid));
+			   from_kfsgid_munged(&init_user_ns, opts->gid));
 	seq_printf(seq, ",mode=%03o", opts->mode);
 	seq_printf(seq, ",ptmxmode=%03o", opts->ptmxmode);
 	if (opts->max < NR_UNIX98_PTY_MAX)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 23/25] keys: handle fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (21 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 22/25] devpts: handle fsid mappings Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 14:34 ` [PATCH v3 24/25] sys: handle fsid mappings in set*id() calls Christian Brauner
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Similar to proc and sysfs let keys use kfsids which are always mapped according
to id mappings.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch not present

/* v3 */
patch added
- Jann Horn <jannh@google.com>:
  - Add patch to handle the keyrings.
---
 security/keys/key.c              |  2 +-
 security/keys/permission.c       |  4 ++--
 security/keys/process_keys.c     |  6 ++++--
 security/keys/request_key.c      | 10 +++++-----
 security/keys/request_key_auth.c |  2 +-
 5 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/security/keys/key.c b/security/keys/key.c
index 718bf7217420..bfb17e8210d7 100644
--- a/security/keys/key.c
+++ b/security/keys/key.c
@@ -923,7 +923,7 @@ key_ref_t key_create_or_update(key_ref_t keyring_ref,
 
 	/* allocate a new key */
 	key = key_alloc(index_key.type, index_key.description,
-			cred->fsuid, cred->fsgid, cred, perm, flags, NULL);
+			cred->kfsuid, cred->kfsgid, cred, perm, flags, NULL);
 	if (IS_ERR(key)) {
 		key_ref = ERR_CAST(key);
 		goto error_link_end;
diff --git a/security/keys/permission.c b/security/keys/permission.c
index 085f907b64ac..847187ca6b41 100644
--- a/security/keys/permission.c
+++ b/security/keys/permission.c
@@ -33,7 +33,7 @@ int key_task_permission(const key_ref_t key_ref, const struct cred *cred,
 	key = key_ref_to_ptr(key_ref);
 
 	/* use the second 8-bits of permissions for keys the caller owns */
-	if (uid_eq(key->uid, cred->fsuid)) {
+	if (uid_eq(key->uid, cred->kfsuid)) {
 		kperm = key->perm >> 16;
 		goto use_these_perms;
 	}
@@ -41,7 +41,7 @@ int key_task_permission(const key_ref_t key_ref, const struct cred *cred,
 	/* use the third 8-bits of permissions for keys the caller has a group
 	 * membership in common with */
 	if (gid_valid(key->gid) && key->perm & KEY_GRP_ALL) {
-		if (gid_eq(key->gid, cred->fsgid)) {
+		if (gid_eq(key->gid, cred->kfsgid)) {
 			kperm = key->perm >> 8;
 			goto use_these_perms;
 		}
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 09541de31f2f..32376f0fbb42 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -379,7 +379,7 @@ void key_fsuid_changed(struct cred *new_cred)
 	/* update the ownership of the thread keyring */
 	if (new_cred->thread_keyring) {
 		down_write(&new_cred->thread_keyring->sem);
-		new_cred->thread_keyring->uid = new_cred->fsuid;
+		new_cred->thread_keyring->uid = new_cred->kfsuid;
 		up_write(&new_cred->thread_keyring->sem);
 	}
 }
@@ -392,7 +392,7 @@ void key_fsgid_changed(struct cred *new_cred)
 	/* update the ownership of the thread keyring */
 	if (new_cred->thread_keyring) {
 		down_write(&new_cred->thread_keyring->sem);
-		new_cred->thread_keyring->gid = new_cred->fsgid;
+		new_cred->thread_keyring->gid = new_cred->kfsgid;
 		up_write(&new_cred->thread_keyring->sem);
 	}
 }
@@ -923,10 +923,12 @@ void key_change_session_keyring(struct callback_head *twork)
 	new-> euid	= old-> euid;
 	new-> suid	= old-> suid;
 	new->fsuid	= old->fsuid;
+	new->kfsuid	= old->kfsuid;
 	new->  gid	= old->  gid;
 	new-> egid	= old-> egid;
 	new-> sgid	= old-> sgid;
 	new->fsgid	= old->fsgid;
+	new->kfsgid	= old->kfsgid;
 	new->user	= get_uid(old->user);
 	new->user_ns	= get_user_ns(old->user_ns);
 	new->group_info	= get_group_info(old->group_info);
diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 957b9e3e1492..254a7c2f3fde 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -134,7 +134,7 @@ static int call_sbin_request_key(struct key *authkey, void *aux)
 	sprintf(desc, "_req.%u", key->serial);
 
 	cred = get_current_cred();
-	keyring = keyring_alloc(desc, cred->fsuid, cred->fsgid, cred,
+	keyring = keyring_alloc(desc, cred->kfsuid, cred->kfsgid, cred,
 				KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ,
 				KEY_ALLOC_QUOTA_OVERRUN, NULL, NULL);
 	put_cred(cred);
@@ -149,8 +149,8 @@ static int call_sbin_request_key(struct key *authkey, void *aux)
 		goto error_link;
 
 	/* record the UID and GID */
-	sprintf(uid_str, "%d", from_kuid(&init_user_ns, cred->fsuid));
-	sprintf(gid_str, "%d", from_kgid(&init_user_ns, cred->fsgid));
+	sprintf(uid_str, "%d", from_kuid(&init_user_ns, cred->kfsuid));
+	sprintf(gid_str, "%d", from_kgid(&init_user_ns, cred->kfsgid));
 
 	/* we say which key is under construction */
 	sprintf(key_str, "%d", key->serial);
@@ -390,7 +390,7 @@ static int construct_alloc_key(struct keyring_search_context *ctx,
 		perm |= KEY_POS_WRITE;
 
 	key = key_alloc(ctx->index_key.type, ctx->index_key.description,
-			ctx->cred->fsuid, ctx->cred->fsgid, ctx->cred,
+			ctx->cred->kfsuid, ctx->cred->kfsgid, ctx->cred,
 			perm, flags, NULL);
 	if (IS_ERR(key))
 		goto alloc_failed;
@@ -490,7 +490,7 @@ static struct key *construct_key_and_link(struct keyring_search_context *ctx,
 	if (ret)
 		goto error;
 
-	user = key_user_lookup(current_fsuid());
+	user = key_user_lookup(current_kfsuid());
 	if (!user) {
 		ret = -ENOMEM;
 		goto error_put_dest_keyring;
diff --git a/security/keys/request_key_auth.c b/security/keys/request_key_auth.c
index ecba39c93fd9..26808146897c 100644
--- a/security/keys/request_key_auth.c
+++ b/security/keys/request_key_auth.c
@@ -215,7 +215,7 @@ struct key *request_key_auth_new(struct key *target, const char *op,
 	sprintf(desc, "%x", target->serial);
 
 	authkey = key_alloc(&key_type_request_key_auth, desc,
-			    cred->fsuid, cred->fsgid, cred,
+			    cred->kfsuid, cred->kfsgid, cred,
 			    KEY_POS_VIEW | KEY_POS_READ | KEY_POS_SEARCH | KEY_POS_LINK |
 			    KEY_USR_VIEW, KEY_ALLOC_NOT_IN_QUOTA, NULL);
 	if (IS_ERR(authkey)) {
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 24/25] sys: handle fsid mappings in set*id() calls
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (22 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 23/25] keys: " Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-19 15:42   ` Jann Horn
  2020-02-18 14:34 ` [PATCH v3 25/25] selftests: add simple fsid mapping selftests Christian Brauner
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

Switch set*id() calls to lookup fsids in the fsid mappings. If no fsid mappings
are setup the behavior is unchanged, i.e. fsids are looked up in the id
mappings.

A caller can only setid() to a given id if the id maps to a valid kid in
both the id and fsid maps of the caller's user namespace. This is always the
case when no id mappings and fsid mappings have been written. It is also always
the case when an id mapping has been written which includes the target id and
but no fsid mappings have been written. All non-fsid mapping aware workloads
will thus work just as before.

During setr*id() calls the kfsid is set to the keid corresponding to the eid
that is requested by userspace. If the requested eid is -1 the kfsid is reset
to the current keid. For the latter case this means we need to lookup the
corresponding userspace eid corresponding to the current keid in the id
mappings and translate this eid into the corresponding kfsid in the fsid
mappings.

We require that a user must have a valid fsid mapping for the target id. This
is consistent with how the setid calls work today without fsid mappings.

The kfsid to cleanly handle userns visible filesystem is set as before.

Cc: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Christian Brauner <christian.brauner@ubuntu.com>:
  - set kfsid which is used when dealing with proc permission checking

/* v3 */
- Jann Horn <jannh@google.com>:
  - Squash all set*id() patches into a single patch and move this to be the
    last patch so we don't expose a half-done feature in the middle of this
    series.
---
 kernel/sys.c | 106 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 83 insertions(+), 23 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index f9bc5c303e3f..78592deee2d8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -59,6 +59,7 @@
 #include <linux/sched/cputime.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
+#include <linux/fsuidgid.h>
 #include <linux/cred.h>
 
 #include <linux/nospec.h>
@@ -353,7 +354,7 @@ long __sys_setregid(gid_t rgid, gid_t egid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kgid_t krgid, kegid;
+	kgid_t krgid, kegid, kfsgid;
 
 	krgid = make_kgid(ns, rgid);
 	kegid = make_kgid(ns, egid);
@@ -385,12 +386,20 @@ long __sys_setregid(gid_t rgid, gid_t egid)
 			new->egid = kegid;
 		else
 			goto error;
+		kfsgid = make_kfsgid(ns, egid);
+	} else {
+		kfsgid = kgid_to_kfsgid(new->user_ns, new->egid);
+	}
+	if (!gid_valid(kfsgid)) {
+		retval = -EINVAL;
+		goto error;
 	}
 
 	if (rgid != (gid_t) -1 ||
 	    (egid != (gid_t) -1 && !gid_eq(kegid, old->gid)))
 		new->sgid = new->egid;
-	new->fsgid = new->egid;
+	new->kfsgid = new->egid;
+	new->fsgid = kfsgid;
 
 	return commit_creds(new);
 
@@ -415,24 +424,31 @@ long __sys_setgid(gid_t gid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kgid_t kgid;
+	kgid_t kgid, kfsgid;
 
 	kgid = make_kgid(ns, gid);
 	if (!gid_valid(kgid))
 		return -EINVAL;
 
+	kfsgid = make_kfsgid(ns, gid);
+	if (!gid_valid(kfsgid))
+		return -EINVAL;
+
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
 	old = current_cred();
 
 	retval = -EPERM;
-	if (ns_capable(old->user_ns, CAP_SETGID))
-		new->gid = new->egid = new->sgid = new->fsgid = kgid;
-	else if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->sgid))
-		new->egid = new->fsgid = kgid;
-	else
+	if (ns_capable(old->user_ns, CAP_SETGID)) {
+		new->gid = new->egid = new->sgid = new->kfsgid = kgid;
+		new->fsgid = kfsgid;
+	} else if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->sgid)) {
+		new->egid = new->kfsgid = kgid;
+		new->fsgid = kfsgid;
+	} else {
 		goto error;
+	}
 
 	return commit_creds(new);
 
@@ -496,7 +512,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kuid_t kruid, keuid;
+	kuid_t kruid, keuid, kfsuid;
 
 	kruid = make_kuid(ns, ruid);
 	keuid = make_kuid(ns, euid);
@@ -527,6 +543,13 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 		    !uid_eq(old->suid, keuid) &&
 		    !ns_capable_setid(old->user_ns, CAP_SETUID))
 			goto error;
+		kfsuid = make_kfsuid(new->user_ns, euid);
+	} else {
+		kfsuid = kuid_to_kfsuid(new->user_ns, new->euid);
+	}
+	if (!uid_valid(kfsuid)) {
+		retval = -EINVAL;
+		goto error;
 	}
 
 	if (!uid_eq(new->uid, old->uid)) {
@@ -537,7 +560,8 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 	if (ruid != (uid_t) -1 ||
 	    (euid != (uid_t) -1 && !uid_eq(keuid, old->uid)))
 		new->suid = new->euid;
-	new->fsuid = new->euid;
+	new->kfsuid = new->euid;
+	new->fsuid = kfsuid;
 
 	retval = security_task_fix_setuid(new, old, LSM_SETID_RE);
 	if (retval < 0)
@@ -573,11 +597,16 @@ long __sys_setuid(uid_t uid)
 	struct cred *new;
 	int retval;
 	kuid_t kuid;
+	kuid_t kfsuid;
 
 	kuid = make_kuid(ns, uid);
 	if (!uid_valid(kuid))
 		return -EINVAL;
 
+	kfsuid = make_kfsuid(ns, uid);
+	if (!uid_valid(kfsuid))
+		return -EINVAL;
+
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
@@ -595,7 +624,8 @@ long __sys_setuid(uid_t uid)
 		goto error;
 	}
 
-	new->fsuid = new->euid = kuid;
+	new->kfsuid = new->euid = kuid;
+	new->fsuid = kfsuid;
 
 	retval = security_task_fix_setuid(new, old, LSM_SETID_ID);
 	if (retval < 0)
@@ -624,7 +654,7 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kuid_t kruid, keuid, ksuid;
+	kuid_t kruid, keuid, ksuid, kfsuid;
 
 	kruid = make_kuid(ns, ruid);
 	keuid = make_kuid(ns, euid);
@@ -666,11 +696,21 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 				goto error;
 		}
 	}
-	if (euid != (uid_t) -1)
+	if (euid != (uid_t) -1) {
 		new->euid = keuid;
+		kfsuid = make_kfsuid(ns, euid);
+	} else {
+		kfsuid = kuid_to_kfsuid(new->user_ns, new->euid);
+	}
+	if (!uid_valid(kfsuid)) {
+		return -EINVAL;
+		goto error;
+	}
+
 	if (suid != (uid_t) -1)
 		new->suid = ksuid;
-	new->fsuid = new->euid;
+	new->kfsuid = new->euid;
+	new->fsuid = kfsuid;
 
 	retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
 	if (retval < 0)
@@ -716,7 +756,7 @@ long __sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kgid_t krgid, kegid, ksgid;
+	kgid_t krgid, kegid, ksgid, kfsgid;
 
 	krgid = make_kgid(ns, rgid);
 	kegid = make_kgid(ns, egid);
@@ -749,11 +789,21 @@ long __sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 
 	if (rgid != (gid_t) -1)
 		new->gid = krgid;
-	if (egid != (gid_t) -1)
+	if (egid != (gid_t) -1) {
 		new->egid = kegid;
+		kfsgid = make_kfsgid(ns, egid);
+	} else {
+		kfsgid = kgid_to_kfsgid(new->user_ns, new->egid);
+	}
+	if (!gid_valid(kfsgid)) {
+		retval = -EINVAL;
+		goto error;
+	}
+
 	if (sgid != (gid_t) -1)
 		new->sgid = ksgid;
-	new->fsgid = new->egid;
+	new->kfsgid = new->egid;
+	new->fsgid = kfsgid;
 
 	return commit_creds(new);
 
@@ -799,15 +849,19 @@ long __sys_setfsuid(uid_t uid)
 	const struct cred *old;
 	struct cred *new;
 	uid_t old_fsuid;
-	kuid_t kuid;
+	kuid_t kuid, kfsuid;
 
 	old = current_cred();
-	old_fsuid = from_kuid_munged(old->user_ns, old->fsuid);
+	old_fsuid = from_kfsuid_munged(old->user_ns, old->fsuid);
 
-	kuid = make_kuid(old->user_ns, uid);
+	kuid = make_kfsuid(old->user_ns, uid);
 	if (!uid_valid(kuid))
 		return old_fsuid;
 
+	kfsuid = make_kuid(old->user_ns, uid);
+	if (!uid_valid(kfsuid))
+		return old_fsuid;
+
 	new = prepare_creds();
 	if (!new)
 		return old_fsuid;
@@ -817,6 +871,7 @@ long __sys_setfsuid(uid_t uid)
 	    ns_capable_setid(old->user_ns, CAP_SETUID)) {
 		if (!uid_eq(kuid, old->fsuid)) {
 			new->fsuid = kuid;
+			new->kfsuid = kfsuid;
 			if (security_task_fix_setuid(new, old, LSM_SETID_FS) == 0)
 				goto change_okay;
 		}
@@ -843,15 +898,19 @@ long __sys_setfsgid(gid_t gid)
 	const struct cred *old;
 	struct cred *new;
 	gid_t old_fsgid;
-	kgid_t kgid;
+	kgid_t kgid, kfsgid;
 
 	old = current_cred();
-	old_fsgid = from_kgid_munged(old->user_ns, old->fsgid);
+	old_fsgid = from_kfsgid_munged(old->user_ns, old->fsgid);
 
-	kgid = make_kgid(old->user_ns, gid);
+	kgid = make_kfsgid(old->user_ns, gid);
 	if (!gid_valid(kgid))
 		return old_fsgid;
 
+	kfsgid = make_kgid(old->user_ns, gid);
+	if (!gid_valid(kfsgid))
+		return old_fsgid;
+
 	new = prepare_creds();
 	if (!new)
 		return old_fsgid;
@@ -861,6 +920,7 @@ long __sys_setfsgid(gid_t gid)
 	    ns_capable(old->user_ns, CAP_SETGID)) {
 		if (!gid_eq(kgid, old->fsgid)) {
 			new->fsgid = kgid;
+			new->kfsgid = kfsgid;
 			goto change_okay;
 		}
 	}
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v3 25/25] selftests: add simple fsid mapping selftests
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (23 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 24/25] sys: handle fsid mappings in set*id() calls Christian Brauner
@ 2020-02-18 14:34 ` Christian Brauner
  2020-02-18 23:50 ` [PATCH v3 00/25] user_namespace: introduce fsid mappings James Bottomley
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-18 14:34 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, Christian Brauner

- Verify that fsid mappings cannot be written when if mappings have been
  written already.
- Set up an id mapping and an fsid mapping, create a file and compare ids in
  child and parent user namespace.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
patch not present

/* v3 */
patch added
---
 tools/testing/selftests/Makefile              |   1 +
 .../testing/selftests/user_namespace/Makefile |  11 +
 .../selftests/user_namespace/test_fsid_map.c  | 511 ++++++++++++++++++
 3 files changed, 523 insertions(+)
 create mode 100644 tools/testing/selftests/user_namespace/Makefile
 create mode 100644 tools/testing/selftests/user_namespace/test_fsid_map.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 63430e2664c2..49dcd21d2be7 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -60,6 +60,7 @@ endif
 TARGETS += tmpfs
 TARGETS += tpm2
 TARGETS += user
+TARGETS += user_namespace
 TARGETS += vm
 TARGETS += x86
 TARGETS += zram
diff --git a/tools/testing/selftests/user_namespace/Makefile b/tools/testing/selftests/user_namespace/Makefile
new file mode 100644
index 000000000000..3f89896f3285
--- /dev/null
+++ b/tools/testing/selftests/user_namespace/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wall
+
+all:
+
+TEST_GEN_PROGS += test_fsid_map
+
+include ../lib.mk
+
+$(OUTPUT)/test_fsid_map: test_fsid_map.c ../clone3/clone3_selftests.h
+
diff --git a/tools/testing/selftests/user_namespace/test_fsid_map.c b/tools/testing/selftests/user_namespace/test_fsid_map.c
new file mode 100644
index 000000000000..e278f137ff55
--- /dev/null
+++ b/tools/testing/selftests/user_namespace/test_fsid_map.c
@@ -0,0 +1,511 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <grp.h>
+#include <inttypes.h>
+#include <libgen.h>
+#include <stdbool.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/sched.h>
+#include <sys/fsuid.h>
+#include <sys/mount.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "../clone3/clone3_selftests.h"
+
+static int wait_for_pid(pid_t pid)
+{
+	int status, ret;
+
+again:
+	ret = waitpid(pid, &status, 0);
+	if (ret == -1) {
+		if (errno == EINTR)
+			goto again;
+
+		return -1;
+	}
+
+	if (!WIFEXITED(status))
+		return -1;
+
+	return WEXITSTATUS(status);
+}
+
+static int setid_userns_root(void)
+{
+	if (setuid(0))
+		return -1;
+	if (setgid(0))
+		return -1;
+
+	setfsuid(0);
+	setfsgid(0);
+
+	if (setfsuid(0))
+		return -1;
+
+	if (setfsgid(0))
+		return -1;
+
+	return 0;
+}
+
+enum idmap_type {
+	UID_MAP,
+	GID_MAP,
+	FSUID_MAP,
+	FSGID_MAP,
+};
+
+static ssize_t read_nointr(int fd, void *buf, size_t count)
+{
+	ssize_t ret;
+again:
+	ret = read(fd, buf, count);
+	if (ret < 0 && errno == EINTR)
+		goto again;
+
+	return ret;
+}
+
+static ssize_t write_nointr(int fd, const void *buf, size_t count)
+{
+	ssize_t ret;
+again:
+	ret = write(fd, buf, count);
+	if (ret < 0 && errno == EINTR)
+		goto again;
+
+	return ret;
+}
+
+static int write_id_mapping(enum idmap_type type, pid_t pid, const char *buf,
+			    size_t buf_size)
+{
+	int fd;
+	int ret;
+	char path[4096];
+
+	switch (type) {
+	case UID_MAP:
+		ret = snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
+		break;
+	case GID_MAP:
+		ret = snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
+		break;
+	case FSUID_MAP:
+		ret = snprintf(path, sizeof(path), "/proc/%d/fsuid_map", pid);
+		break;
+	case FSGID_MAP:
+		ret = snprintf(path, sizeof(path), "/proc/%d/fsgid_map", pid);
+		break;
+	default:
+		return -1;
+	}
+	if (ret < 0 || ret >= sizeof(path))
+		return -E2BIG;
+
+	fd = open(path, O_WRONLY);
+	if (fd < 0)
+		return -1;
+
+	ret = write_nointr(fd, buf, buf_size);
+	close(fd);
+	if (ret != buf_size)
+		return -1;
+
+	return 0;
+}
+
+const char id_map[] = "0 100000 100000";
+#define id_map_size (sizeof(id_map) - 1)
+
+const char fsid_map[] = "0 300000 100000";
+#define fsid_map_size (sizeof(fsid_map) - 1)
+
+int unix_send_fds_iov(int fd, int *sendfds, int num_sendfds, struct iovec *iov,
+		      size_t iovlen)
+{
+	char *cmsgbuf = NULL;
+	int ret;
+	struct msghdr msg;
+	struct cmsghdr *cmsg = NULL;
+	size_t cmsgbufsize = CMSG_SPACE(num_sendfds * sizeof(int));
+
+	memset(&msg, 0, sizeof(msg));
+
+	cmsgbuf = malloc(cmsgbufsize);
+	if (!cmsgbuf) {
+		errno = ENOMEM;
+		return -1;
+	}
+
+	msg.msg_control = cmsgbuf;
+	msg.msg_controllen = cmsgbufsize;
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(num_sendfds * sizeof(int));
+
+	msg.msg_controllen = cmsg->cmsg_len;
+
+	memcpy(CMSG_DATA(cmsg), sendfds, num_sendfds * sizeof(int));
+
+	msg.msg_iov = iov;
+	msg.msg_iovlen = iovlen;
+
+again:
+	ret = sendmsg(fd, &msg, MSG_NOSIGNAL);
+	if (ret < 0)
+		if (errno == EINTR)
+			goto again;
+
+	free(cmsgbuf);
+	return ret;
+}
+
+static int unix_send_fds(int fd, int *sendfds, int num_sendfds, void *data,
+			 size_t size)
+{
+	char buf[1] = {0};
+	struct iovec iov = {
+		.iov_base = data ? data : buf,
+		.iov_len = data ? size : sizeof(buf),
+	};
+	return unix_send_fds_iov(fd, sendfds, num_sendfds, &iov, 1);
+}
+
+static int unix_recv_fds_iov(int fd, int *recvfds, int num_recvfds,
+			     struct iovec *iov, size_t iovlen)
+{
+	char *cmsgbuf = NULL;
+	int ret;
+	struct msghdr msg;
+	struct cmsghdr *cmsg = NULL;
+	size_t cmsgbufsize = CMSG_SPACE(sizeof(struct ucred)) +
+			     CMSG_SPACE(num_recvfds * sizeof(int));
+
+	memset(&msg, 0, sizeof(msg));
+
+	cmsgbuf = malloc(cmsgbufsize);
+	if (!cmsgbuf) {
+		errno = ENOMEM;
+		return -1;
+	}
+
+	msg.msg_control = cmsgbuf;
+	msg.msg_controllen = cmsgbufsize;
+
+	msg.msg_iov = iov;
+	msg.msg_iovlen = iovlen;
+
+again:
+	ret = recvmsg(fd, &msg, 0);
+	if (ret < 0) {
+		if (errno == EINTR)
+			goto again;
+
+		goto out;
+	}
+	if (ret == 0)
+		goto out;
+
+	/*
+	 * If SO_PASSCRED is set we will always get a ucred message.
+	 */
+	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+		if (cmsg->cmsg_type != SCM_RIGHTS)
+			continue;
+
+		memset(recvfds, -1, num_recvfds * sizeof(int));
+		if (cmsg &&
+		    cmsg->cmsg_len == CMSG_LEN(num_recvfds * sizeof(int)) &&
+		    cmsg->cmsg_level == SOL_SOCKET)
+			memcpy(recvfds, CMSG_DATA(cmsg), num_recvfds * sizeof(int));
+		break;
+	}
+
+out:
+	free(cmsgbuf);
+	return ret;
+}
+
+static int unix_recv_fds(int fd, int *recvfds, int num_recvfds, void *data,
+			 size_t size)
+{
+	char buf[1] = {0};
+	struct iovec iov = {
+		.iov_base = data ? data : buf,
+		.iov_len = data ? size : sizeof(buf),
+	};
+	return unix_recv_fds_iov(fd, recvfds, num_recvfds, &iov, 1);
+}
+
+static bool has_expected_owner(int fd, uid_t uid, gid_t gid)
+{
+	int ret;
+	struct stat s;
+	ret = fstat(fd, &s);
+	return !ret && s.st_uid == uid && s.st_gid == gid;
+}
+
+static int make_file_cmp_owner(uid_t uid, gid_t gid)
+{
+	char template[] = P_tmpdir "/.fsid_map_test_XXXXXX";
+	int fd;
+
+	fd = mkstemp(template);
+	if (fd < 0)
+		return -1;
+	unlink(template);
+
+	if (!has_expected_owner(fd, uid, gid)) {
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+static void test_id_maps_imply_fsid_maps(void)
+{
+	int fret = EXIT_FAILURE;
+	ssize_t ret;
+	int fd = -EBADF;
+	pid_t pid;
+	int ipc[2];
+	struct clone_args args = {
+		.flags = CLONE_NEWUSER,
+		.exit_signal = SIGCHLD,
+	};
+
+	ret = socketpair(PF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc);
+	if (ret < 0)
+		ksft_exit_fail_msg("socketpair() failed\n");
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0) {
+		close(ipc[0]);
+		close(ipc[1]);
+		ksft_exit_fail_msg("clone3() failed\n");
+	}
+
+	if (pid == 0) {
+		int fd;
+		char buf;
+
+		close(ipc[1]);
+
+		ret = read_nointr(ipc[0], &buf, 1);
+		if (ret != 1)
+			ksft_exit_fail_msg("read_nointr() failed\n");
+
+		if (setid_userns_root())
+			ksft_exit_fail_msg("setid_userns_root() failed\n");
+
+		fd = make_file_cmp_owner(0, 0);
+		if (fd < 0)
+			ksft_exit_fail_msg("make_file_cmp_owner() failed\n");
+
+		if (unix_send_fds(ipc[0], &fd, 1, NULL, 0) < 0)
+			ksft_exit_fail_msg("unix_send_fds() failed\n");
+
+		exit(EXIT_SUCCESS);
+	}
+
+	close(ipc[0]);
+
+	ret = write_id_mapping(UID_MAP, pid, id_map, id_map_size);
+	if (ret) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	/* Must fail since a uid mapping has already been written. */
+	ret = write_id_mapping(FSUID_MAP, pid, fsid_map, fsid_map_size);
+	if (ret == 0) {
+		ksft_exit_fail_msg("unix_send_fds() succeeded\n");
+		goto kill_child;
+	}
+
+	ret = write_id_mapping(GID_MAP, pid, id_map, id_map_size);
+	if (ret) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	/* Must fail since a gid mapping has already been written. */
+	ret = write_id_mapping(FSGID_MAP, pid, fsid_map, fsid_map_size);
+	if (ret == 0) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	ret = write_nointr(ipc[1], "1", 1);
+	if (ret != 1) {
+		ksft_exit_fail_msg("write_nointr() failed\n");
+		goto kill_child;
+	}
+
+	if (unix_recv_fds(ipc[1], &fd, 1, NULL, 0) < 0) {
+		ksft_exit_fail_msg("unix_recv_fds() failed\n");
+		goto kill_child;
+	}
+
+	if (!has_expected_owner(fd, 100000, 100000)) {
+		ksft_exit_fail_msg("has_expected_owner() failed\n");
+		goto kill_child;
+	}
+
+	fret = EXIT_SUCCESS;
+
+wait_child:
+	ret = wait_for_pid(pid);
+	if (ret)
+		ksft_exit_fail_msg("wait_for_pid() failed\n");
+
+        if (fret == EXIT_SUCCESS)
+		return;
+	exit(fret);
+
+kill_child:
+	kill(pid, SIGKILL);
+	exit(EXIT_FAILURE);
+	goto wait_child;
+}
+
+static void test_fsid_maps_basic(void)
+{
+	int fret = EXIT_FAILURE;
+	ssize_t ret;
+	int fd = -EBADF;
+	pid_t pid;
+	int ipc[2];
+	struct clone_args args = {
+		.flags = CLONE_NEWUSER,
+		.exit_signal = SIGCHLD,
+	};
+
+	ret = socketpair(PF_LOCAL, SOCK_STREAM | SOCK_CLOEXEC, 0, ipc);
+	if (ret < 0)
+		ksft_exit_fail_msg("socketpair() failed\n");
+
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid < 0) {
+		close(ipc[0]);
+		close(ipc[1]);
+		ksft_exit_fail_msg("clone3() failed\n");
+	}
+
+	if (pid == 0) {
+		int fd;
+		char buf;
+
+		close(ipc[1]);
+
+		ret = read_nointr(ipc[0], &buf, 1);
+		if (ret != 1)
+			ksft_exit_fail_msg("read_nointr() failed\n");
+
+		if (setid_userns_root())
+			ksft_exit_fail_msg("setid_userns_root() failed\n");
+
+		fd = make_file_cmp_owner(0, 0);
+		if (fd < 0)
+			ksft_exit_fail_msg("make_file_cmp_owner() failed\n");
+
+		if (unix_send_fds(ipc[0], &fd, 1, NULL, 0) < 0)
+			ksft_exit_fail_msg("unix_send_fds() failed\n");
+
+		exit(EXIT_SUCCESS);
+	}
+
+	close(ipc[0]);
+
+	/* Must fail since a uid mapping has already been written. */
+	ret = write_id_mapping(FSUID_MAP, pid, fsid_map, fsid_map_size);
+	if (ret) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	ret = write_id_mapping(UID_MAP, pid, id_map, id_map_size);
+	if (ret) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	/* Must fail since a gid mapping has already been written. */
+	ret = write_id_mapping(FSGID_MAP, pid, fsid_map, fsid_map_size);
+	if (ret) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	ret = write_id_mapping(GID_MAP, pid, id_map, id_map_size);
+	if (ret) {
+		ksft_exit_fail_msg("unix_send_fds() failed\n");
+		goto kill_child;
+	}
+
+	ret = write_nointr(ipc[1], "1", 1);
+	if (ret != 1) {
+		ksft_exit_fail_msg("write_nointr() failed\n");
+		goto kill_child;
+	}
+
+	if (unix_recv_fds(ipc[1], &fd, 1, NULL, 0) < 0) {
+		ksft_exit_fail_msg("unix_recv_fds() failed\n");
+		goto kill_child;
+	}
+
+	if (!has_expected_owner(fd, 300000, 300000)) {
+		ksft_exit_fail_msg("has_expected_owner() failed\n");
+		goto kill_child;
+	}
+
+	fret = EXIT_SUCCESS;
+
+wait_child:
+	ret = wait_for_pid(pid);
+	if (ret)
+		ksft_exit_fail_msg("wait_for_pid() failed\n");
+
+        if (fret == EXIT_SUCCESS)
+		return;
+	exit(fret);
+
+kill_child:
+	kill(pid, SIGKILL);
+	exit(EXIT_FAILURE);
+	goto wait_child;
+}
+
+int main(int argc, char *argv[])
+{
+	if (getuid())
+		ksft_exit_skip("fsid mapping tests require root\n");
+
+	if (access("/proc/self/fsuid_map", F_OK))
+		ksft_exit_skip("fsid mappings not supported by this kernel\n");
+
+	test_clone3_supported();
+
+	test_id_maps_imply_fsid_maps();
+	test_fsid_maps_basic();
+
+	exit(EXIT_SUCCESS);
+}
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 11/25] inode: inode_owner_or_capable(): handle fsid mappings
  2020-02-18 14:33 ` [PATCH v3 11/25] inode: inode_owner_or_capable(): " Christian Brauner
@ 2020-02-18 22:25   ` Christoph Hellwig
  2020-02-19 12:29     ` Christian Brauner
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2020-02-18 22:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:57PM +0100, Christian Brauner wrote:
> +	if (is_userns_visible(inode->i_sb->s_iflags)) {
> +		if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
> +			return true;
> +	} else if (kfsuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER)) {

This adds some crazy long unreadable lines..

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 15/25] posix_acl: handle fsid mappings
  2020-02-18 14:34 ` [PATCH v3 15/25] posix_acl: " Christian Brauner
@ 2020-02-18 22:26   ` Christoph Hellwig
  2020-02-19 12:56     ` Christian Brauner
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2020-02-18 22:26 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:34:01PM +0100, Christian Brauner wrote:
> diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> index 249672bf54fe..ed6112c9b804 100644
> --- a/fs/posix_acl.c
> +++ b/fs/posix_acl.c
> @@ -22,6 +22,7 @@
>  #include <linux/xattr.h>
>  #include <linux/export.h>
>  #include <linux/user_namespace.h>
> +#include <linux/fsuidgid.h>
>  
>  static struct posix_acl **acl_by_type(struct inode *inode, int type)
>  {
> @@ -692,12 +693,12 @@ static void posix_acl_fix_xattr_userns(
>  	for (end = entry + count; entry != end; entry++) {
>  		switch(le16_to_cpu(entry->e_tag)) {
>  		case ACL_USER:
> -			uid = make_kuid(from, le32_to_cpu(entry->e_id));
> -			entry->e_id = cpu_to_le32(from_kuid(to, uid));
> +			uid = make_kfsuid(from, le32_to_cpu(entry->e_id));
> +			entry->e_id = cpu_to_le32(from_kfsuid(to, uid));
>  			break;
>  		case ACL_GROUP:
> -			gid = make_kgid(from, le32_to_cpu(entry->e_id));
> -			entry->e_id = cpu_to_le32(from_kgid(to, gid));
> +			gid = make_kfsgid(from, le32_to_cpu(entry->e_id));
> +			entry->e_id = cpu_to_le32(from_kfsgid(to, gid));
>  			break;

Before we touch this code any more it needs to move to the right place.
Poking into ACLs from generic xattr code is a complete layering
violation, and all this needs to be moved so that it is called by
the actual handlers called from the file systems.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (24 preceding siblings ...)
  2020-02-18 14:34 ` [PATCH v3 25/25] selftests: add simple fsid mapping selftests Christian Brauner
@ 2020-02-18 23:50 ` James Bottomley
  2020-02-19 12:27   ` Christian Brauner
  2020-02-19 15:33 ` Jann Horn
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 51+ messages in thread
From: James Bottomley @ 2020-02-18 23:50 UTC (permalink / raw)
  To: Christian Brauner, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn
  Cc: Kees Cook, Jonathan Corbet, linux-kernel, containers, smbarber,
	Seth Forshee, linux-security-module, Alexander Viro, linux-api,
	linux-fsdevel, Alexey Dobriyan

On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> In the usual case of running an unprivileged container we will have
> setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear
> as 0:0 inside the user namespace will be chowned to 100000:100000 on
> the host. This works, because whenever the kernel needs to do a
> filesystem access it will lookup the corresponding uid and gid in the
> idmapping tables of the container. Now think about the case where we
> want to have an id mapping of 0 100000 100000 but an on-disk mapping
> of 0 300000 100000 which is needed to e.g. share a single on-disk
> mapping with multiple containers that all have different id mappings.
> This will be problematic. Whenever a filesystem access is requested,
> the kernel will now try to lookup a mapping for 300000 in the id
> mapping tables of the user namespace but since there is none the
> files will appear to be owned by the overflow id, i.e. usually
> 65534:65534 or nobody:nogroup.
> 
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping
> exists, the corresponding files will have correct ownership.

So I did compile this up in order to run the shiftfs tests over it to
see how it coped with the various corner cases.  However, what I find
is it simply fails the fsid reverse mapping in the setup.  Trying to
use a simple uid of 0 100000 1000 and a fsid of 100000 0 1000 fails the
entry setuid(0) call because of this code:

long __sys_setuid(uid_t uid)
{
	struct user_namespace *ns =
current_user_ns();
	const struct cred *old;
	struct cred *new;
	int
retval;
	kuid_t kuid;
	kuid_t kfsuid;

	kuid = make_kuid(ns, uid);
	if
(!uid_valid(kuid))
		return -EINVAL;

	kfsuid = make_kfsuid(ns, uid);
	if
(!uid_valid(kfsuid))
		return -EINVAL;

which means you can't have a fsid mapping that doesn't have the same
domain as the uid mapping, meaning a reverse mapping isn't possible
because the range and domain have to be inverse and disjoint.

James


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure
  2020-02-18 14:33 ` [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure Christian Brauner
@ 2020-02-19  2:33   ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19  2:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:47PM +0100, Christian Brauner wrote:
> This introduces the infrastructure to setup fsid mappings which will be used in
> later patches.
> All new code depends on CONFIG_USER_NS_FSID=y. It currently defaults to "N".
> If CONFIG_USER_NS_FSID is not set, no new code is added.
> 
> In this patch fsuid_m_show() and fsgid_m_show() are introduced. They are
> identical to uid_m_show() and gid_m_show() until we introduce from_kfsuid() and
> from_kfsgid() in a follow-up patch.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> - Randy Dunlap <rdunlap@infradead.org>:
>   - Fix typo in USER_NS_FSID kconfig documentation.
> 
> /* v3 */
> unchanged
> ---
>  include/linux/user_namespace.h |  10 +++
>  init/Kconfig                   |  11 +++
>  kernel/user.c                  |  22 ++++++
>  kernel/user_namespace.c        | 122 +++++++++++++++++++++++++++++++++
>  4 files changed, 165 insertions(+)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6ef1c7109fc4..e44742b0cf8a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -56,6 +56,10 @@ enum ucount_type {
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> +#ifdef CONFIG_USER_NS_FSID
> +	struct uid_gid_map	fsuid_map;
> +	struct uid_gid_map	fsgid_map;
> +#endif
>  	struct uid_gid_map	projid_map;
>  	atomic_t		count;
>  	struct user_namespace	*parent;
> @@ -127,6 +131,12 @@ struct seq_operations;
>  extern const struct seq_operations proc_uid_seq_operations;
>  extern const struct seq_operations proc_gid_seq_operations;
>  extern const struct seq_operations proc_projid_seq_operations;
> +#ifdef CONFIG_USER_NS_FSID
> +extern const struct seq_operations proc_fsuid_seq_operations;
> +extern const struct seq_operations proc_fsgid_seq_operations;
> +extern ssize_t proc_fsuid_map_write(struct file *, const char __user *, size_t, loff_t *);
> +extern ssize_t proc_fsgid_map_write(struct file *, const char __user *, size_t, loff_t *);
> +#endif
>  extern ssize_t proc_uid_map_write(struct file *, const char __user *, size_t, loff_t *);
>  extern ssize_t proc_gid_map_write(struct file *, const char __user *, size_t, loff_t *);
>  extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t, loff_t *);
> diff --git a/init/Kconfig b/init/Kconfig
> index cfee56c151f1..d4d0beeba48f 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1111,6 +1111,17 @@ config USER_NS
>  
>  	  If unsure, say N.
>  
> +config USER_NS_FSID
> +	bool "User namespace fsid mappings"
> +	depends on USER_NS
> +	default n
> +	help
> +	  This allows containers to alter their filesystem id mappings.
> +	  With this containers with different id mappings can still share
> +	  the same filesystem.
> +
> +	  If unsure, say N.
> +
>  config PID_NS
>  	bool "PID Namespaces"
>  	default y
> diff --git a/kernel/user.c b/kernel/user.c
> index 5235d7f49982..2ccaea9b810b 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -55,6 +55,28 @@ struct user_namespace init_user_ns = {
>  			},
>  		},
>  	},
> +#ifdef CONFIG_USER_NS_FSID
> +	.fsuid_map = {
> +		.nr_extents = 1,
> +		{
> +			.extent[0] = {
> +				.first = 0,
> +				.lower_first = 0,
> +				.count = 4294967295U,
> +			},
> +		},
> +	},
> +	.fsgid_map = {
> +		.nr_extents = 1,
> +		{
> +			.extent[0] = {
> +				.first = 0,
> +				.lower_first = 0,
> +				.count = 4294967295U,
> +			},
> +		},
> +	},
> +#endif
>  	.count = ATOMIC_INIT(3),
>  	.owner = GLOBAL_ROOT_UID,
>  	.group = GLOBAL_ROOT_GID,
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 8eadadc478f9..cbdf456f95f0 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -191,6 +191,16 @@ static void free_user_ns(struct work_struct *work)
>  			kfree(ns->projid_map.forward);
>  			kfree(ns->projid_map.reverse);
>  		}
> +#ifdef CONFIG_USER_NS_FSID
> +		if (ns->fsgid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
> +			kfree(ns->fsgid_map.forward);
> +			kfree(ns->fsgid_map.reverse);
> +		}
> +		if (ns->fsuid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
> +			kfree(ns->fsuid_map.forward);
> +			kfree(ns->fsuid_map.reverse);
> +		}
> +#endif
>  		retire_userns_sysctls(ns);
>  		key_free_user_ns(ns);
>  		ns_free_inum(&ns->ns);
> @@ -637,6 +647,50 @@ static int projid_m_show(struct seq_file *seq, void *v)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_USER_NS_FSID
> +static int fsuid_m_show(struct seq_file *seq, void *v)
> +{
> +	struct user_namespace *ns = seq->private;
> +	struct uid_gid_extent *extent = v;
> +	struct user_namespace *lower_ns;
> +	uid_t lower;
> +
> +	lower_ns = seq_user_ns(seq);
> +	if ((lower_ns == ns) && lower_ns->parent)
> +		lower_ns = lower_ns->parent;
> +
> +	lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
> +
> +	seq_printf(seq, "%10u %10u %10u\n",
> +		extent->first,
> +		lower,
> +		extent->count);
> +
> +	return 0;
> +}
> +
> +static int fsgid_m_show(struct seq_file *seq, void *v)
> +{
> +	struct user_namespace *ns = seq->private;
> +	struct uid_gid_extent *extent = v;
> +	struct user_namespace *lower_ns;
> +	gid_t lower;
> +
> +	lower_ns = seq_user_ns(seq);
> +	if ((lower_ns == ns) && lower_ns->parent)
> +		lower_ns = lower_ns->parent;
> +
> +	lower = from_kgid(lower_ns, KGIDT_INIT(extent->lower_first));
> +
> +	seq_printf(seq, "%10u %10u %10u\n",
> +		extent->first,
> +		lower,
> +		extent->count);
> +
> +	return 0;
> +}
> +#endif
> +
>  static void *m_start(struct seq_file *seq, loff_t *ppos,
>  		     struct uid_gid_map *map)
>  {
> @@ -674,6 +728,22 @@ static void *projid_m_start(struct seq_file *seq, loff_t *ppos)
>  	return m_start(seq, ppos, &ns->projid_map);
>  }
>  
> +#ifdef CONFIG_USER_NS_FSID
> +static void *fsuid_m_start(struct seq_file *seq, loff_t *ppos)
> +{
> +	struct user_namespace *ns = seq->private;
> +
> +	return m_start(seq, ppos, &ns->fsuid_map);
> +}
> +
> +static void *fsgid_m_start(struct seq_file *seq, loff_t *ppos)
> +{
> +	struct user_namespace *ns = seq->private;
> +
> +	return m_start(seq, ppos, &ns->fsgid_map);
> +}
> +#endif
> +
>  static void *m_next(struct seq_file *seq, void *v, loff_t *pos)
>  {
>  	(*pos)++;
> @@ -706,6 +776,22 @@ const struct seq_operations proc_projid_seq_operations = {
>  	.show = projid_m_show,
>  };
>  
> +#ifdef CONFIG_USER_NS_FSID
> +const struct seq_operations proc_fsuid_seq_operations = {
> +	.start = fsuid_m_start,
> +	.stop = m_stop,
> +	.next = m_next,
> +	.show = fsuid_m_show,
> +};
> +
> +const struct seq_operations proc_fsgid_seq_operations = {
> +	.start = fsgid_m_start,
> +	.stop = m_stop,
> +	.next = m_next,
> +	.show = fsgid_m_show,
> +};
> +#endif
> +
>  static bool mappings_overlap(struct uid_gid_map *new_map,
>  			     struct uid_gid_extent *extent)
>  {
> @@ -1081,6 +1167,42 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
>  			 &ns->projid_map, &ns->parent->projid_map);
>  }
>  
> +#ifdef CONFIG_USER_NS_FSID
> +ssize_t proc_fsuid_map_write(struct file *file, const char __user *buf,
> +			     size_t size, loff_t *ppos)
> +{
> +	struct seq_file *seq = file->private_data;
> +	struct user_namespace *ns = seq->private;
> +	struct user_namespace *seq_ns = seq_user_ns(seq);
> +
> +	if (!ns->parent)
> +		return -EPERM;
> +
> +	if ((seq_ns != ns) && (seq_ns != ns->parent))
> +		return -EPERM;
> +
> +	return map_write(file, buf, size, ppos, CAP_SETUID, &ns->fsuid_map,
> +			 &ns->parent->fsuid_map);
> +}
> +
> +ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
> +			     size_t size, loff_t *ppos)
> +{
> +	struct seq_file *seq = file->private_data;
> +	struct user_namespace *ns = seq->private;
> +	struct user_namespace *seq_ns = seq_user_ns(seq);
> +
> +	if (!ns->parent)
> +		return -EPERM;
> +
> +	if ((seq_ns != ns) && (seq_ns != ns->parent))
> +		return -EPERM;
> +
> +	return map_write(file, buf, size, ppos, CAP_SETGID, &ns->fsgid_map,
> +			 &ns->parent->fsgid_map);
> +}
> +#endif
> +
>  static bool new_idmap_permitted(const struct file *file,
>  				struct user_namespace *ns, int cap_setid,
>  				struct uid_gid_map *new_map)
> -- 
> 2.25.0

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 02/25] proc: add /proc/<pid>/fsuid_map
  2020-02-18 14:33 ` [PATCH v3 02/25] proc: add /proc/<pid>/fsuid_map Christian Brauner
@ 2020-02-19  2:33   ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19  2:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:48PM +0100, Christian Brauner wrote:
> The /proc/<pid>/fsuid_map file can be written once to setup an fsuid mapping
> for a user namespace. Writing to this file has the same restrictions as writing
> to /proc/<pid>/fsuid_map:
> 
> root@e1-vm:/# cat /proc/13023/fsuid_map
>          0     300000     100000
> 
> Fsid mappings have always been around. They are currently always identical to
> the id mappings for a user namespace. This means, currently whenever an fsid
> needs to be looked up the kernel will use the id mapping of the user namespace.
> With the introduction of fsid mappings the kernel will now lookup fsids in the
> fsid mappings of the user namespace. If no fsid mapping exists the kernel will
> continue looking up fsids in the id mappings of the user namespace. Hence, if a
> system supports fsid mappings through /proc/<pid>/fs*id_map and a container
> runtime is not aware of fsid mappings it or does not use them it will it will
> continue to work just as before.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> unchanged
> 
> /* v3 */
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Fix grammar in commit message.
> ---
>  fs/proc/base.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index c7c64272b0fa..5fb28004663e 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2970,6 +2970,13 @@ static int proc_projid_map_open(struct inode *inode, struct file *file)
>  	return proc_id_map_open(inode, file, &proc_projid_seq_operations);
>  }
>  
> +#ifdef CONFIG_USER_NS_FSID
> +static int proc_fsuid_map_open(struct inode *inode, struct file *file)
> +{
> +	return proc_id_map_open(inode, file, &proc_fsuid_seq_operations);
> +}
> +#endif
> +
>  static const struct file_operations proc_uid_map_operations = {
>  	.open		= proc_uid_map_open,
>  	.write		= proc_uid_map_write,
> @@ -2994,6 +3001,16 @@ static const struct file_operations proc_projid_map_operations = {
>  	.release	= proc_id_map_release,
>  };
>  
> +#ifdef CONFIG_USER_NS_FSID
> +static const struct file_operations proc_fsuid_map_operations = {
> +	.open		= proc_fsuid_map_open,
> +	.write		= proc_fsuid_map_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= proc_id_map_release,
> +};
> +#endif
> +
>  static int proc_setgroups_open(struct inode *inode, struct file *file)
>  {
>  	struct user_namespace *ns = NULL;
> @@ -3176,6 +3193,9 @@ static const struct pid_entry tgid_base_stuff[] = {
>  	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
>  #endif
>  #ifdef CONFIG_USER_NS
> +#ifdef CONFIG_USER_NS_FSID
> +	REG("fsuid_map",  S_IRUGO|S_IWUSR, proc_fsuid_map_operations),
> +#endif
>  	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
>  	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
>  	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
> -- 
> 2.25.0

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 03/25] proc: add /proc/<pid>/fsgid_map
  2020-02-18 14:33 ` [PATCH v3 03/25] proc: add /proc/<pid>/fsgid_map Christian Brauner
@ 2020-02-19  2:33   ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19  2:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:49PM +0100, Christian Brauner wrote:
> The /proc/<pid>/fsgid_map file can be written once to setup an fsgid mapping
> for a user namespace. Writing to this file has the same restrictions as writing
> to /proc/<pid>/fsgid_map.
> 
> root@e1-vm:/# cat /proc/13023/fsgid_map
>          0     300000     100000
> 
> Fsid mappings have always been around. They are currently always identical to
> the id mappings for a user namespace. This means, currently whenever an fsid
> needs to be looked up the kernel will use the id mapping of the user namespace.
> With the introduction of fsid mappings the kernel will now lookup fsids in the
> fsid mappings of the user namespace. If no fsid mapping exists the kernel will
> continue looking up fsids in the id mappings of the user namespace. Hence, if a
> system supports fsid mappings through /proc/<pid>/fs*id_map and a container
> runtime is not aware of fsid mappings it or does not use them it will it will
> continue to work just as before.
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> unchanged
> 
> /* v3 */
> - Christian Brauner <christian.brauner@ubuntu.com>:
>   - Fix grammar in commit message.
> ---
>  fs/proc/base.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 5fb28004663e..1303cdd2e617 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2975,6 +2975,11 @@ static int proc_fsuid_map_open(struct inode *inode, struct file *file)
>  {
>  	return proc_id_map_open(inode, file, &proc_fsuid_seq_operations);
>  }
> +
> +static int proc_fsgid_map_open(struct inode *inode, struct file *file)
> +{
> +	return proc_id_map_open(inode, file, &proc_fsgid_seq_operations);
> +}
>  #endif
>  
>  static const struct file_operations proc_uid_map_operations = {
> @@ -3009,6 +3014,14 @@ static const struct file_operations proc_fsuid_map_operations = {
>  	.llseek		= seq_lseek,
>  	.release	= proc_id_map_release,
>  };
> +
> +static const struct file_operations proc_fsgid_map_operations = {
> +	.open		= proc_fsgid_map_open,
> +	.write		= proc_fsgid_map_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= proc_id_map_release,
> +};
>  #endif
>  
>  static int proc_setgroups_open(struct inode *inode, struct file *file)
> @@ -3195,6 +3208,7 @@ static const struct pid_entry tgid_base_stuff[] = {
>  #ifdef CONFIG_USER_NS
>  #ifdef CONFIG_USER_NS_FSID
>  	REG("fsuid_map",  S_IRUGO|S_IWUSR, proc_fsuid_map_operations),
> +	REG("fsgid_map",  S_IRUGO|S_IWUSR, proc_fsgid_map_operations),
>  #endif
>  	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
>  	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
> -- 
> 2.25.0

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 05/25] user_namespace: refactor map_write()
  2020-02-18 14:33 ` [PATCH v3 05/25] user_namespace: refactor map_write() Christian Brauner
@ 2020-02-19  2:35   ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19  2:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:51PM +0100, Christian Brauner wrote:
> Refactor map_write() to prepare for adding fsid mappings support. This mainly
> factors out various open-coded parts into helpers that can be reused in the
> follow up patch.
> 
> Cc: Jann Horn <jannh@google.com>
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> patch not present
> 
> /* v3 */
> patch added
> - Jann Horn <jannh@google.com>:
>   - Split changes to map_write() to implement fsid mappings into three separate
>     patches: basic fsid helpers, preparatory changes to map_write(), actual
>     fsid mapping support in map_write().
> ---
>  kernel/user_namespace.c | 117 +++++++++++++++++++++++++---------------
>  1 file changed, 74 insertions(+), 43 deletions(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 2cfd1e519cc4..e91141262bcc 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -1038,10 +1038,10 @@ static int cmp_extents_reverse(const void *a, const void *b)
>  }
>  
>  /**
> - * sort_idmaps - Sorts an array of idmap entries.
> + * sort_map - Sorts an array of idmap entries.
>   * Can only be called if number of mappings exceeds UID_GID_MAP_MAX_BASE_EXTENTS.
>   */
> -static int sort_idmaps(struct uid_gid_map *map)
> +static int sort_map(struct uid_gid_map *map)
>  {
>  	if (map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
>  		return 0;
> @@ -1064,6 +1064,71 @@ static int sort_idmaps(struct uid_gid_map *map)
>  	return 0;
>  }
>  
> +static int sort_idmaps(struct uid_gid_map *map)
> +{
> +	return sort_map(map);
> +}
> +
> +static int map_from_parent(struct uid_gid_map *new_map,
> +			   struct uid_gid_map *parent_map)
> +{
> +	unsigned idx;
> +
> +	/* Map the lower ids from the parent user namespace to the
> +	 * kernel global id space.
> +	 */
> +	for (idx = 0; idx < new_map->nr_extents; idx++) {
> +		struct uid_gid_extent *e;
> +		u32 lower_first;
> +
> +		if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> +			e = &new_map->extent[idx];
> +		else
> +			e = &new_map->forward[idx];
> +
> +		lower_first = map_id_range_down(parent_map, e->lower_first, e->count);
> +
> +		/* Fail if we can not map the specified extent to
> +		 * the kernel global id space.
> +		 */
> +		if (lower_first == (u32)-1)
> +			return -EPERM;
> +
> +		e->lower_first = lower_first;
> +	}
> +
> +	return 0;
> +}
> +
> +static int map_into_kids(struct uid_gid_map *id_map,
> +			 struct uid_gid_map *parent_id_map)
> +{
> +	return map_from_parent(id_map, parent_id_map);
> +}
> +
> +static void install_idmaps(struct uid_gid_map *id_map,
> +			   struct uid_gid_map *new_id_map)
> +{
> +	if (new_id_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
> +		memcpy(id_map->extent, new_id_map->extent,
> +		       new_id_map->nr_extents * sizeof(new_id_map->extent[0]));
> +	} else {
> +		id_map->forward = new_id_map->forward;
> +		id_map->reverse = new_id_map->reverse;
> +	}
> +}
> +
> +static void free_idmaps(struct uid_gid_map *new_id_map)
> +{
> +	if (new_id_map->nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
> +		kfree(new_id_map->forward);
> +		kfree(new_id_map->reverse);
> +		new_id_map->forward = NULL;
> +		new_id_map->reverse = NULL;
> +		new_id_map->nr_extents = 0;
> +	}
> +}
> +
>  static ssize_t map_write(struct file *file, const char __user *buf,
>  			 size_t count, loff_t *ppos,
>  			 int cap_setid,
> @@ -1073,7 +1138,6 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	struct seq_file *seq = file->private_data;
>  	struct user_namespace *ns = seq->private;
>  	struct uid_gid_map new_map;
> -	unsigned idx;
>  	struct uid_gid_extent extent;
>  	char *kbuf = NULL, *pos, *next_line;
>  	ssize_t ret;
> @@ -1191,61 +1255,28 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
>  		goto out;
>  
> -	ret = -EPERM;
> -	/* Map the lower ids from the parent user namespace to the
> -	 * kernel global id space.
> -	 */
> -	for (idx = 0; idx < new_map.nr_extents; idx++) {
> -		struct uid_gid_extent *e;
> -		u32 lower_first;
> -
> -		if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> -			e = &new_map.extent[idx];
> -		else
> -			e = &new_map.forward[idx];
> -
> -		lower_first = map_id_range_down(parent_map,
> -						e->lower_first,
> -						e->count);
> -
> -		/* Fail if we can not map the specified extent to
> -		 * the kernel global id space.
> -		 */
> -		if (lower_first == (u32) -1)
> -			goto out;
> -
> -		e->lower_first = lower_first;
> -	}
> +	ret = map_into_kids(&new_map, parent_map);
> +	if (ret)
> +		goto out;
>  
>  	/*
>  	 * If we want to use binary search for lookup, this clones the extent
>  	 * array and sorts both copies.
>  	 */
>  	ret = sort_idmaps(&new_map);
> -	if (ret < 0)
> +	if (ret)
>  		goto out;
>  
>  	/* Install the map */
> -	if (new_map.nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS) {
> -		memcpy(map->extent, new_map.extent,
> -		       new_map.nr_extents * sizeof(new_map.extent[0]));
> -	} else {
> -		map->forward = new_map.forward;
> -		map->reverse = new_map.reverse;
> -	}
> +	install_idmaps(map, &new_map);
>  	smp_wmb();
>  	map->nr_extents = new_map.nr_extents;
>  
>  	*ppos = count;
>  	ret = count;
>  out:
> -	if (ret < 0 && new_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
> -		kfree(new_map.forward);
> -		kfree(new_map.reverse);
> -		map->forward = NULL;
> -		map->reverse = NULL;
> -		map->nr_extents = 0;
> -	}
> +	if (ret < 0)
> +		free_idmaps(&new_map);
>  
>  	mutex_unlock(&userns_state_mutex);
>  	kfree(kbuf);
> -- 
> 2.25.0

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 07/25] proc: task_state(): use from_kfs{g,u}id_munged
  2020-02-18 14:33 ` [PATCH v3 07/25] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
@ 2020-02-19  2:36   ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19  2:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:53PM +0100, Christian Brauner wrote:
> If fsid mappings have been written, this will cause proc to look at fsid
> mappings for the user namespace. If no fsid mappings have been written the
> behavior is as before.
> 
> Here is part of the output from /proc/<pid>/status from the initial user
> namespace for systemd running in an unprivileged container as user namespace
> root with id mapping 0 100000 100000 and fsid mapping 0 300000 100000:
> 
> Name:   systemd
> Umask:  0000
> State:  S (sleeping)
> Tgid:   13023
> Ngid:   0
> Pid:    13023
> PPid:   13008
> TracerPid:      0
> Uid:    100000  100000  100000  300000
> Gid:    100000  100000  100000  300000
> FDSize: 64
> Groups:
> 
> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

Acked-by: Serge Hallyn <serge@hallyn.com>

> ---
> /* v2 */
> unchanged
> 
> /* v3 */
> unchanged
> ---
>  fs/proc/array.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 5efaf3708ec6..d4a04f85a67e 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -91,6 +91,7 @@
>  #include <linux/string_helpers.h>
>  #include <linux/user_namespace.h>
>  #include <linux/fs_struct.h>
> +#include <linux/fsuidgid.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/processor.h>
> @@ -193,11 +194,11 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  	seq_put_decimal_ull(m, "\nUid:\t", from_kuid_munged(user_ns, cred->uid));
>  	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->euid));
>  	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->suid));
> -	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->fsuid));
> +	seq_put_decimal_ull(m, "\t", from_kfsuid_munged(user_ns, cred->fsuid));
>  	seq_put_decimal_ull(m, "\nGid:\t", from_kgid_munged(user_ns, cred->gid));
>  	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->egid));
>  	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->sgid));
> -	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->fsgid));
> +	seq_put_decimal_ull(m, "\t", from_kfsgid_munged(user_ns, cred->fsgid));
>  	seq_put_decimal_ull(m, "\nFDSize:\t", max_fds);
>  
>  	seq_puts(m, "\nGroups:\t");
> -- 
> 2.25.0

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 09/25] fs: add is_userns_visible() helper
  2020-02-18 14:33 ` [PATCH v3 09/25] fs: add is_userns_visible() helper Christian Brauner
@ 2020-02-19  2:42   ` Serge E. Hallyn
  2020-02-19 12:06     ` Christian Brauner
  0 siblings, 1 reply; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19  2:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:55PM +0100, Christian Brauner wrote:
> Introduce a helper which makes it possible to detect fileystems whose
> superblock is visible in multiple user namespace. This currently only
> means proc and sys. Such filesystems usually have special semantics so their
> behavior will not be changed with the introduction of fsid mappings.

Hi,

I'm afraid I've got a bit of a hangup about the terminology here.  I
*think* what you mean is that SB_I_USERNS_VISIBLE is an fs whose uids are
always translated per the id mappings, not fsid mappings.  But when I see
the name it seems to imply that !SB_I_USERNS_VISIBLE filesystems can't
be seen by other namespaces at all.

Am I right in my first interpretation?  If so, can we talk about the
naming?

> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> ---
> /* v2 */
> unchanged
> 
> /* v3 */
> unchanged
> ---
>  include/linux/fs.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..fdc8fb2d786b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3651,4 +3651,9 @@ static inline int inode_drain_writes(struct inode *inode)
>  	return filemap_write_and_wait(inode->i_mapping);
>  }
>  
> +static inline bool is_userns_visible(unsigned long iflags)
> +{
> +	return (iflags & SB_I_USERNS_VISIBLE);
> +}
> +
>  #endif /* _LINUX_FS_H */
> -- 
> 2.25.0

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 09/25] fs: add is_userns_visible() helper
  2020-02-19  2:42   ` Serge E. Hallyn
@ 2020-02-19 12:06     ` Christian Brauner
  2020-02-19 17:18       ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-19 12:06 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	James Morris, Kees Cook, Jonathan Corbet, Phil Estes,
	linux-kernel, linux-fsdevel, containers, linux-security-module,
	linux-api

On Tue, Feb 18, 2020 at 08:42:33PM -0600, Serge Hallyn wrote:
> On Tue, Feb 18, 2020 at 03:33:55PM +0100, Christian Brauner wrote:
> > Introduce a helper which makes it possible to detect fileystems whose
> > superblock is visible in multiple user namespace. This currently only
> > means proc and sys. Such filesystems usually have special semantics so their
> > behavior will not be changed with the introduction of fsid mappings.
> 
> Hi,
> 
> I'm afraid I've got a bit of a hangup about the terminology here.  I
> *think* what you mean is that SB_I_USERNS_VISIBLE is an fs whose uids are
> always translated per the id mappings, not fsid mappings.  But when I see

Correct!

> the name it seems to imply that !SB_I_USERNS_VISIBLE filesystems can't
> be seen by other namespaces at all.
> 
> Am I right in my first interpretation?  If so, can we talk about the
> naming?

Yep, your first interpretation is right. What about: wants_idmaps()

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-18 23:50 ` [PATCH v3 00/25] user_namespace: introduce fsid mappings James Bottomley
@ 2020-02-19 12:27   ` Christian Brauner
  2020-02-19 15:36     ` James Bottomley
  0 siblings, 1 reply; 51+ messages in thread
From: Christian Brauner @ 2020-02-19 12:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	Kees Cook, Jonathan Corbet, linux-api, containers, linux-kernel,
	smbarber, Seth Forshee, linux-security-module, Alexander Viro,
	linux-fsdevel, Alexey Dobriyan

On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear
> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
> > the host. This works, because whenever the kernel needs to do a
> > filesystem access it will lookup the corresponding uid and gid in the
> > idmapping tables of the container. Now think about the case where we
> > want to have an id mapping of 0 100000 100000 but an on-disk mapping
> > of 0 300000 100000 which is needed to e.g. share a single on-disk
> > mapping with multiple containers that all have different id mappings.
> > This will be problematic. Whenever a filesystem access is requested,
> > the kernel will now try to lookup a mapping for 300000 in the id
> > mapping tables of the user namespace but since there is none the
> > files will appear to be owned by the overflow id, i.e. usually
> > 65534:65534 or nobody:nogroup.
> > 
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping
> > exists, the corresponding files will have correct ownership.
> 
> So I did compile this up in order to run the shiftfs tests over it to
> see how it coped with the various corner cases.  However, what I find
> is it simply fails the fsid reverse mapping in the setup.  Trying to
> use a simple uid of 0 100000 1000 and a fsid of 100000 0 1000 fails the
> entry setuid(0) call because of this code:

This is easy to fix. But what's the exact use-case?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 11/25] inode: inode_owner_or_capable(): handle fsid mappings
  2020-02-18 22:25   ` Christoph Hellwig
@ 2020-02-19 12:29     ` Christian Brauner
  0 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-19 12:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 02:25:23PM -0800, Christoph Hellwig wrote:
> On Tue, Feb 18, 2020 at 03:33:57PM +0100, Christian Brauner wrote:
> > +	if (is_userns_visible(inode->i_sb->s_iflags)) {
> > +		if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
> > +			return true;
> > +	} else if (kfsuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER)) {
> 
> This adds some crazy long unreadable lines..

I'll ad a helper in the next version or wrap those lines depending on
what makes more sense.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 15/25] posix_acl: handle fsid mappings
  2020-02-18 22:26   ` Christoph Hellwig
@ 2020-02-19 12:56     ` Christian Brauner
  0 siblings, 0 replies; 51+ messages in thread
From: Christian Brauner @ 2020-02-19 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 02:26:31PM -0800, Christoph Hellwig wrote:
> On Tue, Feb 18, 2020 at 03:34:01PM +0100, Christian Brauner wrote:
> > diff --git a/fs/posix_acl.c b/fs/posix_acl.c
> > index 249672bf54fe..ed6112c9b804 100644
> > --- a/fs/posix_acl.c
> > +++ b/fs/posix_acl.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/xattr.h>
> >  #include <linux/export.h>
> >  #include <linux/user_namespace.h>
> > +#include <linux/fsuidgid.h>
> >  
> >  static struct posix_acl **acl_by_type(struct inode *inode, int type)
> >  {
> > @@ -692,12 +693,12 @@ static void posix_acl_fix_xattr_userns(
> >  	for (end = entry + count; entry != end; entry++) {
> >  		switch(le16_to_cpu(entry->e_tag)) {
> >  		case ACL_USER:
> > -			uid = make_kuid(from, le32_to_cpu(entry->e_id));
> > -			entry->e_id = cpu_to_le32(from_kuid(to, uid));
> > +			uid = make_kfsuid(from, le32_to_cpu(entry->e_id));
> > +			entry->e_id = cpu_to_le32(from_kfsuid(to, uid));
> >  			break;
> >  		case ACL_GROUP:
> > -			gid = make_kgid(from, le32_to_cpu(entry->e_id));
> > -			entry->e_id = cpu_to_le32(from_kgid(to, gid));
> > +			gid = make_kfsgid(from, le32_to_cpu(entry->e_id));
> > +			entry->e_id = cpu_to_le32(from_kfsgid(to, gid));
> >  			break;
> 
> Before we touch this code any more it needs to move to the right place.
> Poking into ACLs from generic xattr code is a complete layering

git history shows that it was deliberately placed after the fs specific
xattr handlers have been called so individual filesystems don't need to
be aware of id mappings to make maintenance easier. Same goes for vfs
capabilities. Moving this down into individual filesystem seems like a
maintenance nightmare where now each individual filesystem will have to
remember to fixup their ids. For namespaced vfs caps which are handled
at the same level in setxattr() it will also mean breaking backwards
compatible translation from non-namespaced vfs caps aware userspace to
namespaced vfs-caps aware kernels.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (25 preceding siblings ...)
  2020-02-18 23:50 ` [PATCH v3 00/25] user_namespace: introduce fsid mappings James Bottomley
@ 2020-02-19 15:33 ` Jann Horn
  2020-02-19 19:35 ` Serge E. Hallyn
  2020-02-27 19:33 ` Josef Bacik
  28 siblings, 0 replies; 51+ messages in thread
From: Jann Horn @ 2020-02-19 15:33 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai,
	Stephen Barber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, kernel list, linux-fsdevel, Linux Containers,
	linux-security-module, Linux API

On Tue, Feb 18, 2020 at 3:35 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
[...]
> - Let the keyctl infrastructure only operate on kfsid which are always
>   mapped/looked up in the id mappings similar to what we do for
>   filesystems that have the same superblock visible in multiple user
>   namespaces.
>
> This version also comes with minimal tests which I intend to expand in
> the future.
>
> From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> adresses at hand but I've at least added Phil now. :)
>
> This is the implementation of shiftfs which was cooked up during lunch at
> Linux Plumbers 2019 the day after the container's microconference. The
> idea is a design-stew from Stéphane, Aleksa, Eric, and myself (and by
> now also Jann.
> Back then we all were quite busy with other work and couldn't really sit
> down and implement it. But I took a few days last week to do this work,
> including demos and performance testing.
> This implementation does not require us to touch the VFS substantially
> at all. Instead, we implement shiftfs via fsid mappings.
> With this patch, it took me 20 mins to port both LXD and LXC to support
> shiftfs via fsid mappings.
[...]

Can you please grep through the kernel for all uses of ->fsuid and
->fsgid and fix them up appropriately? Some cases I still see:


The SafeSetID LSM wants to enforce that you can only use CAP_SETUID to
gain the privileges of a specific set of IDs:

static int safesetid_task_fix_setuid(struct cred *new,
                                     const struct cred *old,
                                     int flags)
{

        /* Do nothing if there are no setuid restrictions for our old RUID. */
        if (setuid_policy_lookup(old->uid, INVALID_UID) == SIDPOL_DEFAULT)
                return 0;

        if (uid_permitted_for_cred(old, new->uid) &&
            uid_permitted_for_cred(old, new->euid) &&
            uid_permitted_for_cred(old, new->suid) &&
            uid_permitted_for_cred(old, new->fsuid))
                return 0;

        /*
         * Kill this process to avoid potential security vulnerabilities
         * that could arise from a missing whitelist entry preventing a
         * privileged process from dropping to a lesser-privileged one.
         */
        force_sig(SIGKILL);
        return -EACCES;
}

This could theoretically be bypassed through setfsuid() if the kuid
based on the fsuid mappings is permitted but the kuid based on the
normal mappings is not.


fs/coredump.c in suid dump mode uses "cred->fsuid = GLOBAL_ROOT_UID";
this should probably also fix up the other uid, even if there is no
scenario in which it would actually be used at the moment?


The netfilter xt_owner stuff makes packet filtering decisions based on
the ->fsuid; it might be better to filter on the ->kfsuid so that you
can filter traffic from different user namespaces differently?


audit_log_task_info() is doing "from_kuid(&init_user_ns, cred->fsuid)".

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-19 12:27   ` Christian Brauner
@ 2020-02-19 15:36     ` James Bottomley
  0 siblings, 0 replies; 51+ messages in thread
From: James Bottomley @ 2020-02-19 15:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	Kees Cook, Jonathan Corbet, linux-api, containers, linux-kernel,
	smbarber, Seth Forshee, linux-security-module, Alexander Viro,
	linux-fsdevel, Alexey Dobriyan

On Wed, 2020-02-19 at 13:27 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
[...]
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> > 
> > So I did compile this up in order to run the shiftfs tests over it
> > to see how it coped with the various corner cases.  However, what I
> > find is it simply fails the fsid reverse mapping in the
> > setup.  Trying to use a simple uid of 0 100000 1000 and a fsid of
> > 100000 0 1000 fails the entry setuid(0) call because of this code:
> 
> This is easy to fix. But what's the exact use-case?

Well, the use case I'm looking to solve is the same one it's always
been: getting a deprivileged fake root in a user_ns to be able to write
an image at fsuid 0.

I don't think it's solvable in your current framework, although
allowing the domain to be disjoint might possibly hack around it.  The
problem with the proposed framework is that there are no backshifts
from the filesystem view, there are only forward shifts to the
filesystem view.  This means that to get your framework to write a
filesystem at fsuid 0 you have to have an identity map for fsuid. Which
I can do: I tested uid shift 0 100000 1000 and fsuid shift 0 0 1000. 
It does all work, as you'd expect because the container has real fs
root not a fake root.  And that's the whole problem:  Firstly, I'm fs
root for any filesystem my userns can see, so any imprecision in
setting up the mount namespace of the container and I own your host and
secondly any containment break and I'm privileged with respect to the
fs uid wherever I escape to so I will likewise own your host.

The only way to keep containment is to have a zero fsuid inside the
container corresponding to a non-zero one outside.  And the only way to
solve the imprecision in mount namespace issue is to strictly control
the entry point at which the writing at fsuid 0 becomes active.

James




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 24/25] sys: handle fsid mappings in set*id() calls
  2020-02-18 14:34 ` [PATCH v3 24/25] sys: handle fsid mappings in set*id() calls Christian Brauner
@ 2020-02-19 15:42   ` Jann Horn
  0 siblings, 0 replies; 51+ messages in thread
From: Jann Horn @ 2020-02-19 15:42 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai,
	Stephen Barber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, kernel list, linux-fsdevel, Linux Containers,
	linux-security-module, Linux API

On Tue, Feb 18, 2020 at 3:37 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> Switch set*id() calls to lookup fsids in the fsid mappings. If no fsid mappings
> are setup the behavior is unchanged, i.e. fsids are looked up in the id
> mappings.
[...]
> @@ -353,7 +354,7 @@ long __sys_setregid(gid_t rgid, gid_t egid)
>         const struct cred *old;
>         struct cred *new;
>         int retval;
> -       kgid_t krgid, kegid;
> +       kgid_t krgid, kegid, kfsgid;
>
>         krgid = make_kgid(ns, rgid);
>         kegid = make_kgid(ns, egid);
> @@ -385,12 +386,20 @@ long __sys_setregid(gid_t rgid, gid_t egid)
>                         new->egid = kegid;
>                 else
>                         goto error;
> +               kfsgid = make_kfsgid(ns, egid);
> +       } else {
> +               kfsgid = kgid_to_kfsgid(new->user_ns, new->egid);
> +       }

Here the "kfsgid" is the new filesystem GID as translated by the
special fsgid mapping...

> +       if (!gid_valid(kfsgid)) {
> +               retval = -EINVAL;
> +               goto error;
>         }
>
>         if (rgid != (gid_t) -1 ||
>             (egid != (gid_t) -1 && !gid_eq(kegid, old->gid)))
>                 new->sgid = new->egid;
> -       new->fsgid = new->egid;
> +       new->kfsgid = new->egid;

... but the "kfsgid" of the creds struct is translated by the normal
gid mapping...

> +       new->fsgid = kfsgid;

... and the local "kfsgid" is stored into the "fsgid" member.

This is pretty hard to follow. Can you come up with some naming scheme
that is clearer and where one name is always used for the
normally-translated fsgid and another name is always used for the
specially-translated fsgid? E.g. something like "pfsgid" (with the "p"
standing for "process", because it uses the same id mappings as used
for process identities) for the IDs translated via the normal gid
mapping?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 19/25] commoncap: handle fsid mappings with vfs caps
  2020-02-18 14:34 ` [PATCH v3 19/25] commoncap: handle fsid mappings with vfs caps Christian Brauner
@ 2020-02-19 15:53   ` Jann Horn
  0 siblings, 0 replies; 51+ messages in thread
From: Jann Horn @ 2020-02-19 15:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai,
	Stephen Barber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, kernel list, linux-fsdevel, Linux Containers,
	linux-security-module, Linux API

On Tue, Feb 18, 2020 at 3:35 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> Switch vfs cap helpers to lookup fsids in the fsid mappings. If no fsid
> mappings are setup the behavior is unchanged, i.e. fsids are looked up in the
> id mappings.
[...]
> diff --git a/security/commoncap.c b/security/commoncap.c
[...]
> @@ -328,7 +328,7 @@ static bool rootid_owns_currentns(kuid_t kroot)
>                 return false;
>
>         for (ns = current_user_ns(); ; ns = ns->parent) {
> -               if (from_kuid(ns, kroot) == 0)
> +               if (from_kfsuid(ns, kroot) == 0)
>                         return true;
>                 if (ns == &init_user_ns)
>                         break;

Nit: Maybe change the name of this function to something that makes it
clear that this operates in the fsuid mapping domain.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
  2020-02-18 14:33 ` [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings Christian Brauner
@ 2020-02-19 16:18   ` Jann Horn
  0 siblings, 0 replies; 51+ messages in thread
From: Jann Horn @ 2020-02-19 16:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai,
	Stephen Barber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, kernel list, linux-fsdevel, Linux Containers,
	linux-security-module, Linux API

On Tue, Feb 18, 2020 at 3:35 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> Based on discussions with Jann we decided in order to cleanly handle nested
> user namespaces that fsid mappings can only be written before the corresponding
> id mappings have been written. Writing id mappings before writing the
> corresponding fsid mappings causes fsid mappings to mirror id mappings.
>
> Consider creating a user namespace NS1 with the initial user namespace as
> parent. Assume NS1 receives id mapping 0 100000 100000 and fsid mappings 0
> 300000 100000. Files that root in NS1 will create will map to kfsuid=300000 and
> kfsgid=300000 and will hence be owned by uid=300000 and gid 300000 on-disk in
> the initial user namespace.
> Now assume user namespace NS2 is created in user namespace NS1. Assume that NS2
> receives id mapping 0 10000 65536 and an fsid mapping of 0 10000 65536. Files
> that root in NS2 will create will map to kfsuid=10000 and kfsgid=10000 in NS1.
> hence, files created by NS2 will hence be appear to be be owned by uid=10000
> and gid=10000 on-disk in NS1. Looking at the initial user namespace, files
> created by NS2 will map to kfsuid=310000 and kfsgid=310000 and hence will be
> owned by uid=310000 and gid=310000 on-disk.
[...]
>  static bool new_idmap_permitted(const struct file *file,
>                                 struct user_namespace *ns, int cap_setid,
> -                               struct uid_gid_map *new_map)
> +                               struct uid_gid_map *new_map,
> +                               enum idmap_type idmap_type)
>  {
>         const struct cred *cred = file->f_cred;
> +
> +       /* Don't allow writing fsuid maps when uid maps have been written. */
> +       if (idmap_type == FSUID_MAP && idmap_exists(&ns->uid_map))
> +               return false;
> +
> +       /* Don't allow writing fsgid maps when gid maps have been written. */
> +       if (idmap_type == FSGID_MAP && idmap_exists(&ns->gid_map))
> +               return false;

Why are these checks necessary? Shouldn't an fs*id map have already
been implicitly created?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 09/25] fs: add is_userns_visible() helper
  2020-02-19 12:06     ` Christian Brauner
@ 2020-02-19 17:18       ` Andy Lutomirski
  2020-02-20 14:26         ` Serge E. Hallyn
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2020-02-19 17:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Serge E. Hallyn, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn, smbarber, Seth Forshee, Alexander Viro,
	Alexey Dobriyan, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, LKML, Linux FS Devel, Linux Containers, LSM List,
	Linux API

On Wed, Feb 19, 2020 at 4:06 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Feb 18, 2020 at 08:42:33PM -0600, Serge Hallyn wrote:
> > On Tue, Feb 18, 2020 at 03:33:55PM +0100, Christian Brauner wrote:
> > > Introduce a helper which makes it possible to detect fileystems whose
> > > superblock is visible in multiple user namespace. This currently only
> > > means proc and sys. Such filesystems usually have special semantics so their
> > > behavior will not be changed with the introduction of fsid mappings.
> >
> > Hi,
> >
> > I'm afraid I've got a bit of a hangup about the terminology here.  I
> > *think* what you mean is that SB_I_USERNS_VISIBLE is an fs whose uids are
> > always translated per the id mappings, not fsid mappings.  But when I see
>
> Correct!
>
> > the name it seems to imply that !SB_I_USERNS_VISIBLE filesystems can't
> > be seen by other namespaces at all.
> >
> > Am I right in my first interpretation?  If so, can we talk about the
> > naming?
>
> Yep, your first interpretation is right. What about: wants_idmaps()

Maybe fsidmap_exempt()?

I still haven't convinced myself that any of the above is actually
correct behavior, especially when people do things like creating
setuid binaries.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (26 preceding siblings ...)
  2020-02-19 15:33 ` Jann Horn
@ 2020-02-19 19:35 ` Serge E. Hallyn
  2020-02-19 21:48   ` Serge E. Hallyn
  2020-02-27 19:33 ` Josef Bacik
  28 siblings, 1 reply; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19 19:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn,
	smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

So if I have

/proc/self/uid_map: 0 100000 100000
/proc/self/fsid_map: 1000 1000 1

1. If I read files from the rootfs which have host uid 101000, they
will appear as uid 100 to me?

2. If I read host files with uid 1000, they will appear as uid 1000 to me?

3. If I create a new file, as uid 1000, what will be the inode owning uid?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-19 19:35 ` Serge E. Hallyn
@ 2020-02-19 21:48   ` Serge E. Hallyn
  2020-02-19 21:56     ` Tycho Andersen
  0 siblings, 1 reply; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-19 21:48 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Christian Brauner, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn, smbarber, Seth Forshee, Alexander Viro,
	Alexey Dobriyan, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
> 
> So if I have
> 
> /proc/self/uid_map: 0 100000 100000
> /proc/self/fsid_map: 1000 1000 1

Oh, sorry.  Your explanation in 20/25 i think set me straight, though I need
to think through a few more examples.

...

> 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?

(Note - I edited the quoted txt above to be more precise)

I'm still not quite clear on this.  I believe the fsid mapping will take
precedence so it'll be uid 1000 ?  Per mount behavior would be nice there,
but perhaps unwieldy.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-19 21:48   ` Serge E. Hallyn
@ 2020-02-19 21:56     ` Tycho Andersen
  0 siblings, 0 replies; 51+ messages in thread
From: Tycho Andersen @ 2020-02-19 21:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Christian Brauner, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn, smbarber, Seth Forshee, Alexander Viro,
	Alexey Dobriyan, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api

On Wed, Feb 19, 2020 at 03:48:37PM -0600, Serge E. Hallyn wrote:
> On Wed, Feb 19, 2020 at 01:35:58PM -0600, Serge E. Hallyn wrote:
> > On Tue, Feb 18, 2020 at 03:33:46PM +0100, Christian Brauner wrote:
> > > With fsid mappings we can solve this by writing an id mapping of 0
> > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > > access the kernel will now lookup the mapping for 300000 in the fsid
> > > mapping tables of the user namespace. And since such a mapping exists,
> > > the corresponding files will have correct ownership.
> > 
> > So if I have
> > 
> > /proc/self/uid_map: 0 100000 100000
> > /proc/self/fsid_map: 1000 1000 1
> 
> Oh, sorry.  Your explanation in 20/25 i think set me straight, though I need
> to think through a few more examples.
> 
> ...
> 
> > 3. If I create a new file, as nsuid 1000, what will be the inode owning kuid?
> 
> (Note - I edited the quoted txt above to be more precise)
> 
> I'm still not quite clear on this.  I believe the fsid mapping will take
> precedence so it'll be uid 1000 ?  Per mount behavior would be nice there,
> but perhaps unwieldy.

The is_userns_visible() bits seems to be an attempt at understanding
what people would want per-mount, with a policy hard coded in the
kernel.

But maybe per-mount behavior can be solved more elegantly with shifted
bind mounts, so we can drop all that from this series, and ignore
per-mount settings here?

Tycho

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 09/25] fs: add is_userns_visible() helper
  2020-02-19 17:18       ` Andy Lutomirski
@ 2020-02-20 14:26         ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-02-20 14:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christian Brauner, Serge E. Hallyn, Stéphane Graber,
	Eric W. Biederman, Aleksa Sarai, Jann Horn, smbarber,
	Seth Forshee, Alexander Viro, Alexey Dobriyan, James Morris,
	Kees Cook, Jonathan Corbet, Phil Estes, LKML, Linux FS Devel,
	Linux Containers, LSM List, Linux API

On Wed, Feb 19, 2020 at 09:18:51AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 19, 2020 at 4:06 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Tue, Feb 18, 2020 at 08:42:33PM -0600, Serge Hallyn wrote:
> > > On Tue, Feb 18, 2020 at 03:33:55PM +0100, Christian Brauner wrote:
> > > > Introduce a helper which makes it possible to detect fileystems whose
> > > > superblock is visible in multiple user namespace. This currently only
> > > > means proc and sys. Such filesystems usually have special semantics so their
> > > > behavior will not be changed with the introduction of fsid mappings.
> > >
> > > Hi,
> > >
> > > I'm afraid I've got a bit of a hangup about the terminology here.  I
> > > *think* what you mean is that SB_I_USERNS_VISIBLE is an fs whose uids are
> > > always translated per the id mappings, not fsid mappings.  But when I see
> >
> > Correct!
> >
> > > the name it seems to imply that !SB_I_USERNS_VISIBLE filesystems can't
> > > be seen by other namespaces at all.
> > >
> > > Am I right in my first interpretation?  If so, can we talk about the
> > > naming?
> >
> > Yep, your first interpretation is right. What about: wants_idmaps()
> 
> Maybe fsidmap_exempt()?

Yeah, and maybe SB_USERNS_FSID_EXEMPT ?

> I still haven't convinced myself that any of the above is actually
> correct behavior, especially when people do things like creating
> setuid binaries.

The only place that would be a problem is if the child userns has an
fsidmapping from X to 0 in the parent userns, right?  Yeah I'm sure
many people would ignore all advice to the contrary and do this anyway,
but I would try hard to suggest that people use an intermediary userns
for storing filesystems for the "docker share" case.  So the host fsid
range would start at say 200000.  So a setuid binary would just be
setuid-200000.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
                   ` (27 preceding siblings ...)
  2020-02-19 19:35 ` Serge E. Hallyn
@ 2020-02-27 19:33 ` Josef Bacik
  2020-03-02 14:34   ` Serge E. Hallyn
  28 siblings, 1 reply; 51+ messages in thread
From: Josef Bacik @ 2020-02-27 19:33 UTC (permalink / raw)
  To: Christian Brauner, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn
  Cc: smbarber, Seth Forshee, Alexander Viro, Alexey Dobriyan,
	Serge Hallyn, James Morris, Kees Cook, Jonathan Corbet,
	Phil Estes, linux-kernel, linux-fsdevel, containers,
	linux-security-module, linux-api, mpawlowski

On 2/18/20 9:33 AM, Christian Brauner wrote:
> Hey everyone,
> 
> This is v3 after (off- and online) discussions with Jann the following
> changes were made:
> - To handle nested user namespaces cleanly, efficiently, and with full
>    backwards compatibility for non fsid-mapping aware workloads we only
>    allow writing fsid mappings as long as the corresponding id mapping
>    type has not been written.
> - Split the patch which adds the internal ability in
>    kernel/user_namespace to verify and write fsid mappings into tree
>    patches:
>    1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
>       patch to implement core helpers for fsid translations (i.e.
>       make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
>       k*id_to_kfs*id()
>    2. [PATCH v3 05/25] user_namespace: refactor map_write()
>       patch to refactor map_write() in order to prepare for actual fsid
>       mappings changes in the following patch. (This should make it
>       easier to review.)
>    3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
>       patch to implement actual fsid mappings support in mape_write()
> - Let the keyctl infrastructure only operate on kfsid which are always
>    mapped/looked up in the id mappings similar to what we do for
>    filesystems that have the same superblock visible in multiple user
>    namespaces.
> 
> This version also comes with minimal tests which I intend to expand in
> the future.
> 
>  From pings and off-list questions and discussions at Google Container
> Security Summit there seems to be quite a lot of interest in this
> patchset with use-cases ranging from layer sharing for app containers
> and k8s, as well as data sharing between containers with different id
> mappings. I haven't Cced all people because I don't have all the email
> adresses at hand but I've at least added Phil now. :)
> 
I put this into a kernel for our container guys to mess with in order to 
validate it would actually be useful for real world uses.  I've cc'ed the guy 
who did all of the work in case you have specific questions.

Good news is the interface is acceptable, albeit apparently the whole user ns 
interface sucks in general.  But you haven't made it worse, so success!

But in testing it there appears to be a problem with tmpfs?  Our applications 
will use shared memory segments for certain things and it apparently breaks this 
in interesting ways, it appears to not shift the UID appropriately on tmpfs. 
This seems to be relatively straightforward to reproduce, but if you have 
trouble let me know and I'll come up with a shell script that reproduces the 
problem.

We are happy to continue testing these patches to make sure they're working in 
our container setup, if you want to CC me on future submissions I can build them 
for our internal testing and validate them as well.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v3 00/25] user_namespace: introduce fsid mappings
  2020-02-27 19:33 ` Josef Bacik
@ 2020-03-02 14:34   ` Serge E. Hallyn
  0 siblings, 0 replies; 51+ messages in thread
From: Serge E. Hallyn @ 2020-03-02 14:34 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Christian Brauner, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn, smbarber, Seth Forshee, Alexander Viro,
	Alexey Dobriyan, Serge Hallyn, James Morris, Kees Cook,
	Jonathan Corbet, Phil Estes, linux-kernel, linux-fsdevel,
	containers, linux-security-module, linux-api, mpawlowski

On Thu, Feb 27, 2020 at 02:33:04PM -0500, Josef Bacik wrote:
> On 2/18/20 9:33 AM, Christian Brauner wrote:
> > Hey everyone,
> > 
> > This is v3 after (off- and online) discussions with Jann the following
> > changes were made:
> > - To handle nested user namespaces cleanly, efficiently, and with full
> >    backwards compatibility for non fsid-mapping aware workloads we only
> >    allow writing fsid mappings as long as the corresponding id mapping
> >    type has not been written.
> > - Split the patch which adds the internal ability in
> >    kernel/user_namespace to verify and write fsid mappings into tree
> >    patches:
> >    1. [PATCH v3 04/25] fsuidgid: add fsid mapping helpers
> >       patch to implement core helpers for fsid translations (i.e.
> >       make_kfs*id(), from_kfs*id{_munged}(), kfs*id_to_k*id(),
> >       k*id_to_kfs*id()
> >    2. [PATCH v3 05/25] user_namespace: refactor map_write()
> >       patch to refactor map_write() in order to prepare for actual fsid
> >       mappings changes in the following patch. (This should make it
> >       easier to review.)
> >    3. [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings
> >       patch to implement actual fsid mappings support in mape_write()
> > - Let the keyctl infrastructure only operate on kfsid which are always
> >    mapped/looked up in the id mappings similar to what we do for
> >    filesystems that have the same superblock visible in multiple user
> >    namespaces.
> > 
> > This version also comes with minimal tests which I intend to expand in
> > the future.
> > 
> >  From pings and off-list questions and discussions at Google Container
> > Security Summit there seems to be quite a lot of interest in this
> > patchset with use-cases ranging from layer sharing for app containers
> > and k8s, as well as data sharing between containers with different id
> > mappings. I haven't Cced all people because I don't have all the email
> > adresses at hand but I've at least added Phil now. :)
> > 
> I put this into a kernel for our container guys to mess with in order to
> validate it would actually be useful for real world uses.  I've cc'ed the
> guy who did all of the work in case you have specific questions.
> 
> Good news is the interface is acceptable, albeit apparently the whole user
> ns interface sucks in general.  But you haven't made it worse, so success!

Well I very much disagree here :)  With the first part!  But I do
understand the shortcomings.  Anyway,

I still hope we get to talk about this in person, but IMO this is the
right approach (this being - thinking about how to make the uid mappings
more flexible without making them too complicated to be safe to use),
but a bit too static in terms of target.  There are at least two ways
that I could see usefully generalizing it

From a user space pov, the following goal is indespensible (for my use
cases):  that the fsuid be selectable based on fs, mountpoint, or file
context (as in selinux).

From a userns pov, one way to look at it is this:  when task t1 signals
task t2, it's not only t1's namespace that's considered when filling in
the sender uid, but also t2's.  Likewise, when writing a file, we should
consider both t1's fsuid+userns, and the file's, mount's, or filesystem's
userns.

From that POV, your patch is a step in the right direction and could be
taken as is (modulo any tmpfs fix Josef needs :)  From there I would
propose adding a 'userns=<uidnsfd>' bind mount option, so we could create
an empty userns with the desired mapping (subject to permissions granted
by subuids), get an fd to the uidns, and say

	mount --bind -o uidns=5 /shared /containers/c1/mnt/shared

So now when I write a file /etc/hosts as container fsuid 0, it'll be
subject to the container rootfs mount's uid mapping, presumably
100000.  When I write /mnt/shared/hello, it'll be subject to the mount's
uid mapping, which might be 1000.

-serge

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2020-03-02 14:34 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-18 14:33 [PATCH v3 00/25] user_namespace: introduce fsid mappings Christian Brauner
2020-02-18 14:33 ` [PATCH v3 01/25] user_namespace: introduce fsid mappings infrastructure Christian Brauner
2020-02-19  2:33   ` Serge E. Hallyn
2020-02-18 14:33 ` [PATCH v3 02/25] proc: add /proc/<pid>/fsuid_map Christian Brauner
2020-02-19  2:33   ` Serge E. Hallyn
2020-02-18 14:33 ` [PATCH v3 03/25] proc: add /proc/<pid>/fsgid_map Christian Brauner
2020-02-19  2:33   ` Serge E. Hallyn
2020-02-18 14:33 ` [PATCH v3 04/25] fsuidgid: add fsid mapping helpers Christian Brauner
2020-02-18 14:33 ` [PATCH v3 05/25] user_namespace: refactor map_write() Christian Brauner
2020-02-19  2:35   ` Serge E. Hallyn
2020-02-18 14:33 ` [PATCH v3 06/25] user_namespace: make map_write() support fsid mappings Christian Brauner
2020-02-19 16:18   ` Jann Horn
2020-02-18 14:33 ` [PATCH v3 07/25] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
2020-02-19  2:36   ` Serge E. Hallyn
2020-02-18 14:33 ` [PATCH v3 08/25] cred: add kfs{g,u}id Christian Brauner
2020-02-18 14:33 ` [PATCH v3 09/25] fs: add is_userns_visible() helper Christian Brauner
2020-02-19  2:42   ` Serge E. Hallyn
2020-02-19 12:06     ` Christian Brauner
2020-02-19 17:18       ` Andy Lutomirski
2020-02-20 14:26         ` Serge E. Hallyn
2020-02-18 14:33 ` [PATCH v3 10/25] namei: may_{o_}create(): handle fsid mappings Christian Brauner
2020-02-18 14:33 ` [PATCH v3 11/25] inode: inode_owner_or_capable(): " Christian Brauner
2020-02-18 22:25   ` Christoph Hellwig
2020-02-19 12:29     ` Christian Brauner
2020-02-18 14:33 ` [PATCH v3 12/25] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
2020-02-18 14:33 ` [PATCH v3 13/25] stat: " Christian Brauner
2020-02-18 14:34 ` [PATCH v3 14/25] open: " Christian Brauner
2020-02-18 14:34 ` [PATCH v3 15/25] posix_acl: " Christian Brauner
2020-02-18 22:26   ` Christoph Hellwig
2020-02-19 12:56     ` Christian Brauner
2020-02-18 14:34 ` [PATCH v3 16/25] attr: notify_change(): " Christian Brauner
2020-02-18 14:34 ` [PATCH v3 17/25] commoncap: cap_bprm_set_creds(): " Christian Brauner
2020-02-18 14:34 ` [PATCH v3 18/25] commoncap: cap_task_fix_setuid(): " Christian Brauner
2020-02-18 14:34 ` [PATCH v3 19/25] commoncap: handle fsid mappings with vfs caps Christian Brauner
2020-02-19 15:53   ` Jann Horn
2020-02-18 14:34 ` [PATCH v3 20/25] exec: bprm_fill_uid(): handle fsid mappings Christian Brauner
2020-02-18 14:34 ` [PATCH v3 21/25] ptrace: adapt ptrace_may_access() to always uses unmapped fsids Christian Brauner
2020-02-18 14:34 ` [PATCH v3 22/25] devpts: handle fsid mappings Christian Brauner
2020-02-18 14:34 ` [PATCH v3 23/25] keys: " Christian Brauner
2020-02-18 14:34 ` [PATCH v3 24/25] sys: handle fsid mappings in set*id() calls Christian Brauner
2020-02-19 15:42   ` Jann Horn
2020-02-18 14:34 ` [PATCH v3 25/25] selftests: add simple fsid mapping selftests Christian Brauner
2020-02-18 23:50 ` [PATCH v3 00/25] user_namespace: introduce fsid mappings James Bottomley
2020-02-19 12:27   ` Christian Brauner
2020-02-19 15:36     ` James Bottomley
2020-02-19 15:33 ` Jann Horn
2020-02-19 19:35 ` Serge E. Hallyn
2020-02-19 21:48   ` Serge E. Hallyn
2020-02-19 21:56     ` Tycho Andersen
2020-02-27 19:33 ` Josef Bacik
2020-03-02 14:34   ` Serge E. Hallyn

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.