Linux-Fsdevel Archive on lore.kernel.org
* [PATCH 00/24] user_namespace: introduce fsid mappings
@ 2020-02-11 16:57 Christian Brauner
  2020-02-11 16:57 ` [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure Christian Brauner
                   ` (24 more replies)
  0 siblings, 25 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Hey everyone,

This is the implementation of shiftfs which was cooked up during lunch at
Linux Plumbers 2019 the day after the containers microconference. The
idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
we all were quite busy with other work and couldn't really sit down and
implement it. But I took a few days last week to do this work, including
demos and performance testing.
This implementation does not require us to touch the vfs substantially
at all. Instead, we implement shiftfs via fsid mappings.
With this patch, it took me 20 mins to port both LXD and LXC to support
shiftfs via fsid mappings.

For anyone wanting to play with this the branch can be pulled from:
https://github.com/brauner/linux/tree/fsid_mappings
https://gitlab.com/brauner/linux/-/tree/fsid_mappings
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings

The main use case for shiftfs for us is in allowing shared writable
storage to multiple containers using non-overlapping id mappings.
In such a scenario you want the fsids to be valid and identical in both
containers for the shared mount. A demo for this exists in [3].
If you don't want to read on, go straight to the other demos below in
[1] and [2].

People not as familiar with user namespaces might not be aware that fsid
mappings already exist. Right now, fsid mappings are always identical to
id mappings. Specifically, the kernel will look up fsuids in the uid
mappings and fsgids in the gid mappings of the relevant user namespace.

With this patch series we simply introduce the ability to create fsid
mappings that are different from the id mappings of a user namespace.

In the usual case of running an unprivileged container we will have
set up an id mapping, e.g. 0 100000 100000. The on-disk mapping will
correspond to this id mapping, i.e. all files which we want to appear as
0:0 inside the user namespace will be chowned to 100000:100000 on the
host. This works because whenever the kernel needs to do a filesystem
access it will look up the corresponding uid and gid in the id mapping
tables of the container.
Now think about the case where we want to have an id mapping of 0 100000
100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
share a single on-disk mapping with multiple containers that all have
different id mappings.
This will be problematic. Whenever a filesystem access is requested, the
kernel will now try to look up a mapping for 300000 in the id mapping
tables of the user namespace, but since there is none, the files will
appear to be owned by the overflow id, i.e. usually 65534:65534 or
nobody:nogroup.

With fsid mappings we can solve this by writing an id mapping of 0
100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
access the kernel will now look up the mapping for 300000 in the fsid
mapping tables of the user namespace. And since such a mapping exists,
the corresponding files will have correct ownership.
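
The difference between the two lookups can be sketched in user space. This
is only an illustration under stated assumptions, not kernel code: struct
extent, map_up_or_overflow(), OVERFLOW_ID, id_map, and fsid_map are
hypothetical names, and the linear scan stands in for the kernel's real
lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mapping line: "first lower_first count". */
struct extent {
	uint32_t first;       /* first id as seen inside the user namespace */
	uint32_t lower_first; /* first id on the host, i.e. on disk */
	uint32_t count;       /* number of ids in the range */
};

#define OVERFLOW_ID 65534u /* the usual overflow id named in the text */

/*
 * Translate an on-disk id into the namespace view. Returns OVERFLOW_ID
 * when no extent covers host_id.
 */
static uint32_t map_up_or_overflow(const struct extent *map, size_t n,
				   uint32_t host_id)
{
	assert(map != NULL);

	for (size_t i = 0; i < n; i++) {
		if (host_id >= map[i].lower_first &&
		    host_id - map[i].lower_first < map[i].count)
			return map[i].first + (host_id - map[i].lower_first);
	}
	return OVERFLOW_ID;
}

/* The two mappings from the example above. */
static const struct extent id_map[]   = { { 0, 100000, 100000 } };
static const struct extent fsid_map[] = { { 0, 300000, 100000 } };
```

With these tables, a file owned by 300000 on disk resolves to the overflow
id through id_map and to 0 through fsid_map, which is the behavior the two
paragraphs above describe.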

A note on proc (and sys): the proc filesystem is special insofar as it
only has a single superblock that is (currently, though this might be
about to change) visible in all user namespaces (same goes for sys). This
means it has special semantics in many ways, including how file ownership
and access works. The fsid mapping implementation does not alter how proc
(and sys) ownership works. proc and sys will both continue to look up
ids for filesystem access in the id mapping tables.

When writing fsid mappings the same rules apply as when writing id
mappings so I won't reiterate them here. The limit of fsid mappings is
the same as for id mappings, i.e. 340 lines.

# Performance
Back when I extended the range of possible id mappings to 340 I did
performance testing by booting into single user mode, creating 1,000,000
files, fstat()ing them, and calculating the mean fstat() time per file.
(Back when Linux was still fast. I won't mention that the stat
 numbers have (thanks microcode!) doubled since then...)
I did the same test for this patchset: one vanilla kernel, one kernel
with my fsid mapping patches but CONFIG_USER_NS_FSID set to n and one
with fsid mappings patches enabled. I then ran the same test on all
three kernels and compared the numbers. The implementation does not
introduce overhead. That's all I can say. Here are the numbers:

             | vanilla v5.5 | fsid mappings       | fsid mappings      | fsid mappings      |
             |              | disabled in Kconfig | enabled in Kconfig | enabled in Kconfig |
             |              |                     | and unset for all  | and set for all    |
             |              |                     | test cases         | test cases         |
-------------|--------------|---------------------|--------------------|--------------------|
 0  mappings |       367 ns |              365 ns |             365 ns |             N/A    |
 1  mappings |       362 ns |              367 ns |             363 ns |             363 ns |
 2  mappings |       361 ns |              369 ns |             363 ns |             364 ns |
 3  mappings |       361 ns |              368 ns |             366 ns |             365 ns |
 5  mappings |       365 ns |              368 ns |             363 ns |             365 ns |
 10 mappings |       391 ns |              388 ns |             387 ns |             389 ns |
 50 mappings |       395 ns |              398 ns |             401 ns |             397 ns |
100 mappings |       400 ns |              405 ns |             399 ns |             399 ns |
200 mappings |       404 ns |              407 ns |             430 ns |             404 ns |
300 mappings |       492 ns |              494 ns |             432 ns |             413 ns |
340 mappings |       495 ns |              497 ns |             500 ns |             484 ns |
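
For readers who want to approximate this kind of measurement, the following
is a rough sketch and not the test program used for the numbers above.
mean_fstat_ns() is a hypothetical helper. It creates and stats one file at
a time rather than creating all files up front, it does not set up any user
namespace or mappings, and the two clock_gettime() calls around each
fstat() add measurable overhead of their own.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Create nfiles empty files under dir, fstat() each one once, and return
 * the mean fstat() latency in nanoseconds, or -1.0 on error. Each file is
 * removed again right after it has been measured.
 */
static double mean_fstat_ns(const char *dir, int nfiles)
{
	double total_ns = 0.0;
	char path[4096];

	assert(dir != NULL && nfiles > 0);

	for (int i = 0; i < nfiles; i++) {
		struct timespec start, end;
		struct stat st;
		int fd, ret;

		snprintf(path, sizeof(path), "%s/fstat-bench-%d-%d",
			 dir, (int)getpid(), i);
		fd = open(path, O_CREAT | O_RDWR | O_CLOEXEC, 0600);
		if (fd < 0)
			return -1.0;

		/* Only the fstat() call itself is timed. */
		clock_gettime(CLOCK_MONOTONIC, &start);
		ret = fstat(fd, &st);
		clock_gettime(CLOCK_MONOTONIC, &end);

		close(fd);
		unlink(path);
		if (ret < 0)
			return -1.0;

		total_ns += (double)(end.tv_sec - start.tv_sec) * 1e9 +
			    (double)(end.tv_nsec - start.tv_nsec);
	}

	return total_ns / nfiles;
}
```

Running it inside user namespaces with different numbers of mapping lines,
on each of the kernels compared above, would give figures of the same kind
as the table, though not directly comparable ones.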

# Demos
[1]: Create a container with different id and fsid mappings.
     https://asciinema.org/a/300233 
[2]: Create a container with id mappings but without fsid mappings.
     https://asciinema.org/a/300234
[3]: Share storage between multiple containers with non-overlapping id
     mappings.
     https://asciinema.org/a/300235

Thanks!
Christian

Christian Brauner (24):
  user_namespace: introduce fsid mappings infrastructure
  proc: add /proc/<pid>/fsuid_map
  proc: add /proc/<pid>/fsgid_map
  fsuidgid: add fsid mapping helpers
  proc: task_state(): use from_kfs{g,u}id_munged
  fs: add is_userns_visible() helper
  namei: may_{o_}create(): handle fsid mappings
  inode: inode_owner_or_capable(): handle fsid mappings
  capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  stat: handle fsid mappings
  open: chown_common(): handle fsid mappings
  posix_acl: handle fsid mappings
  attr: notify_change(): handle fsid mappings
  commoncap: cap_task_fix_setuid(): handle fsid mappings
  commoncap:cap_bprm_set_creds(): handle fsid mappings
  sys: __sys_setfsuid(): handle fsid mappings
  sys: __sys_setfsgid(): handle fsid mappings
  sys:__sys_setuid(): handle fsid mappings
  sys:__sys_setgid(): handle fsid mappings
  sys:__sys_setreuid(): handle fsid mappings
  sys:__sys_setregid(): handle fsid mappings
  sys:__sys_setresuid(): handle fsid mappings
  sys:__sys_setresgid(): handle fsid mappings
  devpts: handle fsid mappings

 fs/attr.c                      |  23 ++-
 fs/devpts/inode.c              |   7 +-
 fs/inode.c                     |   7 +-
 fs/namei.c                     |  21 ++-
 fs/open.c                      |  10 +-
 fs/posix_acl.c                 |  21 +--
 fs/proc/array.c                |   5 +-
 fs/proc/base.c                 |  34 ++++
 fs/stat.c                      |  48 ++++--
 include/linux/fs.h             |   5 +
 include/linux/fsuidgid.h       |  70 ++++++++
 include/linux/stat.h           |   1 +
 include/linux/user_namespace.h |  10 ++
 init/Kconfig                   |  11 ++
 kernel/capability.c            |  13 +-
 kernel/sys.c                   |  83 ++++++---
 kernel/user.c                  |  22 +++
 kernel/user_namespace.c        | 303 ++++++++++++++++++++++++++++++++-
 security/commoncap.c           |  19 ++-
 19 files changed, 638 insertions(+), 75 deletions(-)
 create mode 100644 include/linux/fsuidgid.h


base-commit: d5226fa6dbae0569ee43ecfc08bdcd6770fc4755
-- 
2.25.0



* [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 17:26   ` Randy Dunlap
  2020-02-11 16:57 ` [PATCH 02/24] proc: add /proc/<pid>/fsuid_map Christian Brauner
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

This introduces the infrastructure to set up fsid mappings which will be used in
later patches.
All new code depends on CONFIG_USER_NS_FSID=y. It currently defaults to "N".
If CONFIG_USER_NS_FSID is not set, no new code is added.

In this patch fsuid_m_show() and fsgid_m_show() are introduced. They are
identical to uid_m_show() and gid_m_show() until we introduce from_kfsuid() and
from_kfsgid() in a follow-up patch.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/user_namespace.h |  10 +++
 init/Kconfig                   |  11 +++
 kernel/user.c                  |  22 ++++++
 kernel/user_namespace.c        | 122 +++++++++++++++++++++++++++++++++
 4 files changed, 165 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index fb9f4f799554..ffbd5e7e5ec7 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -55,6 +55,10 @@ enum ucount_type {
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
+#ifdef CONFIG_USER_NS_FSID
+	struct uid_gid_map	fsuid_map;
+	struct uid_gid_map	fsgid_map;
+#endif
 	struct uid_gid_map	projid_map;
 	atomic_t		count;
 	struct user_namespace	*parent;
@@ -126,6 +130,12 @@ struct seq_operations;
 extern const struct seq_operations proc_uid_seq_operations;
 extern const struct seq_operations proc_gid_seq_operations;
 extern const struct seq_operations proc_projid_seq_operations;
+#ifdef CONFIG_USER_NS_FSID
+extern const struct seq_operations proc_fsuid_seq_operations;
+extern const struct seq_operations proc_fsgid_seq_operations;
+extern ssize_t proc_fsuid_map_write(struct file *, const char __user *, size_t, loff_t *);
+extern ssize_t proc_fsgid_map_write(struct file *, const char __user *, size_t, loff_t *);
+#endif
 extern ssize_t proc_uid_map_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t proc_gid_map_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t proc_projid_map_write(struct file *, const char __user *, size_t, loff_t *);
diff --git a/init/Kconfig b/init/Kconfig
index a34064a031a5..4da082e4f787 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1102,6 +1102,17 @@ config USER_NS
 
 	  If unsure, say N.
 
+config USER_NS_FSID
+	bool "User namespace fsid mappings"
+	depends on USER_NS
+	default n
+	help
+	  This allows containers to alter their filesystem id mappings.
+	  With this, containers with different id mappings can still share
+	  the same filesystem.
+
+	  If unsure, say N.
+
 config PID_NS
 	bool "PID Namespaces"
 	default y
diff --git a/kernel/user.c b/kernel/user.c
index 5235d7f49982..2ccaea9b810b 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -55,6 +55,28 @@ struct user_namespace init_user_ns = {
 			},
 		},
 	},
+#ifdef CONFIG_USER_NS_FSID
+	.fsuid_map = {
+		.nr_extents = 1,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
+		},
+	},
+	.fsgid_map = {
+		.nr_extents = 1,
+		{
+			.extent[0] = {
+				.first = 0,
+				.lower_first = 0,
+				.count = 4294967295U,
+			},
+		},
+	},
+#endif
 	.count = ATOMIC_INIT(3),
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 8eadadc478f9..cbdf456f95f0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -191,6 +191,16 @@ static void free_user_ns(struct work_struct *work)
 			kfree(ns->projid_map.forward);
 			kfree(ns->projid_map.reverse);
 		}
+#ifdef CONFIG_USER_NS_FSID
+		if (ns->fsgid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->fsgid_map.forward);
+			kfree(ns->fsgid_map.reverse);
+		}
+		if (ns->fsuid_map.nr_extents > UID_GID_MAP_MAX_BASE_EXTENTS) {
+			kfree(ns->fsuid_map.forward);
+			kfree(ns->fsuid_map.reverse);
+		}
+#endif
 		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
@@ -637,6 +647,50 @@ static int projid_m_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_USER_NS_FSID
+static int fsuid_m_show(struct seq_file *seq, void *v)
+{
+	struct user_namespace *ns = seq->private;
+	struct uid_gid_extent *extent = v;
+	struct user_namespace *lower_ns;
+	uid_t lower;
+
+	lower_ns = seq_user_ns(seq);
+	if ((lower_ns == ns) && lower_ns->parent)
+		lower_ns = lower_ns->parent;
+
+	lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
+
+	seq_printf(seq, "%10u %10u %10u\n",
+		extent->first,
+		lower,
+		extent->count);
+
+	return 0;
+}
+
+static int fsgid_m_show(struct seq_file *seq, void *v)
+{
+	struct user_namespace *ns = seq->private;
+	struct uid_gid_extent *extent = v;
+	struct user_namespace *lower_ns;
+	gid_t lower;
+
+	lower_ns = seq_user_ns(seq);
+	if ((lower_ns == ns) && lower_ns->parent)
+		lower_ns = lower_ns->parent;
+
+	lower = from_kgid(lower_ns, KGIDT_INIT(extent->lower_first));
+
+	seq_printf(seq, "%10u %10u %10u\n",
+		extent->first,
+		lower,
+		extent->count);
+
+	return 0;
+}
+#endif
+
 static void *m_start(struct seq_file *seq, loff_t *ppos,
 		     struct uid_gid_map *map)
 {
@@ -674,6 +728,22 @@ static void *projid_m_start(struct seq_file *seq, loff_t *ppos)
 	return m_start(seq, ppos, &ns->projid_map);
 }
 
+#ifdef CONFIG_USER_NS_FSID
+static void *fsuid_m_start(struct seq_file *seq, loff_t *ppos)
+{
+	struct user_namespace *ns = seq->private;
+
+	return m_start(seq, ppos, &ns->fsuid_map);
+}
+
+static void *fsgid_m_start(struct seq_file *seq, loff_t *ppos)
+{
+	struct user_namespace *ns = seq->private;
+
+	return m_start(seq, ppos, &ns->fsgid_map);
+}
+#endif
+
 static void *m_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	(*pos)++;
@@ -706,6 +776,22 @@ const struct seq_operations proc_projid_seq_operations = {
 	.show = projid_m_show,
 };
 
+#ifdef CONFIG_USER_NS_FSID
+const struct seq_operations proc_fsuid_seq_operations = {
+	.start = fsuid_m_start,
+	.stop = m_stop,
+	.next = m_next,
+	.show = fsuid_m_show,
+};
+
+const struct seq_operations proc_fsgid_seq_operations = {
+	.start = fsgid_m_start,
+	.stop = m_stop,
+	.next = m_next,
+	.show = fsgid_m_show,
+};
+#endif
+
 static bool mappings_overlap(struct uid_gid_map *new_map,
 			     struct uid_gid_extent *extent)
 {
@@ -1081,6 +1167,42 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
 			 &ns->projid_map, &ns->parent->projid_map);
 }
 
+#ifdef CONFIG_USER_NS_FSID
+ssize_t proc_fsuid_map_write(struct file *file, const char __user *buf,
+			     size_t size, loff_t *ppos)
+{
+	struct seq_file *seq = file->private_data;
+	struct user_namespace *ns = seq->private;
+	struct user_namespace *seq_ns = seq_user_ns(seq);
+
+	if (!ns->parent)
+		return -EPERM;
+
+	if ((seq_ns != ns) && (seq_ns != ns->parent))
+		return -EPERM;
+
+	return map_write(file, buf, size, ppos, CAP_SETUID, &ns->fsuid_map,
+			 &ns->parent->fsuid_map);
+}
+
+ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
+			     size_t size, loff_t *ppos)
+{
+	struct seq_file *seq = file->private_data;
+	struct user_namespace *ns = seq->private;
+	struct user_namespace *seq_ns = seq_user_ns(seq);
+
+	if (!ns->parent)
+		return -EPERM;
+
+	if ((seq_ns != ns) && (seq_ns != ns->parent))
+		return -EPERM;
+
+	return map_write(file, buf, size, ppos, CAP_SETGID, &ns->fsgid_map,
+			 &ns->parent->fsgid_map);
+}
+#endif
+
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
 				struct uid_gid_map *new_map)
-- 
2.25.0



* [PATCH 02/24] proc: add /proc/<pid>/fsuid_map
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
  2020-02-11 16:57 ` [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 16:57 ` [PATCH 03/24] proc: add /proc/<pid>/fsgid_map Christian Brauner
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

The /proc/<pid>/fsuid_map file can be written to once to set up an fsuid mapping
for a user namespace. Writing to this file has the same restrictions as writing
to /proc/<pid>/uid_map:

root@e1-vm:/# cat /proc/13023/fsuid_map
         0     300000     100000

Fsid mappings have always been around. They are currently always identical to
the id mappings for a user namespace. This means that, currently, whenever an
fsid needs to be looked up the kernel will use the id mapping of the user
namespace. With the introduction of fsid mappings the kernel will now look up
fsids in the fsid mappings of the user namespace. If no fsid mapping exists the
kernel will continue looking up fsids in the id mappings of the user namespace.
Hence, if a system supports fsid mappings through /proc/<pid>/fs*id_map and a
container runtime is not aware of fsid mappings or does not use them, it will
continue to work just as before.
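
As an illustration of how a container runtime might write such a mapping,
here is a minimal user-space sketch. write_id_map() is a hypothetical
helper, not part of this series. It issues a single write() because the
file can only be written to once, and it assumes the caller already
satisfies the permission rules that apply to uid_map.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write one mapping line "first lower_first count" to an id-map style
 * file using a single write(). Returns 0 on success, -1 on failure.
 */
static int write_id_map(const char *path, unsigned int first,
			unsigned int lower_first, unsigned int count)
{
	char line[64];
	ssize_t written;
	int len, fd;

	assert(path != NULL);

	len = snprintf(line, sizeof(line), "%u %u %u\n",
		       first, lower_first, count);
	if (len < 0 || (size_t)len >= sizeof(line))
		return -1;

	fd = open(path, O_WRONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	written = write(fd, line, (size_t)len);
	close(fd);

	return written == (ssize_t)len ? 0 : -1;
}
```

A caller would pass a path such as /proc/<pid>/fsuid_map with the values
0, 300000, 100000 to produce the mapping shown above. That file only
exists on a kernel with this series applied and CONFIG_USER_NS_FSID=y.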

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/proc/base.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..ad5f6adc9344 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2879,6 +2879,13 @@ static int proc_projid_map_open(struct inode *inode, struct file *file)
 	return proc_id_map_open(inode, file, &proc_projid_seq_operations);
 }
 
+#ifdef CONFIG_USER_NS_FSID
+static int proc_fsuid_map_open(struct inode *inode, struct file *file)
+{
+	return proc_id_map_open(inode, file, &proc_fsuid_seq_operations);
+}
+#endif
+
 static const struct file_operations proc_uid_map_operations = {
 	.open		= proc_uid_map_open,
 	.write		= proc_uid_map_write,
@@ -2903,6 +2910,16 @@ static const struct file_operations proc_projid_map_operations = {
 	.release	= proc_id_map_release,
 };
 
+#ifdef CONFIG_USER_NS_FSID
+static const struct file_operations proc_fsuid_map_operations = {
+	.open		= proc_fsuid_map_open,
+	.write		= proc_fsuid_map_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_id_map_release,
+};
+#endif
+
 static int proc_setgroups_open(struct inode *inode, struct file *file)
 {
 	struct user_namespace *ns = NULL;
@@ -3079,6 +3096,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	ONE("io",	S_IRUSR, proc_tgid_io_accounting),
 #endif
 #ifdef CONFIG_USER_NS
+#ifdef CONFIG_USER_NS_FSID
+	REG("fsuid_map",  S_IRUGO|S_IWUSR, proc_fsuid_map_operations),
+#endif
 	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
 	REG("projid_map", S_IRUGO|S_IWUSR, proc_projid_map_operations),
-- 
2.25.0



* [PATCH 03/24] proc: add /proc/<pid>/fsgid_map
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
  2020-02-11 16:57 ` [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure Christian Brauner
  2020-02-11 16:57 ` [PATCH 02/24] proc: add /proc/<pid>/fsuid_map Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 16:57 ` [PATCH 04/24] fsuidgid: add fsid mapping helpers Christian Brauner
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

The /proc/<pid>/fsgid_map file can be written to once to set up an fsgid mapping
for a user namespace. Writing to this file has the same restrictions as writing
to /proc/<pid>/gid_map:

root@e1-vm:/# cat /proc/13023/fsgid_map
         0     300000     100000

Fsid mappings have always been around. They are currently always identical to
the id mappings for a user namespace. This means that, currently, whenever an
fsid needs to be looked up the kernel will use the id mapping of the user
namespace. With the introduction of fsid mappings the kernel will now look up
fsids in the fsid mappings of the user namespace. If no fsid mapping exists the
kernel will continue looking up fsids in the id mappings of the user namespace.
Hence, if a system supports fsid mappings through /proc/<pid>/fs*id_map and a
container runtime is not aware of fsid mappings or does not use them, it will
continue to work just as before.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/proc/base.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ad5f6adc9344..e085ad579604 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2884,6 +2884,11 @@ static int proc_fsuid_map_open(struct inode *inode, struct file *file)
 {
 	return proc_id_map_open(inode, file, &proc_fsuid_seq_operations);
 }
+
+static int proc_fsgid_map_open(struct inode *inode, struct file *file)
+{
+	return proc_id_map_open(inode, file, &proc_fsgid_seq_operations);
+}
 #endif
 
 static const struct file_operations proc_uid_map_operations = {
@@ -2918,6 +2923,14 @@ static const struct file_operations proc_fsuid_map_operations = {
 	.llseek		= seq_lseek,
 	.release	= proc_id_map_release,
 };
+
+static const struct file_operations proc_fsgid_map_operations = {
+	.open		= proc_fsgid_map_open,
+	.write		= proc_fsgid_map_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_id_map_release,
+};
 #endif
 
 static int proc_setgroups_open(struct inode *inode, struct file *file)
@@ -3098,6 +3111,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_USER_NS
 #ifdef CONFIG_USER_NS_FSID
 	REG("fsuid_map",  S_IRUGO|S_IWUSR, proc_fsuid_map_operations),
+	REG("fsgid_map",  S_IRUGO|S_IWUSR, proc_fsgid_map_operations),
 #endif
 	REG("uid_map",    S_IRUGO|S_IWUSR, proc_uid_map_operations),
 	REG("gid_map",    S_IRUGO|S_IWUSR, proc_gid_map_operations),
-- 
2.25.0



* [PATCH 04/24] fsuidgid: add fsid mapping helpers
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (2 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 03/24] proc: add /proc/<pid>/fsgid_map Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 16:57 ` [PATCH 05/24] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

This adds a set of helpers to translate between kfsuid/kfsgid and their
userspace fsuid/fsgid counterparts relative to a given user namespace.

- kuid_t make_kfsuid(struct user_namespace *from, uid_t fsuid)
  Maps a user-namespace fsuid pair into a kfsuid.
  If no fsuid mappings have been written it behaves identically to calling
  make_kuid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- kgid_t make_kfsgid(struct user_namespace *from, gid_t fsgid)
  Maps a user-namespace fsgid pair into a kfsgid.
  If no fsgid mappings have been written it behaves identically to calling
  make_kgid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- uid_t from_kfsuid(struct user_namespace *to, kuid_t fsuid)
  Creates an fsuid from a kfsuid user-namespace pair if possible.
  If no fsuid mappings have been written it behaves identically to calling
  from_kuid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- gid_t from_kfsgid(struct user_namespace *to, kgid_t fsgid)
  Creates an fsgid from a kfsgid user-namespace pair if possible.
  If no fsgid mappings have been written it behaves identically to calling
  from_kgid(). This ensures backwards compatibility for workloads unaware
  or not in need of fsid mappings.

- uid_t from_kfsuid_munged(struct user_namespace *to, kuid_t fsuid)
  Always creates an fsuid from a kfsuid user-namespace pair.
  If no fsuid mappings have been written it behaves identically to calling
  from_kuid_munged(). This ensures backwards compatibility for workloads
  unaware or not in need of fsid mappings.

- gid_t from_kfsgid_munged(struct user_namespace *to, kgid_t fsgid)
  Always creates an fsgid from a kfsgid user-namespace pair.
  If no fsgid mappings have been written it behaves identically to calling
  from_kgid_munged(). This ensures backwards compatibility for workloads
  unaware or not in need of fsid mappings.

- bool kfsuid_has_mapping(struct user_namespace *ns, kuid_t uid)
  Check whether this kfsuid has a mapping in the provided user namespace.
  If no fsuid mappings have been written it behaves identically to calling
  kuid_has_mapping(). This ensures backwards compatibility for workloads
  unaware or not in need of fsid mappings.

- bool kfsgid_has_mapping(struct user_namespace *ns, kgid_t gid)
  Check whether this kfsgid has a mapping in the provided user namespace.
  If no fsgid mappings have been written it behaves identically to calling
  kgid_has_mapping(). This ensures backwards compatibility for workloads
  unaware or not in need of fsid mappings.
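
The fallback rule shared by these helpers can be modelled in user space as
follows. This is a sketch for illustration only: struct model_ns and the
model_*() functions are hypothetical stand-ins that use plain integers
instead of kuid_t and a linear scan instead of the kernel's lookup, and
they leave out the memory barrier the real helpers need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID  ((uint32_t)-1) /* stands in for INVALID_UID / (uid_t)-1 */
#define OVERFLOW_ID 65534u         /* stands in for overflowuid */

struct extent { uint32_t first, lower_first, count; };
struct id_map { size_t nr_extents; struct extent extent[5]; };

/* Hypothetical model of a user namespace with only the uid-related maps. */
struct model_ns { struct id_map uid_map; struct id_map fsuid_map; };

/* Use the fsuid map if one was written, else fall back to the uid map. */
static const struct id_map *pick_map(const struct model_ns *ns)
{
	assert(ns != NULL);
	return ns->fsuid_map.nr_extents ? &ns->fsuid_map : &ns->uid_map;
}

/* Namespace id -> kernel id, like make_kfsuid(). */
static uint32_t model_make_kfsuid(const struct model_ns *ns, uint32_t fsuid)
{
	const struct id_map *map = pick_map(ns);

	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct extent *e = &map->extent[i];

		if (fsuid >= e->first && fsuid - e->first < e->count)
			return e->lower_first + (fsuid - e->first);
	}
	return INVALID_ID;
}

/* Kernel id -> namespace id, like from_kfsuid(). */
static uint32_t model_from_kfsuid(const struct model_ns *ns, uint32_t kfsuid)
{
	const struct id_map *map = pick_map(ns);

	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct extent *e = &map->extent[i];

		if (kfsuid >= e->lower_first && kfsuid - e->lower_first < e->count)
			return e->first + (kfsuid - e->lower_first);
	}
	return INVALID_ID;
}

/* Never fails: unmapped ids become the overflow id. */
static uint32_t model_from_kfsuid_munged(const struct model_ns *ns,
					 uint32_t kfsuid)
{
	uint32_t fsuid = model_from_kfsuid(ns, kfsuid);

	return fsuid == INVALID_ID ? OVERFLOW_ID : fsuid;
}

static int model_kfsuid_has_mapping(const struct model_ns *ns, uint32_t kfsuid)
{
	return model_from_kfsuid(ns, kfsuid) != INVALID_ID;
}
```

With only a uid map of 0 100000 100000, the model behaves like the plain
kuid helpers. Once an fsuid map of 0 300000 100000 is added, the same calls
resolve through the fsuid map instead.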

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/fsuidgid.h |  70 +++++++++++++++
 kernel/user_namespace.c  | 189 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 246 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/fsuidgid.h

diff --git a/include/linux/fsuidgid.h b/include/linux/fsuidgid.h
new file mode 100644
index 000000000000..0ebfdaa796ab
--- /dev/null
+++ b/include/linux/fsuidgid.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FSUIDGID_H
+#define _LINUX_FSUIDGID_H
+
+#include <linux/uidgid.h>
+
+#ifdef CONFIG_USER_NS_FSID
+
+extern kuid_t make_kfsuid(struct user_namespace *from, uid_t fsuid);
+extern kgid_t make_kfsgid(struct user_namespace *from, gid_t fsgid);
+extern uid_t from_kfsuid(struct user_namespace *to, kuid_t kfsuid);
+extern gid_t from_kfsgid(struct user_namespace *to, kgid_t kfsgid);
+extern uid_t from_kfsuid_munged(struct user_namespace *to, kuid_t kfsuid);
+extern gid_t from_kfsgid_munged(struct user_namespace *to, kgid_t kfsgid);
+
+static inline bool kfsuid_has_mapping(struct user_namespace *ns, kuid_t kfsuid)
+{
+	return from_kfsuid(ns, kfsuid) != (uid_t) -1;
+}
+
+static inline bool kfsgid_has_mapping(struct user_namespace *ns, kgid_t kfsgid)
+{
+	return from_kfsgid(ns, kfsgid) != (gid_t) -1;
+}
+
+#else
+
+static inline kuid_t make_kfsuid(struct user_namespace *from, uid_t fsuid)
+{
+	return make_kuid(from, fsuid);
+}
+
+static inline kgid_t make_kfsgid(struct user_namespace *from, gid_t fsgid)
+{
+	return make_kgid(from, fsgid);
+}
+
+static inline uid_t from_kfsuid(struct user_namespace *to, kuid_t kfsuid)
+{
+	return from_kuid(to, kfsuid);
+}
+
+static inline gid_t from_kfsgid(struct user_namespace *to, kgid_t kfsgid)
+{
+	return from_kgid(to, kfsgid);
+}
+
+static inline uid_t from_kfsuid_munged(struct user_namespace *to, kuid_t kfsuid)
+{
+	return from_kuid_munged(to, kfsuid);
+}
+
+static inline gid_t from_kfsgid_munged(struct user_namespace *to, kgid_t kfsgid)
+{
+	return from_kgid_munged(to, kfsgid);
+}
+
+static inline bool kfsuid_has_mapping(struct user_namespace *ns, kuid_t kfsuid)
+{
+	return kuid_has_mapping(ns, kfsuid);
+}
+
+static inline bool kfsgid_has_mapping(struct user_namespace *ns, kgid_t kfsgid)
+{
+	return kgid_has_mapping(ns, kfsgid);
+}
+
+#endif /* CONFIG_USER_NS_FSID */
+
+#endif /* _LINUX_FSUIDGID_H */
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index cbdf456f95f0..398be02de5c3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -20,13 +20,14 @@
 #include <linux/fs_struct.h>
 #include <linux/bsearch.h>
 #include <linux/sort.h>
+#include <linux/fsuidgid.h>
 
 static struct kmem_cache *user_ns_cachep __read_mostly;
 static DEFINE_MUTEX(userns_state_mutex);
 
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
-				struct uid_gid_map *map);
+				struct uid_gid_map *map, bool map_fsid);
 static void free_user_ns(struct work_struct *work);
 
 static struct ucounts *inc_user_namespaces(struct user_namespace *ns, kuid_t uid)
@@ -583,6 +584,166 @@ projid_t from_kprojid_munged(struct user_namespace *targ, kprojid_t kprojid)
 }
 EXPORT_SYMBOL(from_kprojid_munged);
 
+#ifdef CONFIG_USER_NS_FSID
+/**
+ *	make_kfsuid - Map a user-namespace fsuid pair into a kuid.
+ *	@ns:  User namespace that the fsuid is in
+ *	@fsuid: User identifier
+ *
+ *	Maps a user-namespace fsuid pair into a kernel internal kfsuid,
+ *	and returns that kfsuid.
+ *
+ *	When there is no mapping defined for the user-namespace fsuid
+ *	pair INVALID_UID is returned.  Callers are expected to test
+ *	for and handle INVALID_UID being returned.  INVALID_UID
+ *	may be tested for using uid_valid().
+ */
+kuid_t make_kfsuid(struct user_namespace *ns, uid_t fsuid)
+{
+	unsigned extents = ns->fsuid_map.nr_extents;
+	smp_rmb();
+
+	/* Map the fsuid to a global kernel fsuid */
+	if (extents == 0)
+		return KUIDT_INIT(map_id_down(&ns->uid_map, fsuid));
+
+	return KUIDT_INIT(map_id_down(&ns->fsuid_map, fsuid));
+}
+EXPORT_SYMBOL(make_kfsuid);
+
+/**
+ *	from_kfsuid - Create a fsuid from a kfsuid user-namespace pair.
+ *	@targ: The user namespace we want a fsuid in.
+ *	@kfsuid: The kernel internal fsuid to start with.
+ *
+ *	Map @kfsuid into the user-namespace specified by @targ and
+ *	return the resulting fsuid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	If @kfsuid has no mapping in @targ (uid_t)-1 is returned.
+ */
+uid_t from_kfsuid(struct user_namespace *targ, kuid_t kfsuid)
+{
+	unsigned extents = targ->fsuid_map.nr_extents;
+	smp_rmb();
+
+	/* Map the fsuid from a global kernel fsuid */
+	if (extents == 0)
+		return map_id_up(&targ->uid_map, __kuid_val(kfsuid));
+
+	return map_id_up(&targ->fsuid_map, __kuid_val(kfsuid));
+}
+EXPORT_SYMBOL(from_kfsuid);
+
+/**
+ *	from_kfsuid_munged - Create a fsuid from a kfsuid user-namespace pair.
+ *	@targ: The user namespace we want a fsuid in.
+ *	@kfsuid: The kernel internal fsuid to start with.
+ *
+ *	Map @kfsuid into the user-namespace specified by @targ and
+ *	return the resulting fsuid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	Unlike from_kfsuid from_kfsuid_munged never fails and always
+ *	returns a valid fsuid.  This makes from_kfsuid_munged appropriate
+ *	for use in syscalls like stat and getuid where failing the
+ *	system call and failing to provide a valid fsuid are not
+ *	options.
+ *
+ *	If @kfsuid has no mapping in @targ overflowuid is returned.
+ */
+uid_t from_kfsuid_munged(struct user_namespace *targ, kuid_t kfsuid)
+{
+	uid_t fsuid;
+	fsuid = from_kfsuid(targ, kfsuid);
+
+	if (fsuid == (uid_t) -1)
+		fsuid = overflowuid;
+	return fsuid;
+}
+EXPORT_SYMBOL(from_kfsuid_munged);
+
+/**
+ *	make_kfsgid - Map a user-namespace fsgid pair into a kfsgid.
+ *	@ns:  User namespace that the fsgid is in
+ *	@fsgid: Group identifier
+ *
+ *	Maps a user-namespace fsgid pair into a kernel internal kfsgid,
+ *	and returns that kfsgid.
+ *
+ *	When there is no mapping defined for the user-namespace fsgid
+ *	pair INVALID_GID is returned.  Callers are expected to test
+ *	for and handle INVALID_GID being returned.  INVALID_GID
+ *	may be tested for using gid_valid().
+ */
+kgid_t make_kfsgid(struct user_namespace *ns, gid_t fsgid)
+{
+	unsigned extents = ns->fsgid_map.nr_extents;
+	smp_rmb();
+
+	/* Map the fsgid to a global kernel fsgid */
+	if (extents == 0)
+		return KGIDT_INIT(map_id_down(&ns->gid_map, fsgid));
+
+	return KGIDT_INIT(map_id_down(&ns->fsgid_map, fsgid));
+}
+EXPORT_SYMBOL(make_kfsgid);
+
+/**
+ *	from_kfsgid - Create a fsgid from a kfsgid user-namespace pair.
+ *	@targ: The user namespace we want a fsgid in.
+ *	@kfsgid: The kernel internal fsgid to start with.
+ *
+ *	Map @kfsgid into the user-namespace specified by @targ and
+ *	return the resulting fsgid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	If @kfsgid has no mapping in @targ (gid_t)-1 is returned.
+ */
+gid_t from_kfsgid(struct user_namespace *targ, kgid_t kfsgid)
+{
+	unsigned extents = targ->fsgid_map.nr_extents;
+	smp_rmb();
+
+	/* Map the fsgid from a global kernel fsgid */
+	if (extents == 0)
+		return map_id_up(&targ->gid_map, __kgid_val(kfsgid));
+
+	return map_id_up(&targ->fsgid_map, __kgid_val(kfsgid));
+}
+EXPORT_SYMBOL(from_kfsgid);
+
+/**
+ *	from_kfsgid_munged - Create a fsgid from a kfsgid user-namespace pair.
+ *	@targ: The user namespace we want a fsgid in.
+ *	@kfsgid: The kernel internal fsgid to start with.
+ *
+ *	Map @kfsgid into the user-namespace specified by @targ and
+ *	return the resulting fsgid.
+ *
+ *	There is always a mapping into the initial user_namespace.
+ *
+ *	Unlike from_kfsgid from_kfsgid_munged never fails and always
+ *	returns a valid fsgid.  This makes from_kfsgid_munged appropriate
+ *	for use in syscalls like stat and getgid where failing the
+ *	system call and failing to provide a valid fsgid are not options.
+ *
+ *	If @kfsgid has no mapping in @targ overflowgid is returned.
+ */
+gid_t from_kfsgid_munged(struct user_namespace *targ, kgid_t kfsgid)
+{
+	gid_t fsgid;
+	fsgid = from_kfsgid(targ, kfsgid);
+
+	if (fsgid == (gid_t) -1)
+		fsgid = overflowgid;
+	return fsgid;
+}
+EXPORT_SYMBOL(from_kfsgid_munged);
+#endif /* CONFIG_USER_NS_FSID */
 
 static int uid_m_show(struct seq_file *seq, void *v)
 {
@@ -659,7 +820,7 @@ static int fsuid_m_show(struct seq_file *seq, void *v)
 	if ((lower_ns == ns) && lower_ns->parent)
 		lower_ns = lower_ns->parent;
 
-	lower = from_kuid(lower_ns, KUIDT_INIT(extent->lower_first));
+	lower = from_kfsuid(lower_ns, KUIDT_INIT(extent->lower_first));
 
 	seq_printf(seq, "%10u %10u %10u\n",
 		extent->first,
@@ -680,7 +841,7 @@ static int fsgid_m_show(struct seq_file *seq, void *v)
 	if ((lower_ns == ns) && lower_ns->parent)
 		lower_ns = lower_ns->parent;
 
-	lower = from_kgid(lower_ns, KGIDT_INIT(extent->lower_first));
+	lower = from_kfsgid(lower_ns, KGIDT_INIT(extent->lower_first));
 
 	seq_printf(seq, "%10u %10u %10u\n",
 		extent->first,
@@ -931,7 +1092,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos,
 			 int cap_setid,
 			 struct uid_gid_map *map,
-			 struct uid_gid_map *parent_map)
+			 struct uid_gid_map *parent_map, bool map_fsid)
 {
 	struct seq_file *seq = file->private_data;
 	struct user_namespace *ns = seq->private;
@@ -1051,7 +1212,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 
 	ret = -EPERM;
 	/* Validate the user is allowed to use user id's mapped to. */
-	if (!new_idmap_permitted(file, ns, cap_setid, &new_map))
+	if (!new_idmap_permitted(file, ns, cap_setid, &new_map, map_fsid))
 		goto out;
 
 	ret = -EPERM;
@@ -1129,7 +1290,7 @@ ssize_t proc_uid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETUID,
-			 &ns->uid_map, &ns->parent->uid_map);
+			 &ns->uid_map, &ns->parent->uid_map, false);
 }
 
 ssize_t proc_gid_map_write(struct file *file, const char __user *buf,
@@ -1146,7 +1307,7 @@ ssize_t proc_gid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETGID,
-			 &ns->gid_map, &ns->parent->gid_map);
+			 &ns->gid_map, &ns->parent->gid_map, false);
 }
 
 ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
@@ -1164,7 +1325,7 @@ ssize_t proc_projid_map_write(struct file *file, const char __user *buf,
 
 	/* Anyone can set any valid project id no capability needed */
 	return map_write(file, buf, size, ppos, -1,
-			 &ns->projid_map, &ns->parent->projid_map);
+			 &ns->projid_map, &ns->parent->projid_map, false);
 }
 
 #ifdef CONFIG_USER_NS_FSID
@@ -1182,7 +1343,7 @@ ssize_t proc_fsuid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETUID, &ns->fsuid_map,
-			 &ns->parent->fsuid_map);
+			 &ns->parent->fsuid_map, true);
 }
 
 ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
@@ -1199,13 +1360,13 @@ ssize_t proc_fsgid_map_write(struct file *file, const char __user *buf,
 		return -EPERM;
 
 	return map_write(file, buf, size, ppos, CAP_SETGID, &ns->fsgid_map,
-			 &ns->parent->fsgid_map);
+			 &ns->parent->fsgid_map, true);
 }
 #endif
 
 static bool new_idmap_permitted(const struct file *file,
 				struct user_namespace *ns, int cap_setid,
-				struct uid_gid_map *new_map)
+				struct uid_gid_map *new_map, bool map_fsid)
 {
 	const struct cred *cred = file->f_cred;
 	/* Don't allow mappings that would allow anything that wouldn't
@@ -1215,11 +1376,13 @@ static bool new_idmap_permitted(const struct file *file,
 	    uid_eq(ns->owner, cred->euid)) {
 		u32 id = new_map->extent[0].lower_first;
 		if (cap_setid == CAP_SETUID) {
-			kuid_t uid = make_kuid(ns->parent, id);
+			kuid_t uid = map_fsid ? make_kfsuid(ns->parent, id) :
+						make_kuid(ns->parent, id);
 			if (uid_eq(uid, cred->euid))
 				return true;
 		} else if (cap_setid == CAP_SETGID) {
-			kgid_t gid = make_kgid(ns->parent, id);
+			kgid_t gid = map_fsid ? make_kfsgid(ns->parent, id) :
+						make_kgid(ns->parent, id);
 			if (!(ns->flags & USERNS_SETGROUPS_ALLOWED) &&
 			    gid_eq(gid, cred->egid))
 				return true;
-- 
2.25.0
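
For readers who want to poke at the fallback rule in make_kfsuid() and
from_kfsuid() outside the kernel, here is a minimal userspace sketch. It is
not the kernel code: struct idmap, struct userns, map_down(), map_up() and the
model_* functions are hypothetical stand-ins, and the linear scan ignores the
sorted-extent lookup the kernel uses for large maps. The only point it makes is
that an fsid map with zero extents defers to the regular id map.

```c
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One extent: ids [first, first + count) map to [lower_first, lower_first + count). */
struct extent { uint32_t first, lower_first, count; };

/* Simplified stand-in for struct uid_gid_map; nr_extents == 0 means "never written". */
struct idmap { unsigned nr_extents; struct extent extent[5]; };

/* Simplified stand-in for the two maps of a user namespace that matter here. */
struct userns { struct idmap uid_map, fsuid_map; };

/* Namespace id -> kernel id, or INVALID_ID if no extent covers it. */
static uint32_t map_down(const struct idmap *m, uint32_t id)
{
	for (unsigned i = 0; i < m->nr_extents; i++) {
		const struct extent *e = &m->extent[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return INVALID_ID;
}

/* Kernel id -> namespace id, or INVALID_ID if no extent covers it. */
static uint32_t map_up(const struct idmap *m, uint32_t kid)
{
	for (unsigned i = 0; i < m->nr_extents; i++) {
		const struct extent *e = &m->extent[i];

		if (kid >= e->lower_first && kid - e->lower_first < e->count)
			return e->first + (kid - e->lower_first);
	}
	return INVALID_ID;
}

/* Model of make_kfsuid(): use the fsuid map only if it has been written. */
static uint32_t model_make_kfsuid(const struct userns *ns, uint32_t fsuid)
{
	const struct idmap *m = ns->fsuid_map.nr_extents ? &ns->fsuid_map
							 : &ns->uid_map;
	return map_down(m, fsuid);
}

/* Model of from_kfsuid(): same fallback, opposite direction. */
static uint32_t model_from_kfsuid(const struct userns *ns, uint32_t kfsuid)
{
	const struct idmap *m = ns->fsuid_map.nr_extents ? &ns->fsuid_map
							 : &ns->uid_map;
	return map_up(m, kfsuid);
}
```

With only a uid map of "0 100000 100000" written, model_make_kfsuid(&ns, 0)
yields 100000. Once an fsuid map of "0 300000 100000" is added, the same call
yields 300000 and kernel id 100000 no longer maps back through
model_from_kfsuid().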



* [PATCH 05/24] proc: task_state(): use from_kfs{g,u}id_munged
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (3 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 04/24] fsuidgid: add fsid mapping helpers Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 16:57 ` [PATCH 06/24] fs: add is_userns_visible() helper Christian Brauner
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

If fsid mappings have been written, this will cause proc to look at fsid
mappings for the user namespace. If no fsid mappings have been written the
behavior is as before.

Here is part of the output from /proc/<pid>/status from the initial user
namespace for systemd running in an unprivileged container as user namespace
root with id mapping 0 100000 100000 and fsid mapping 0 300000 100000:

Name:	systemd
Umask:	0000
State:	S (sleeping)
Tgid:	13023
Ngid:	0
Pid:	13023
PPid:	13008
TracerPid:	0
Uid:	100000	100000	100000	300000
Gid:	100000	100000	100000	300000
FDSize:	64
Groups:

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/proc/array.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 5efaf3708ec6..d4a04f85a67e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -91,6 +91,7 @@
 #include <linux/string_helpers.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_struct.h>
+#include <linux/fsuidgid.h>
 
 #include <asm/pgtable.h>
 #include <asm/processor.h>
@@ -193,11 +194,11 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	seq_put_decimal_ull(m, "\nUid:\t", from_kuid_munged(user_ns, cred->uid));
 	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->euid));
 	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->suid));
-	seq_put_decimal_ull(m, "\t", from_kuid_munged(user_ns, cred->fsuid));
+	seq_put_decimal_ull(m, "\t", from_kfsuid_munged(user_ns, cred->fsuid));
 	seq_put_decimal_ull(m, "\nGid:\t", from_kgid_munged(user_ns, cred->gid));
 	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->egid));
 	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->sgid));
-	seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->fsgid));
+	seq_put_decimal_ull(m, "\t", from_kfsgid_munged(user_ns, cred->fsgid));
 	seq_put_decimal_ull(m, "\nFDSize:\t", max_fds);
 
 	seq_puts(m, "\nGroups:\t");
-- 
2.25.0
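
The Uid: line quoted in the commit message can be reproduced with a little
arithmetic. The sketch below is a userspace illustration only: struct range,
to_kernel() and format_uid_line() are hypothetical names, and it assumes that
the real, effective and saved uids go through the id mapping while the fsuid
goes through the fsid mapping. Since the initial user namespace maps ids onto
themselves, the kernel ids are printed unchanged.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* A single-extent mapping as written to a map file: "first lower_first count". */
struct range { uint32_t first, lower_first, count; };

/* Namespace id -> kernel id; (uint32_t)-1 if the id is outside the extent. */
static uint32_t to_kernel(struct range r, uint32_t id)
{
	if (id >= r.first && id - r.first < r.count)
		return r.lower_first + (id - r.first);
	return (uint32_t)-1;
}

/*
 * Format the Uid: line as seen from the initial user namespace for a task
 * whose uid, euid, suid and fsuid are all ns_uid inside its own namespace.
 */
static int format_uid_line(char *buf, size_t len, struct range idmap,
			   struct range fsidmap, uint32_t ns_uid)
{
	uint32_t kuid = to_kernel(idmap, ns_uid);     /* uid, euid, suid */
	uint32_t kfsuid = to_kernel(fsidmap, ns_uid); /* fsuid */

	return snprintf(buf, len, "Uid:\t%u\t%u\t%u\t%u",
			(unsigned)kuid, (unsigned)kuid, (unsigned)kuid,
			(unsigned)kfsuid);
}
```

For namespace root (ns_uid 0) with id mapping 0 100000 100000 and fsid mapping
0 300000 100000 this gives 100000 for the first three fields and 300000 for the
fourth, matching the output above.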



* [PATCH 06/24] fs: add is_userns_visible() helper
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (4 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 05/24] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 16:57 ` [PATCH 07/24] namei: may_{o_}create(): handle fsid mappings Christian Brauner
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Introduce a helper which makes it possible to detect filesystems whose
superblock is visible in multiple user namespaces. This currently only
means proc and sys. Such filesystems usually have special semantics so their
behavior will not be changed with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/fs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98e0349adb52..1449cd363fb6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3627,4 +3627,9 @@ static inline int inode_drain_writes(struct inode *inode)
 	return filemap_write_and_wait(inode->i_mapping);
 }
 
+static inline bool is_userns_visible(unsigned long iflags)
+{
+	return (iflags & SB_I_USERNS_VISIBLE);
+}
+
 #endif /* _LINUX_FS_H */
-- 
2.25.0



* [PATCH 07/24] namei: may_{o_}create(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (5 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 06/24] fs: add is_userns_visible() helper Christian Brauner
@ 2020-02-11 16:57 ` Christian Brauner
  2020-02-11 16:57 ` [PATCH 08/24] inode: inode_owner_or_capable(): " Christian Brauner
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch may_{o_}create() to look up fsids in the fsid mappings. If no fsid
mappings are set up, the behavior is unchanged, i.e. fsids are looked up in
the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/namei.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4fb61e0754ed..c85c65adfa9d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -39,6 +39,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/fsuidgid.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -2771,6 +2772,20 @@ static int may_delete(struct inode *dir, struct dentry *victim, bool isdir)
 	return 0;
 }
 
+static bool fsid_has_mapping(struct user_namespace *ns, struct super_block *sb)
+{
+	if (is_userns_visible(sb->s_iflags)) {
+		if (!kuid_has_mapping(ns, current_fsuid()) ||
+		    !kgid_has_mapping(ns, current_fsgid()))
+			return false;
+	} else if (!kfsuid_has_mapping(ns, current_fsuid()) ||
+		   !kfsgid_has_mapping(ns, current_fsgid())) {
+		return false;
+	}
+
+	return true;
+}
+
 /*	Check whether we can create an object with dentry child in directory
  *  dir.
  *  1. We can't do it if child already exists (open has special treatment for
@@ -2789,8 +2804,7 @@ static inline int may_create(struct inode *dir, struct dentry *child)
 	if (IS_DEADDIR(dir))
 		return -ENOENT;
 	s_user_ns = dir->i_sb->s_user_ns;
-	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
-	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+	if (!fsid_has_mapping(s_user_ns, dir->i_sb))
 		return -EOVERFLOW;
 	return inode_permission(dir, MAY_WRITE | MAY_EXEC);
 }
@@ -2972,8 +2986,7 @@ static int may_o_create(const struct path *dir, struct dentry *dentry, umode_t m
 		return error;
 
 	s_user_ns = dir->dentry->d_sb->s_user_ns;
-	if (!kuid_has_mapping(s_user_ns, current_fsuid()) ||
-	    !kgid_has_mapping(s_user_ns, current_fsgid()))
+	if (!fsid_has_mapping(s_user_ns, dir->dentry->d_sb))
 		return -EOVERFLOW;
 
 	error = inode_permission(dir->dentry->d_inode, MAY_WRITE | MAY_EXEC);
-- 
2.25.0
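
The branch selection in fsid_has_mapping() can be modelled in userspace as
follows. This is a sketch under assumptions, not the kernel helper: it checks a
single id where the real function checks both the fsuid and the fsgid, the
names struct range, struct userns and model_* are hypothetical, and
MODEL_SB_I_USERNS_VISIBLE is an illustrative flag value rather than a value
taken from this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bit; the value is an assumption made for this sketch. */
#define MODEL_SB_I_USERNS_VISIBLE 0x10UL

/* A single-extent mapping: "first lower_first count". */
struct range { uint32_t first, lower_first, count; };

/* The two maps of the superblock's user namespace that the check consults. */
struct userns {
	struct range id_map;
	struct range fsid_map;
	bool fsid_written;	/* false models an fsid map with zero extents */
};

/* True if kernel id kid falls inside the lower range of r. */
static bool has_mapping(struct range r, uint32_t kid)
{
	return kid >= r.lower_first && kid - r.lower_first < r.count;
}

/* Model of kfsuid_has_mapping(): falls back to the id map if no fsid map. */
static bool model_kfsid_has_mapping(const struct userns *ns, uint32_t kid)
{
	return has_mapping(ns->fsid_written ? ns->fsid_map : ns->id_map, kid);
}

/*
 * Model of fsid_has_mapping(): superblocks visible in several user
 * namespaces keep using the id map, everything else uses the fsid map.
 */
static bool model_fsid_has_mapping(const struct userns *ns,
				   unsigned long s_iflags, uint32_t kfsid)
{
	if (s_iflags & MODEL_SB_I_USERNS_VISIBLE)
		return has_mapping(ns->id_map, kfsid);
	return model_kfsid_has_mapping(ns, kfsid);
}
```

With id mapping 0 100000 100000 and fsid mapping 0 300000 100000, kernel id
300000 passes the check on an ordinary filesystem and fails it on a
userns-visible one. In the kernel a failed check is what makes may_create()
and may_o_create() return -EOVERFLOW.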



* [PATCH 08/24] inode: inode_owner_or_capable(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (6 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 07/24] namei: may_{o_}create(): handle fsid mappings Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 09/24] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch inode_owner_or_capable() to look up fsids in the fsid mappings. If no
fsid mappings are set up, the behavior is unchanged, i.e. fsids are looked up
in the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index 96d62d97694e..c1ed43c5054c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,7 @@
 #include <linux/ratelimit.h>
 #include <linux/list_lru.h>
 #include <linux/iversion.h>
+#include <linux/fsuidgid.h>
 #include <trace/events/writeback.h>
 #include "internal.h"
 
@@ -2083,8 +2084,12 @@ bool inode_owner_or_capable(const struct inode *inode)
 		return true;
 
 	ns = current_user_ns();
-	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+	if (is_userns_visible(inode->i_sb->s_iflags)) {
+		if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+			return true;
+	} else if (kfsuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER)) {
 		return true;
+	}
 	return false;
 }
 EXPORT_SYMBOL(inode_owner_or_capable);
-- 
2.25.0



* [PATCH 09/24] capability: privileged_wrt_inode_uidgid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (7 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 08/24] inode: inode_owner_or_capable(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 10/24] stat: " Christian Brauner
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch privileged_wrt_inode_uidgid() to look up fsids in the fsid mappings. If
no fsid mappings are set up, the behavior is unchanged, i.e. fsids are looked
up in the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/capability.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..de7edd5c9900 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -19,6 +19,8 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/uaccess.h>
+#include <linux/fsuidgid.h>
+#include <linux/fs.h>
 
 /*
  * Leveraged for setting/resetting capabilities
@@ -484,10 +486,15 @@ EXPORT_SYMBOL(file_ns_capable);
  *
  * Return true if the inode uid and gid are within the namespace.
  */
-bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
+bool privileged_wrt_inode_uidgid(struct user_namespace *ns,
+				 const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	if (is_userns_visible(inode->i_sb->s_iflags))
+		return kuid_has_mapping(ns, inode->i_uid) &&
+		       kgid_has_mapping(ns, inode->i_gid);
+
+	return kfsuid_has_mapping(ns, inode->i_uid) &&
+	       kfsgid_has_mapping(ns, inode->i_gid);
 }
 
 /**
-- 
2.25.0



* [PATCH 10/24] stat: handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (8 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 09/24] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 11/24] open: chown_common(): " Christian Brauner
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch the attribute functions that look up fsids to look them up in the fsid
mappings. If no fsid mappings are set up, the behavior is unchanged, i.e.
fsids are looked up in the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/stat.c            | 48 +++++++++++++++++++++++++++++++++++---------
 include/linux/stat.h |  1 +
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index c38e4c2e1221..1cced54e79d4 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -10,6 +10,7 @@
 #include <linux/errno.h>
 #include <linux/file.h>
 #include <linux/highuid.h>
+#include <linux/fsuidgid.h>
 #include <linux/fs.h>
 #include <linux/namei.h>
 #include <linux/security.h>
@@ -77,6 +78,8 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	stat->userns_visible = is_userns_visible(inode->i_sb->s_iflags);
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
@@ -229,8 +232,13 @@ static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * sta
 	tmp.st_nlink = stat->nlink;
 	if (tmp.st_nlink != stat->nlink)
 		return -EOVERFLOW;
-	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
-	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	if (stat->userns_visible) {
+		SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	} else {
+		SET_UID(tmp.st_uid, from_kfsuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kfsgid_munged(current_user_ns(), stat->gid));
+	}
 	tmp.st_rdev = old_encode_dev(stat->rdev);
 #if BITS_PER_LONG == 32
 	if (stat->size > MAX_NON_LFS)
@@ -317,8 +325,13 @@ static int cp_new_stat(struct kstat *stat, struct stat __user *statbuf)
 	tmp.st_nlink = stat->nlink;
 	if (tmp.st_nlink != stat->nlink)
 		return -EOVERFLOW;
-	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
-	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	if (stat->userns_visible) {
+		SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	} else {
+		SET_UID(tmp.st_uid, from_kfsuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kfsgid_munged(current_user_ns(), stat->gid));
+	}
 	tmp.st_rdev = encode_dev(stat->rdev);
 	tmp.st_size = stat->size;
 	tmp.st_atime = stat->atime.tv_sec;
@@ -461,8 +474,13 @@ static long cp_new_stat64(struct kstat *stat, struct stat64 __user *statbuf)
 #endif
 	tmp.st_mode = stat->mode;
 	tmp.st_nlink = stat->nlink;
-	tmp.st_uid = from_kuid_munged(current_user_ns(), stat->uid);
-	tmp.st_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	if (stat->userns_visible) {
+		tmp.st_uid = from_kuid_munged(current_user_ns(), stat->uid);
+		tmp.st_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	} else {
+		tmp.st_uid = from_kfsuid_munged(current_user_ns(), stat->uid);
+		tmp.st_gid = from_kfsgid_munged(current_user_ns(), stat->gid);
+	}
 	tmp.st_atime = stat->atime.tv_sec;
 	tmp.st_atime_nsec = stat->atime.tv_nsec;
 	tmp.st_mtime = stat->mtime.tv_sec;
@@ -534,8 +552,13 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_blksize = stat->blksize;
 	tmp.stx_attributes = stat->attributes;
 	tmp.stx_nlink = stat->nlink;
-	tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
-	tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	if (stat->userns_visible) {
+		tmp.stx_uid = from_kuid_munged(current_user_ns(), stat->uid);
+		tmp.stx_gid = from_kgid_munged(current_user_ns(), stat->gid);
+	} else {
+		tmp.stx_uid = from_kfsuid_munged(current_user_ns(), stat->uid);
+		tmp.stx_gid = from_kfsgid_munged(current_user_ns(), stat->gid);
+	}
 	tmp.stx_mode = stat->mode;
 	tmp.stx_ino = stat->ino;
 	tmp.stx_size = stat->size;
@@ -605,8 +628,13 @@ static int cp_compat_stat(struct kstat *stat, struct compat_stat __user *ubuf)
 	tmp.st_nlink = stat->nlink;
 	if (tmp.st_nlink != stat->nlink)
 		return -EOVERFLOW;
-	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
-	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	if (stat->userns_visible) {
+		SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat->gid));
+	} else {
+		SET_UID(tmp.st_uid, from_kfsuid_munged(current_user_ns(), stat->uid));
+		SET_GID(tmp.st_gid, from_kfsgid_munged(current_user_ns(), stat->gid));
+	}
 	tmp.st_rdev = old_encode_dev(stat->rdev);
 	if ((u64) stat->size > MAX_NON_LFS)
 		return -EOVERFLOW;
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 528c4baad091..e6d4ba73a970 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -47,6 +47,7 @@ struct kstat {
 	struct timespec64 ctime;
 	struct timespec64 btime;			/* File creation time */
 	u64		blocks;
+	bool		userns_visible;
 };
 
 #endif
-- 
2.25.0



* [PATCH 11/24] open: chown_common(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (9 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 10/24] stat: " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 12/24] posix_acl: " Christian Brauner
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch chown_common() to look up fsids in the fsid mappings. If no fsid
mappings are set up, the behavior is unchanged, i.e. fsids are looked up in
the id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/open.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index b62f5c0923a8..e5154841152c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -32,6 +32,7 @@
 #include <linux/ima.h>
 #include <linux/dnotify.h>
 #include <linux/compat.h>
+#include <linux/fsuidgid.h>
 
 #include "internal.h"
 
@@ -626,8 +627,13 @@ static int chown_common(const struct path *path, uid_t user, gid_t group)
 	kuid_t uid;
 	kgid_t gid;
 
-	uid = make_kuid(current_user_ns(), user);
-	gid = make_kgid(current_user_ns(), group);
+	if (is_userns_visible(inode->i_sb->s_iflags)) {
+		uid = make_kuid(current_user_ns(), user);
+		gid = make_kgid(current_user_ns(), group);
+	} else {
+		uid = make_kfsuid(current_user_ns(), user);
+		gid = make_kfsgid(current_user_ns(), group);
+	}
 
 retry_deleg:
 	newattrs.ia_valid =  ATTR_CTIME;
-- 
2.25.0



* [PATCH 12/24] posix_acl: handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (10 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 11/24] open: chown_common(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 13/24] attr: notify_change(): " Christian Brauner
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch the posix_acl helpers to look up fsids in the fsid mappings. If no fsid
mappings are set up, the behavior is unchanged, i.e. fsids are looked up in
the id mappings.

Afaict, none of the filesystems that share a superblock in all user namespaces
currently support acls, so this change should be safe to do unconditionally.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/posix_acl.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 249672bf54fe..763bba24f380 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -22,6 +22,7 @@
 #include <linux/xattr.h>
 #include <linux/export.h>
 #include <linux/user_namespace.h>
+#include <linux/fsuidgid.h>
 
 static struct posix_acl **acl_by_type(struct inode *inode, int type)
 {
@@ -692,12 +693,12 @@ static void posix_acl_fix_xattr_userns(
 	for (end = entry + count; entry != end; entry++) {
 		switch(le16_to_cpu(entry->e_tag)) {
 		case ACL_USER:
-			uid = make_kuid(from, le32_to_cpu(entry->e_id));
-			entry->e_id = cpu_to_le32(from_kuid(to, uid));
+			uid = make_kfsuid(from, le32_to_cpu(entry->e_id));
+			entry->e_id = cpu_to_le32(from_kfsuid(to, uid));
 			break;
 		case ACL_GROUP:
-			gid = make_kgid(from, le32_to_cpu(entry->e_id));
-			entry->e_id = cpu_to_le32(from_kgid(to, gid));
+			gid = make_kfsgid(from, le32_to_cpu(entry->e_id));
+			entry->e_id = cpu_to_le32(from_kfsgid(to, gid));
 			break;
 		default:
 			break;
@@ -746,12 +747,12 @@ posix_acl_from_xattr(struct user_namespace *user_ns,
 		return ERR_PTR(-EINVAL);
 	if (count == 0)
 		return NULL;
-	
+
 	acl = posix_acl_alloc(count, GFP_NOFS);
 	if (!acl)
 		return ERR_PTR(-ENOMEM);
 	acl_e = acl->a_entries;
-	
+
 	for (end = entry + count; entry != end; acl_e++, entry++) {
 		acl_e->e_tag  = le16_to_cpu(entry->e_tag);
 		acl_e->e_perm = le16_to_cpu(entry->e_perm);
@@ -765,14 +766,14 @@ posix_acl_from_xattr(struct user_namespace *user_ns,
 
 			case ACL_USER:
 				acl_e->e_uid =
-					make_kuid(user_ns,
+					make_kfsuid(user_ns,
 						  le32_to_cpu(entry->e_id));
 				if (!uid_valid(acl_e->e_uid))
 					goto fail;
 				break;
 			case ACL_GROUP:
 				acl_e->e_gid =
-					make_kgid(user_ns,
+					make_kfsgid(user_ns,
 						  le32_to_cpu(entry->e_id));
 				if (!gid_valid(acl_e->e_gid))
 					goto fail;
@@ -817,11 +818,11 @@ posix_acl_to_xattr(struct user_namespace *user_ns, const struct posix_acl *acl,
 		switch(acl_e->e_tag) {
 		case ACL_USER:
 			ext_entry->e_id =
-				cpu_to_le32(from_kuid(user_ns, acl_e->e_uid));
+				cpu_to_le32(from_kfsuid(user_ns, acl_e->e_uid));
 			break;
 		case ACL_GROUP:
 			ext_entry->e_id =
-				cpu_to_le32(from_kgid(user_ns, acl_e->e_gid));
+				cpu_to_le32(from_kfsgid(user_ns, acl_e->e_gid));
 			break;
 		default:
 			ext_entry->e_id = cpu_to_le32(ACL_UNDEFINED_ID);
-- 
2.25.0



* [PATCH 13/24] attr: notify_change(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (11 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 12/24] posix_acl: " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 14/24] commoncap: cap_task_fix_setuid(): " Christian Brauner
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch notify_change() to look up fsids in the fsid mappings. If no fsid
mappings are set up, the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Filesystems that share a superblock in all user namespaces they are mounted in
will retain their old semantics even with the introduction of fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/attr.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index df28035aa23e..3aa65165fb06 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -17,6 +17,8 @@
 #include <linux/security.h>
 #include <linux/evm.h>
 #include <linux/ima.h>
+#include <linux/fsuidgid.h>
+#include <linux/fs.h>
 
 static bool chown_ok(const struct inode *inode, kuid_t uid)
 {
@@ -311,12 +313,21 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	 * Verify that uid/gid changes are valid in the target
 	 * namespace of the superblock.
 	 */
-	if (ia_valid & ATTR_UID &&
-	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
-		return -EOVERFLOW;
-	if (ia_valid & ATTR_GID &&
-	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
-		return -EOVERFLOW;
+	if (is_userns_visible(inode->i_sb->s_iflags)) {
+		if (ia_valid & ATTR_UID &&
+		    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+			return -EOVERFLOW;
+		if (ia_valid & ATTR_GID &&
+		    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+			return -EOVERFLOW;
+	} else {
+		if (ia_valid & ATTR_UID &&
+		    !kfsuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
+			return -EOVERFLOW;
+		if (ia_valid & ATTR_GID &&
+		    !kfsgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
+			return -EOVERFLOW;
+	}
 
 	/* Don't allow modifications of files with invalid uids or
 	 * gids unless those uids & gids are being made valid.
-- 
2.25.0
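
A minimal userspace sketch of the ownership check that the notify_change()
patch above adds, assuming a mapping that consists of a single extent. The
names struct extent, has_mapping() and check_new_owner() are hypothetical and
only model the kernel helpers; they are not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical single-extent mapping: namespace ids [first, first + count)
 * correspond to kernel ids [lower_first, lower_first + count).
 */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Return true if kernel id @kid is covered by mapping @e. */
static bool has_mapping(const struct extent *e, uint32_t kid)
{
	return kid >= e->lower_first && kid - e->lower_first < e->count;
}

/*
 * Model of the patched check: superblocks that are visible in all user
 * namespaces keep using the id mapping, everything else consults the fsid
 * mapping. An unmapped target owner is rejected with -EOVERFLOW.
 */
static int check_new_owner(bool userns_visible, const struct extent *idmap,
			   const struct extent *fsidmap, uint32_t kid)
{
	const struct extent *map;

	assert(idmap && fsidmap);
	map = userns_visible ? idmap : fsidmap;
	return has_mapping(map, kid) ? 0 : -EOVERFLOW;
}
```

With an id mapping of 0 100000 65536 and an fsid mapping of 0 300000 65536,
a chown to kernel id 300000 passes on an ordinary filesystem but is rejected
on a superblock that is shared across user namespaces.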



* [PATCH 14/24] commoncap: cap_task_fix_setuid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (12 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 13/24] attr: notify_change(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 15/24] commoncap:cap_bprm_set_creds(): " Christian Brauner
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch cap_task_fix_setuid() to look up fsids in the fsid mappings. If no fsid
mappings are set up, the behavior is unchanged, i.e. fsids are looked up in the
id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 security/commoncap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index f4ee0ae106b2..ecfa0d0c250e 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -24,6 +24,7 @@
 #include <linux/user_namespace.h>
 #include <linux/binfmts.h>
 #include <linux/personality.h>
+#include <linux/fsuidgid.h>
 
 /*
  * If a non-root user executes a setuid-root binary in
@@ -1051,7 +1052,7 @@ int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags)
 		 *          if not, we might be a bit too harsh here.
 		 */
 		if (!issecure(SECURE_NO_SETUID_FIXUP)) {
-			kuid_t root_uid = make_kuid(old->user_ns, 0);
+			kuid_t root_uid = make_kfsuid(old->user_ns, 0);
 			if (uid_eq(old->fsuid, root_uid) && !uid_eq(new->fsuid, root_uid))
 				new->cap_effective =
 					cap_drop_fs_set(new->cap_effective);
-- 
2.25.0



* [PATCH 15/24] commoncap:cap_bprm_set_creds(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (13 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 14/24] commoncap: cap_task_fix_setuid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 16/24] sys: __sys_setfsuid(): " Christian Brauner
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

During exec the kfsids are currently reset to the effective kids. To retain the
same semantics with the introduction of fsid mappings, we look up the userspace
effective id in the id mappings and translate the effective id into the
corresponding kfsid in the fsid mappings. This means the behavior is unchanged
when no fsid mappings are set up, and the semantics stay the same even when fsid
mappings are set up.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 security/commoncap.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/security/commoncap.c b/security/commoncap.c
index ecfa0d0c250e..8d1a81e98610 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -811,7 +811,10 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 	struct cred *new = bprm->cred;
 	bool effective = false, has_fcap = false, is_setid;
 	int ret;
-	kuid_t root_uid;
+	kuid_t root_uid, kfsuid;
+	kgid_t kfsgid;
+	uid_t fsuid;
+	gid_t fsgid;
 
 	if (WARN_ON(!cap_ambient_invariant_ok(old)))
 		return -EPERM;
@@ -848,8 +851,15 @@ int cap_bprm_set_creds(struct linux_binprm *bprm)
 						   old->cap_permitted);
 	}
 
-	new->suid = new->fsuid = new->euid;
-	new->sgid = new->fsgid = new->egid;
+	fsuid = from_kuid_munged(new->user_ns, new->euid);
+	kfsuid = make_kfsuid(new->user_ns, fsuid);
+	new->suid = new->euid;
+	new->fsuid = kfsuid;
+
+	fsgid = from_kgid_munged(new->user_ns, new->egid);
+	kfsgid = make_kfsgid(new->user_ns, fsgid);
+	new->sgid = new->egid;
+	new->fsgid = kfsgid;
 
 	/* File caps or setid cancels ambient. */
 	if (has_fcap || is_setid)
-- 
2.25.0
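
The two-step translation that the cap_bprm_set_creds() patch above performs
(from_kuid_munged() through the id mapping, then make_kfsuid() through the fsid
mapping) can be sketched in userspace as follows. This is a simplified model
under stated assumptions: each mapping is a single extent, unmapped ids return
INVALID_ID instead of the overflow uid, and struct extent, map_up(), map_down()
and exec_kfsuid() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical single-extent mapping: namespace ids [first, first + count)
 * correspond to kernel ids [lower_first, lower_first + count).
 */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace id -> kernel id (models make_kuid()/make_kfsuid()). */
static uint32_t map_down(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Kernel id -> namespace id (models from_kuid(), without munging). */
static uint32_t map_up(const struct extent *e, uint32_t kid)
{
	if (kid >= e->lower_first && kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return INVALID_ID;
}

/*
 * Model of the exec-time reset: translate the effective kernel uid back to
 * the userspace euid via the id mapping, then forward via the fsid mapping.
 */
static uint32_t exec_kfsuid(const struct extent *idmap,
			    const struct extent *fsidmap, uint32_t keuid)
{
	uint32_t euid;

	assert(idmap && fsidmap);
	euid = map_up(idmap, keuid);
	if (euid == INVALID_ID)
		return INVALID_ID;
	return map_down(fsidmap, euid);
}
```

When both mappings are identical the round trip returns the original keuid,
which matches the "behavior is unchanged" claim in the commit message.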



* [PATCH 16/24] sys: __sys_setfsuid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (14 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 15/24] commoncap:cap_bprm_set_creds(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 17/24] sys: __sys_setfsgid(): " Christian Brauner
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setfsuid() to look up fsids in the fsid mappings. If no fsid mappings are
set up, the behavior is unchanged, i.e. fsids are looked up in the id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index a9331f101883..ae376d32c1e3 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -59,6 +59,7 @@
 #include <linux/sched/cputime.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
+#include <linux/fsuidgid.h>
 #include <linux/cred.h>
 
 #include <linux/nospec.h>
@@ -802,9 +803,9 @@ long __sys_setfsuid(uid_t uid)
 	kuid_t kuid;
 
 	old = current_cred();
-	old_fsuid = from_kuid_munged(old->user_ns, old->fsuid);
+	old_fsuid = from_kfsuid_munged(old->user_ns, old->fsuid);
 
-	kuid = make_kuid(old->user_ns, uid);
+	kuid = make_kfsuid(old->user_ns, uid);
 	if (!uid_valid(kuid))
 		return old_fsuid;
 
-- 
2.25.0



* [PATCH 17/24] sys: __sys_setfsgid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (15 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 16/24] sys: __sys_setfsuid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 18/24] sys:__sys_setuid(): " Christian Brauner
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setfsgid() to look up fsids in the fsid mappings. If no fsid mappings are
set up, the behavior is unchanged, i.e. fsids are looked up in the id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index ae376d32c1e3..b89334ad0908 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -847,9 +847,9 @@ long __sys_setfsgid(gid_t gid)
 	kgid_t kgid;
 
 	old = current_cred();
-	old_fsgid = from_kgid_munged(old->user_ns, old->fsgid);
+	old_fsgid = from_kfsgid_munged(old->user_ns, old->fsgid);
 
-	kgid = make_kgid(old->user_ns, gid);
+	kgid = make_kfsgid(old->user_ns, gid);
 	if (!gid_valid(kgid))
 		return old_fsgid;
 
-- 
2.25.0



* [PATCH 18/24] sys:__sys_setuid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (16 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 17/24] sys: __sys_setfsgid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 19/24] sys:__sys_setgid(): " Christian Brauner
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setuid() to look up fsids in the fsid mappings. If no fsid mappings are
set up, the behavior is unchanged, i.e. fsids are looked up in the id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index b89334ad0908..afaec8d46bc5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -574,11 +574,16 @@ long __sys_setuid(uid_t uid)
 	struct cred *new;
 	int retval;
 	kuid_t kuid;
+	kuid_t kfsuid;
 
 	kuid = make_kuid(ns, uid);
 	if (!uid_valid(kuid))
 		return -EINVAL;
 
+	kfsuid = make_kfsuid(ns, uid);
+	if (!uid_valid(kfsuid))
+		return -EINVAL;
+
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
@@ -596,7 +601,8 @@ long __sys_setuid(uid_t uid)
 		goto error;
 	}
 
-	new->fsuid = new->euid = kuid;
+	new->euid = kuid;
+	new->fsuid = kfsuid;
 
 	retval = security_task_fix_setuid(new, old, LSM_SETID_ID);
 	if (retval < 0)
-- 
2.25.0
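
The validation order in the __sys_setuid() patch above can be modeled in
userspace as below: the requested uid must be valid in both the id mapping and
the fsid mapping before any credential changes. This is only a sketch; it
assumes single-extent mappings, skips the capability checks entirely, and
struct extent, map_down() and model_setuid() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/*
 * Hypothetical single-extent mapping: namespace ids [first, first + count)
 * correspond to kernel ids [lower_first, lower_first + count).
 */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace id -> kernel id (models make_kuid()/make_kfsuid()). */
static uint32_t map_down(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/*
 * Model of the patched setuid(): euid comes from the id mapping, fsuid from
 * the fsid mapping. If either lookup fails, nothing is written and -EINVAL
 * is returned.
 */
static int model_setuid(const struct extent *idmap,
			const struct extent *fsidmap, uint32_t uid,
			uint32_t *keuid, uint32_t *kfsuid)
{
	uint32_t kuid, kfs;

	assert(idmap && fsidmap && keuid && kfsuid);
	kuid = map_down(idmap, uid);
	kfs = map_down(fsidmap, uid);
	if (kuid == INVALID_ID || kfs == INVALID_ID)
		return -EINVAL;

	*keuid = kuid;
	*kfsuid = kfs;
	return 0;
}
```

One consequence worth noting: a uid that is mapped in the id mapping but falls
outside a smaller fsid mapping is rejected, even though the same call would
succeed without fsid mappings.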



* [PATCH 19/24] sys:__sys_setgid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (17 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 18/24] sys:__sys_setuid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 20/24] sys:__sys_setreuid(): " Christian Brauner
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setgid() to look up fsids in the fsid mappings. If no fsid mappings are
set up, the behavior is unchanged, i.e. fsids are looked up in the id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index afaec8d46bc5..11f41e0a4974 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -416,24 +416,31 @@ long __sys_setgid(gid_t gid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kgid_t kgid;
+	kgid_t kgid, kfsgid;
 
 	kgid = make_kgid(ns, gid);
 	if (!gid_valid(kgid))
 		return -EINVAL;
 
+	kfsgid = make_kfsgid(ns, gid);
+	if (!gid_valid(kfsgid))
+		return -EINVAL;
+
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
 	old = current_cred();
 
 	retval = -EPERM;
-	if (ns_capable(old->user_ns, CAP_SETGID))
-		new->gid = new->egid = new->sgid = new->fsgid = kgid;
-	else if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->sgid))
-		new->egid = new->fsgid = kgid;
-	else
+	if (ns_capable(old->user_ns, CAP_SETGID)) {
+		new->gid = new->egid = new->sgid = kgid;
+		new->fsgid = kfsgid;
+	} else if (gid_eq(kgid, old->gid) || gid_eq(kgid, old->sgid)) {
+		new->egid = kgid;
+		new->fsgid = kfsgid;
+	} else {
 		goto error;
+	}
 
 	return commit_creds(new);
 
-- 
2.25.0



* [PATCH 20/24] sys:__sys_setreuid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (18 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 19/24] sys:__sys_setgid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 21/24] sys:__sys_setregid(): " Christian Brauner
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setreuid() to look up fsids in the fsid mappings. If no fsid mappings are
set up, the behavior is unchanged, i.e. fsids are looked up in the id mappings.

During setreuid() the kfsuid is set to the keuid corresponding to the euid that
is requested by userspace. If the requested euid is -1, the kfsuid is reset to
the current keuid. For the latter case this means we need to look up the
userspace euid corresponding to the current keuid in the id mappings and
translate this euid into the corresponding kfsuid in the fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 11f41e0a4974..ef1104c9df56 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -504,15 +504,18 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kuid_t kruid, keuid;
+	kuid_t kruid, keuid, kfsuid;
 
 	kruid = make_kuid(ns, ruid);
 	keuid = make_kuid(ns, euid);
+	kfsuid = make_kfsuid(ns, euid);
 
 	if ((ruid != (uid_t) -1) && !uid_valid(kruid))
 		return -EINVAL;
 	if ((euid != (uid_t) -1) && !uid_valid(keuid))
 		return -EINVAL;
+	if ((euid != (uid_t) -1) && !uid_valid(kfsuid))
+		return -EINVAL;
 
 	new = prepare_creds();
 	if (!new)
@@ -535,6 +538,9 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 		    !uid_eq(old->suid, keuid) &&
 		    !ns_capable_setid(old->user_ns, CAP_SETUID))
 			goto error;
+	} else {
+		uid_t fsuid = from_kuid_munged(new->user_ns, new->euid);
+		kfsuid = make_kfsuid(ns, fsuid);
 	}
 
 	if (!uid_eq(new->uid, old->uid)) {
@@ -545,7 +551,7 @@ long __sys_setreuid(uid_t ruid, uid_t euid)
 	if (ruid != (uid_t) -1 ||
 	    (euid != (uid_t) -1 && !uid_eq(keuid, old->uid)))
 		new->suid = new->euid;
-	new->fsuid = new->euid;
+	new->fsuid = kfsuid;
 
 	retval = security_task_fix_setuid(new, old, LSM_SETID_RE);
 	if (retval < 0)
-- 
2.25.0
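
The two cases described in the __sys_setreuid() commit message above (an
explicit euid versus euid == -1) can be sketched in userspace like this. It is
a simplified model, assuming single-extent mappings and ignoring the ruid and
permission handling; struct extent, map_up(), map_down(), NO_CHANGE and
model_setreuid_fsuid() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)
#define NO_CHANGE  ((uint32_t)-1)	/* models (uid_t) -1 from userspace */

/*
 * Hypothetical single-extent mapping: namespace ids [first, first + count)
 * correspond to kernel ids [lower_first, lower_first + count).
 */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace id -> kernel id (models make_kuid()/make_kfsuid()). */
static uint32_t map_down(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Kernel id -> namespace id (models from_kuid(), without munging). */
static uint32_t map_up(const struct extent *e, uint32_t kid)
{
	if (kid >= e->lower_first && kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return INVALID_ID;
}

/*
 * Compute the new kfsuid for a setreuid() call. An explicit euid must be
 * valid in both mappings. For NO_CHANGE the current keuid is translated up
 * through the id mapping and down through the fsid mapping. The patch uses
 * from_kuid_munged() there, which substitutes the overflow uid for unmapped
 * ids; this model stores INVALID_ID in that case instead.
 */
static int model_setreuid_fsuid(const struct extent *idmap,
				const struct extent *fsidmap, uint32_t euid,
				uint32_t cur_keuid, uint32_t *kfsuid)
{
	uint32_t kfs;

	assert(idmap && fsidmap && kfsuid);
	if (euid != NO_CHANGE) {
		kfs = map_down(fsidmap, euid);
		if (map_down(idmap, euid) == INVALID_ID || kfs == INVALID_ID)
			return -EINVAL;
	} else {
		kfs = map_down(fsidmap, map_up(idmap, cur_keuid));
	}

	*kfsuid = kfs;
	return 0;
}
```

The same shape applies to the setregid(), setresuid() and setresgid() patches
later in the series, with gids or the three-argument form substituted.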



* [PATCH 21/24] sys:__sys_setregid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (19 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 20/24] sys:__sys_setreuid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 22/24] sys:__sys_setresuid(): " Christian Brauner
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setregid() to look up fsids in the fsid mappings. If no fsid mappings are
set up, the behavior is unchanged, i.e. fsids are looked up in the id mappings.

During setregid() the kfsgid is set to the kegid corresponding to the egid that
is requested by userspace. If the requested egid is -1, the kfsgid is reset to
the current kegid. For the latter case this means we need to look up the
userspace egid corresponding to the current kegid in the id mappings and
translate this egid into the corresponding kfsgid in the fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index ef1104c9df56..41551c01c3eb 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -354,15 +354,18 @@ long __sys_setregid(gid_t rgid, gid_t egid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kgid_t krgid, kegid;
+	kgid_t krgid, kegid, kfsgid;
 
 	krgid = make_kgid(ns, rgid);
 	kegid = make_kgid(ns, egid);
+	kfsgid = make_kfsgid(ns, egid);
 
 	if ((rgid != (gid_t) -1) && !gid_valid(krgid))
 		return -EINVAL;
 	if ((egid != (gid_t) -1) && !gid_valid(kegid))
 		return -EINVAL;
+	if ((egid != (gid_t) -1) && !gid_valid(kfsgid))
+		return -EINVAL;
 
 	new = prepare_creds();
 	if (!new)
@@ -386,12 +389,15 @@ long __sys_setregid(gid_t rgid, gid_t egid)
 			new->egid = kegid;
 		else
 			goto error;
+	} else {
+		gid_t fsgid = from_kgid_munged(new->user_ns, new->egid);
+		kfsgid = make_kfsgid(ns, fsgid);
 	}
 
 	if (rgid != (gid_t) -1 ||
 	    (egid != (gid_t) -1 && !gid_eq(kegid, old->gid)))
 		new->sgid = new->egid;
-	new->fsgid = new->egid;
+	new->fsgid = kfsgid;
 
 	return commit_creds(new);
 
-- 
2.25.0



* [PATCH 22/24] sys:__sys_setresuid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (20 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 21/24] sys:__sys_setregid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 23/24] sys:__sys_setresgid(): " Christian Brauner
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setresuid() to look up fsids in the fsid mappings. If no fsid mappings
are set up, the behavior is unchanged, i.e. fsids are looked up in the id
mappings.

During setresuid() the kfsuid is set to the keuid corresponding to the euid that
is requested by userspace. If the requested euid is -1, the kfsuid is reset to
the current keuid. For the latter case this means we need to look up the
userspace euid corresponding to the current keuid in the id mappings and
translate this euid into the corresponding kfsuid in the fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 41551c01c3eb..3b98ce84607d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -650,11 +650,12 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kuid_t kruid, keuid, ksuid;
+	kuid_t kruid, keuid, ksuid, kfsuid;
 
 	kruid = make_kuid(ns, ruid);
 	keuid = make_kuid(ns, euid);
 	ksuid = make_kuid(ns, suid);
+	kfsuid = make_kfsuid(ns, euid);
 
 	if ((ruid != (uid_t) -1) && !uid_valid(kruid))
 		return -EINVAL;
@@ -665,6 +666,9 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 	if ((suid != (uid_t) -1) && !uid_valid(ksuid))
 		return -EINVAL;
 
+	if ((euid != (uid_t) -1) && !uid_valid(kfsuid))
+		return -EINVAL;
+
 	new = prepare_creds();
 	if (!new)
 		return -ENOMEM;
@@ -692,11 +696,15 @@ long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
 				goto error;
 		}
 	}
-	if (euid != (uid_t) -1)
+	if (euid != (uid_t) -1) {
 		new->euid = keuid;
+	} else {
+		uid_t fsuid = from_kuid_munged(new->user_ns, new->euid);
+		kfsuid = make_kfsuid(ns, fsuid);
+	}
 	if (suid != (uid_t) -1)
 		new->suid = ksuid;
-	new->fsuid = new->euid;
+	new->fsuid = kfsuid;
 
 	retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
 	if (retval < 0)
-- 
2.25.0



* [PATCH 23/24] sys:__sys_setresgid(): handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (21 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 22/24] sys:__sys_setresuid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 16:57 ` [PATCH 24/24] devpts: " Christian Brauner
  2020-02-11 20:55 ` [PATCH 00/24] user_namespace: introduce " Jann Horn
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

Switch setresgid() to look up fsids in the fsid mappings. If no fsid mappings
are set up, the behavior is unchanged, i.e. fsids are looked up in the id
mappings.

During setresgid() the kfsgid is set to the kegid corresponding to the egid that
is requested by userspace. If the requested egid is -1, the kfsgid is reset to
the current kegid. For the latter case this means we need to look up the
userspace egid corresponding to the current kegid in the id mappings and
translate this egid into the corresponding kfsgid in the fsid mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/sys.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 3b98ce84607d..674d0ba4887c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -750,11 +750,12 @@ long __sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 	const struct cred *old;
 	struct cred *new;
 	int retval;
-	kgid_t krgid, kegid, ksgid;
+	kgid_t krgid, kegid, ksgid, kfsgid;
 
 	krgid = make_kgid(ns, rgid);
 	kegid = make_kgid(ns, egid);
 	ksgid = make_kgid(ns, sgid);
+	kfsgid = make_kfsgid(ns, egid);
 
 	if ((rgid != (gid_t) -1) && !gid_valid(krgid))
 		return -EINVAL;
@@ -762,6 +763,8 @@ long __sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 		return -EINVAL;
 	if ((sgid != (gid_t) -1) && !gid_valid(ksgid))
 		return -EINVAL;
+	if ((egid != (gid_t) -1) && !gid_valid(kfsgid))
+		return -EINVAL;
 
 	new = prepare_creds();
 	if (!new)
@@ -783,11 +786,15 @@ long __sys_setresgid(gid_t rgid, gid_t egid, gid_t sgid)
 
 	if (rgid != (gid_t) -1)
 		new->gid = krgid;
-	if (egid != (gid_t) -1)
+	if (egid != (gid_t) -1) {
 		new->egid = kegid;
+	} else {
+		gid_t fsgid = from_kgid_munged(new->user_ns, new->egid);
+		kfsgid = make_kfsgid(ns, fsgid);
+	}
 	if (sgid != (gid_t) -1)
 		new->sgid = ksgid;
-	new->fsgid = new->egid;
+	new->fsgid = kfsgid;
 
 	return commit_creds(new);
 
-- 
2.25.0



* [PATCH 24/24] devpts: handle fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (22 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 23/24] sys:__sys_setresgid(): " Christian Brauner
@ 2020-02-11 16:57 ` " Christian Brauner
  2020-02-11 20:55 ` [PATCH 00/24] user_namespace: introduce " Jann Horn
  24 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2020-02-11 16:57 UTC (permalink / raw)
  To: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api,
	Christian Brauner

When a uid or gid mount option is specified with devpts, have it look up the
corresponding kfsids in the fsid mappings. If no fsid mappings are set up, the
behavior is unchanged, i.e. fsids are looked up in the id mappings.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/devpts/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index 42e5a766d33c..139958892572 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -24,6 +24,7 @@
 #include <linux/parser.h>
 #include <linux/fsnotify.h>
 #include <linux/seq_file.h>
+#include <linux/fsuidgid.h>
 
 #define DEVPTS_DEFAULT_MODE 0600
 /*
@@ -277,7 +278,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		case Opt_uid:
 			if (match_int(&args[0], &option))
 				return -EINVAL;
-			uid = make_kuid(current_user_ns(), option);
+			uid = make_kfsuid(current_user_ns(), option);
 			if (!uid_valid(uid))
 				return -EINVAL;
 			opts->uid = uid;
@@ -286,7 +287,7 @@ static int parse_mount_options(char *data, int op, struct pts_mount_opts *opts)
 		case Opt_gid:
 			if (match_int(&args[0], &option))
 				return -EINVAL;
-			gid = make_kgid(current_user_ns(), option);
+			gid = make_kfsgid(current_user_ns(), option);
 			if (!gid_valid(gid))
 				return -EINVAL;
 			opts->gid = gid;
@@ -410,7 +411,7 @@ static int devpts_show_options(struct seq_file *seq, struct dentry *root)
 			   from_kuid_munged(&init_user_ns, opts->uid));
 	if (opts->setgid)
 		seq_printf(seq, ",gid=%u",
-			   from_kgid_munged(&init_user_ns, opts->gid));
+			   from_kfsgid_munged(&init_user_ns, opts->gid));
 	seq_printf(seq, ",mode=%03o", opts->mode);
 	seq_printf(seq, ",ptmxmode=%03o", opts->ptmxmode);
 	if (opts->max < NR_UNIX98_PTY_MAX)
-- 
2.25.0



* Re: [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure
  2020-02-11 16:57 ` [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure Christian Brauner
@ 2020-02-11 17:26   ` Randy Dunlap
  0 siblings, 0 replies; 29+ messages in thread
From: Randy Dunlap @ 2020-02-11 17:26 UTC (permalink / raw)
  To: Christian Brauner, Stéphane Graber, Eric W. Biederman,
	Aleksa Sarai, Jann Horn
  Cc: smbarber, Alexander Viro, Alexey Dobriyan, Serge Hallyn,
	James Morris, Kees Cook, Jonathan Corbet, linux-kernel,
	linux-fsdevel, containers, linux-security-module, linux-api

On 2/11/20 8:57 AM, Christian Brauner wrote:
> diff --git a/init/Kconfig b/init/Kconfig
> index a34064a031a5..4da082e4f787 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1102,6 +1102,17 @@ config USER_NS
>  
>  	  If unsure, say N.
>  
> +config USER_NS_FSID
> +	bool "User namespace fsid mappings"
> +	depends on USER_NS
> +	default n
> +	help
> +	  This allows containers, to alter their filesystem id mappings.

                   no comma   ^^^^

> +	  With this containers with different id mappings can still share
> +	  the same filesystem.
> +
> +	  If unsure, say N.
> +
>  config PID_NS
>  	bool "PID Namespaces"
>  	default y


-- 
~Randy


* Re: [PATCH 00/24] user_namespace: introduce fsid mappings
  2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
                   ` (23 preceding siblings ...)
  2020-02-11 16:57 ` [PATCH 24/24] devpts: " Christian Brauner
@ 2020-02-11 20:55 ` " Jann Horn
  2020-02-12 14:51   ` Christian Brauner
  24 siblings, 1 reply; 29+ messages in thread
From: Jann Horn @ 2020-02-11 20:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Stéphane Graber, Eric W. Biederman, Aleksa Sarai, smbarber,
	Alexander Viro, Alexey Dobriyan, Serge Hallyn, James Morris,
	Kees Cook, Jonathan Corbet, kernel list, linux-fsdevel,
	Linux Containers, linux-security-module, Linux API

On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> This is the implementation of shiftfs which was cooked up during lunch at
> Linux Plumbers 2019 the day after the container's microconference. The
> idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
> we all were quite busy with other work and couldn't really sit down and
> implement it. But I took a few days last week to do this work, including
> demos and performance testing.
> This implementation does not require us to touch the vfs substantially
> at all. Instead, we implement shiftfs via fsid mappings.
> With this patch, it took me 20 mins to port both LXD and LXC to support
> shiftfs via fsid mappings.
>
> For anyone wanting to play with this the branch can be pulled from:
> https://github.com/brauner/linux/tree/fsid_mappings
> https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
>
> The main use case for shiftfs for us is in allowing shared writable
> storage to multiple containers using non-overlapping id mappings.
> In such a scenario you want the fsids to be valid and identical in both
> containers for the shared mount. A demo for this exists in [3].
> If you don't want to read on, go straight to the other demos below in
> [1] and [2].

I guess essentially this means that you want to have UID separation
between containers to prevent the containers - or their owners - from
interfering with each other, but for filesystem access, you don't
want to isolate them from each other using DAC controls on the files
and folders inside the containers' directory hierarchies, instead
relying on mode-0700 parent directories to restrict access to the
container owner? Or would you still have separate UIDs for e.g. the
container's UID range 0-65535, and then map the shared UID range at
100000, or something like that?

> People not as familiar with user namespaces might not be aware that fsid
> mappings already exist. Right now, fsid mappings are always identical to
> id mappings. Specifically, the kernel will lookup fsuids in the uid
> mappings and fsgids in the gid mappings of the relevant user namespace.

That's a bit like saying that a kernel without CONFIG_USER_NS still
has user ID mappings, they just happen to be identity mappings. :P

> With this patch series we simply introduce the ability to create fsid
> mappings that are different from the id mappings of a user namespace.
>
> In the usual case of running an unprivileged container we will have
> setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> correspond to this id mapping, i.e. all files which we want to appear as
> 0:0 inside the user namespace will be chowned to 100000:100000 on the
> host. This works, because whenever the kernel needs to do a filesystem
> access it will lookup the corresponding uid and gid in the idmapping
> tables of the container.
> Now think about the case where we want to have an id mapping of 0 100000
> 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> share a single on-disk mapping with multiple containers that all have
> different id mappings.
> This will be problematic. Whenever a filesystem access is requested, the
> kernel will now try to lookup a mapping for 300000 in the id mapping
> tables of the user namespace but since there is none the files will
> appear to be owned by the overflow id, i.e. usually 65534:65534 or
> nobody:nogroup.
>
> With fsid mappings we can solve this by writing an id mapping of 0
> 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> access the kernel will now lookup the mapping for 300000 in the fsid
> mapping tables of the user namespace. And since such a mapping exists,
> the corresponding files will have correct ownership.

Sorry to bring up something as disgusting as setuid execution, but:
What happens when there's a setuid root file with ->i_uid==300000? I
guess the only way to make that work inside the containers would be
something like make_kuid(current_user_ns(),
from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
path?
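
To make that translation concrete, here is a rough userspace model (not
kernel code). struct extent, map_up(), map_down() and setuid_target() are
names made up for this sketch; they only mimic what from_kfsuid() and
make_kuid() would do with a single-extent mapping, using the numbers from
the cover letter:

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* Hypothetical single-extent mapping: ns ids [first, first + count) map to
 * kernel ids [lower_first, lower_first + count). */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Kernel id -> namespace id (the direction from_kfsuid() goes). */
static uint32_t map_up(const struct extent *e, uint32_t kid)
{
	if (kid >= e->lower_first && kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return INVALID_ID;
}

/* Namespace id -> kernel id (the direction make_kuid() goes). */
static uint32_t map_down(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* On-disk owner -> kuid a setuid binary would run as, or INVALID_ID. */
uint32_t setuid_target(const struct extent *id_map,
		       const struct extent *fsid_map, uint32_t i_uid)
{
	uint32_t ns_id = map_up(fsid_map, i_uid);

	if (ns_id == INVALID_ID)
		return INVALID_ID;
	return map_down(id_map, ns_id);
}

/* Sanity check: id mapping 0 100000 100000, fsid mapping 0 300000 100000. */
void check_setuid_target(void)
{
	const struct extent id_map = { 0, 100000, 100000 };
	const struct extent fsid_map = { 0, 300000, 100000 };

	/* A file owned by 300000 on disk is "root" in the ns, i.e. kuid 100000. */
	assert(setuid_target(&id_map, &fsid_map, 300000) == 100000);
}
```

In this model an owner outside the fsid mapping yields INVALID_ID, which
would correspond to ignoring the setuid bit.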

> A note on proc (and sys), the proc filesystem is special in sofar as it
> only has a single superblock that is (currently but might be about to
> change) visible in all user namespaces (same goes for sys). This means
> it has special semantics in many ways, including how file ownership and
> access works. The fsid mapping implementation does not alter how proc
> (and sys) ownership works. proc and sys will both continue to lookup
> filesystem access in id mapping tables.

In your example, a process with namespaced UID set (0, 0, 0, 0) will
have kernel UIDs (100000, 100000, 100000, 300000), right? And then if
I want to open /proc/$pid/personality of another process with the same
UIDs, may_open() will call inode_permission() -> do_inode_permission()
-> generic_permission() -> acl_permission_check(), which will compare
current_fsuid() (which is 300000) against inode->i_uid. But
inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
which set inode->i_uid to 100000, right?

Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
with another task's real/effective/saved UID.
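
The mismatch I mean can be modelled in userspace roughly like this. It is
a simplified sketch, not kernel code: struct model_cred and the helpers
are invented for illustration, single-extent mappings are assumed, and
proc_inode_owner() ignores everything task_dump_owner() does besides
taking the target's euid:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* Hypothetical single-extent mapping: ns ids [first, first + count) map to
 * kernel ids [lower_first, lower_first + count). */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Namespace id -> kernel id. */
static uint32_t map_down(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Userspace stand-in for the uid fields of struct cred (kernel ids). */
struct model_cred {
	uint32_t uid, euid, suid, fsuid;
};

/* ruid/euid/suid go through the id mapping, fsuid through the fsid mapping. */
struct model_cred cred_for_ns_uid(const struct extent *id_map,
				  const struct extent *fsid_map,
				  uint32_t ns_uid)
{
	struct model_cred c;

	c.uid = c.euid = c.suid = map_down(id_map, ns_uid);
	c.fsuid = map_down(fsid_map, ns_uid);
	return c;
}

/* Simplified: the /proc/$pid inode is owned by the target's kernel euid. */
uint32_t proc_inode_owner(const struct model_cred *target)
{
	return target->euid;
}

/* First comparison in acl_permission_check(): opener fsuid vs. i_uid. */
bool owner_class_applies(const struct model_cred *opener, uint32_t i_uid)
{
	return opener->fsuid == i_uid;
}

void check_proc_mismatch(void)
{
	const struct extent id_map = { 0, 100000, 100000 };
	const struct extent fsid_map = { 0, 300000, 100000 };
	struct model_cred c = cred_for_ns_uid(&id_map, &fsid_map, 0);

	/* Same credentials on both sides, yet the owner check fails. */
	assert(!owner_class_applies(&c, proc_inode_owner(&c)));
}
```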

[...]
> # Demos
> [1]: Create a container with different id and fsid mappings.
>      https://asciinema.org/a/300233
> [2]: Create a container with id mappings but without fsid mappings.
>      https://asciinema.org/a/300234
> [3]: Share storage between multiple containers with non-overlapping id
>      mappings.
>      https://asciinema.org/a/300235

(I really dislike this asciinema thing; if you want to quickly glance
through the output instead of reading at the same speed as it was
typed, a simple pastebin works much better unless you absolutely have
to show things that use stuff like ncurses UI.)


* Re: [PATCH 00/24] user_namespace: introduce fsid mappings
  2020-02-11 20:55 ` [PATCH 00/24] user_namespace: introduce " Jann Horn
@ 2020-02-12 14:51   ` Christian Brauner
  2020-02-12 18:53     ` Jann Horn
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2020-02-12 14:51 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-security-module, Kees Cook, Jonathan Corbet, kernel list,
	Linux Containers, smbarber, Alexander Viro, Linux API,
	linux-fsdevel, Alexey Dobriyan, Eric W. Biederman

On Tue, Feb 11, 2020 at 09:55:46PM +0100, Jann Horn via Containers wrote:
> On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > This is the implementation of shiftfs which was cooked up during lunch at
> > Linux Plumbers 2019 the day after the container's microconference. The
> > idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
> > we all were quite busy with other work and couldn't really sit down and
> > implement it. But I took a few days last week to do this work, including
> > demos and performance testing.
> > This implementation does not require us to touch the vfs substantially
> > at all. Instead, we implement shiftfs via fsid mappings.
> > With this patch, it took me 20 mins to port both LXD and LXC to support
> > shiftfs via fsid mappings.
> >
> > For anyone wanting to play with this the branch can be pulled from:
> > https://github.com/brauner/linux/tree/fsid_mappings
> > https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
> >
> > The main use case for shiftfs for us is in allowing shared writable
> > storage to multiple containers using non-overlapping id mappings.
> > In such a scenario you want the fsids to be valid and identical in both
> > containers for the shared mount. A demo for this exists in [3].
> > If you don't want to read on, go straight to the other demos below in
> > [1] and [2].
> 
> I guess essentially this means that you want to have UID separation
> between containers to prevent the containers - or their owners - from
> interfering between each other, but for filesystem access, you don't
> want to isolate them from each other using DAC controls on the files
> and folders inside the containers' directory hierarchies, instead
> relying on mode-0700 parent directories to restrict access to the
> container owner? Or would you still have separate UIDs for e.g. the
> container's UID range 0-65535, and then map the shared UID range at
> 100000, or something like that?

Yes.
So if you look at the permissions right now for the directory under
which the rootfs for the container and other stuff resides, we have:
root@wittgenstein|/var/lib/lxd/storage-pools/zfs/containers
> perms *
d--x------ 100 alp1
d--x------ 100 f1
d--x------ 100 f2

We don't really share the rootfs between containers right now since we
treat them as standalone systems but with fsid mappings that's possible
too. Layer-sharing-centric runtimes very much will want something like
that.

> 
> > People not as familiar with user namespaces might not be aware that fsid
> > mappings already exist. Right now, fsid mappings are always identical to
> > id mappings. Specifically, the kernel will lookup fsuids in the uid
> > mappings and fsgids in the gid mappings of the relevant user namespace.
> 
> That's a bit like saying that a kernel without CONFIG_USER_NS still
> has user ID mappings, they just happen to be identity mappings. :P

If you have CONFIG_USER_NS=n then you have (as you're well aware)
[<0, 0>, <1,1>, ..., <n,n>] so yeah that's true and analyzing it like
that makes sense. :P

> 
> > With this patch series we simply introduce the ability to create fsid
> > mappings that are different from the id mappings of a user namespace.
> >
> > In the usual case of running an unprivileged container we will have
> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > correspond to this id mapping, i.e. all files which we want to appear as
> > 0:0 inside the user namespace will be chowned to 100000:100000 on the
> > host. This works, because whenever the kernel needs to do a filesystem
> > access it will lookup the corresponding uid and gid in the idmapping
> > tables of the container.
> > Now think about the case where we want to have an id mapping of 0 100000
> > 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> > share a single on-disk mapping with multiple containers that all have
> > different id mappings.
> > This will be problematic. Whenever a filesystem access is requested, the
> > kernel will now try to lookup a mapping for 300000 in the id mapping
> > tables of the user namespace but since there is none the files will
> > appear to be owned by the overflow id, i.e. usually 65534:65534 or
> > nobody:nogroup.
> >
> > With fsid mappings we can solve this by writing an id mapping of 0
> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > access the kernel will now lookup the mapping for 300000 in the fsid
> > mapping tables of the user namespace. And since such a mapping exists,
> > the corresponding files will have correct ownership.
> 
> Sorry to bring up something as disgusting as setuid execution, but:

No that's exactly what this needs. :)

> What happens when there's a setuid root file with ->i_uid==300000? I
> guess the only way to make that work inside the containers would be
> something like make_kuid(current_user_ns(),
> from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
> path?

What's the specific callpath you're thinking about?

So if you look at patch
https://lore.kernel.org/lkml/20200211165753.356508-16-christian.brauner@ubuntu.com/
it does
-	new->suid = new->fsuid = new->euid;
-	new->sgid = new->fsgid = new->egid;
+	fsuid = from_kuid_munged(new->user_ns, new->euid);
+	kfsuid = make_kfsuid(new->user_ns, fsuid);
+	new->suid = new->euid;
+	new->fsuid = kfsuid;
+
+	fsgid = from_kgid_munged(new->user_ns, new->egid);
+	kfsgid = make_kfsgid(new->user_ns, fsgid);
+	new->sgid = new->egid;
+	new->fsgid = kfsgid;
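
In other words, the new euid is translated back into the namespace and
then down again through the fsid mapping. A rough userspace model of that
derivation (not kernel code; struct extent and the helper names are made
up here, mappings are single-extent, and 65534 is assumed as the overflow
uid):

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_ID   ((uint32_t)-1)
#define OVERFLOW_UID 65534u	/* assumed default overflowuid */

/* Hypothetical single-extent mapping: ns ids [first, first + count) map to
 * kernel ids [lower_first, lower_first + count). */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Kernel id -> namespace id, falling back to the overflow id like the
 * *_munged helpers do. */
static uint32_t map_up_munged(const struct extent *e, uint32_t kid)
{
	if (kid >= e->lower_first && kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return OVERFLOW_UID;
}

/* Namespace id -> kernel id. */
static uint32_t map_down(const struct extent *e, uint32_t id)
{
	if (id >= e->first && id - e->first < e->count)
		return e->lower_first + (id - e->first);
	return INVALID_ID;
}

/* Models: fsuid = from_kuid_munged(ns, euid); kfsuid = make_kfsuid(ns, fsuid) */
uint32_t derive_kfsuid(const struct extent *id_map,
		       const struct extent *fsid_map, uint32_t euid)
{
	uint32_t ns_fsuid = map_up_munged(id_map, euid);

	return map_down(fsid_map, ns_fsuid);
}

void check_derive_kfsuid(void)
{
	const struct extent id_map = { 0, 100000, 100000 };
	const struct extent fsid_map = { 0, 300000, 100000 };

	/* euid 100000 is ns uid 0, which the fsid mapping places at 300000. */
	assert(derive_kfsuid(&id_map, &fsid_map, 100000) == 300000);
}
```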

One thing I definitely missed though in the setuid path is to adapt
fs/exec.c:bprm_fill_uid():

diff --git a/fs/exec.c b/fs/exec.c
index 74d88dab98dd..ad839934fdff 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1547,8 +1547,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
        inode_unlock(inode);

        /* We ignore suid/sgid if there are no mappings for them in the ns */
-       if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
-                !kgid_has_mapping(bprm->cred->user_ns, gid))
+       if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
+                !kfsgid_has_mapping(bprm->cred->user_ns, gid))
                return;

        if (mode & S_ISUID) {
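
Why the check has to move to the fsid mapping can be seen in a small
userspace model (again not kernel code; struct extent, has_mapping() and
suid_bits_considered() are illustrative names, and single-extent mappings
are assumed):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-extent mapping: ns ids [first, first + count) map to
 * kernel ids [lower_first, lower_first + count). */
struct extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Does the kernel id fall inside the range covered by this mapping? */
static bool has_mapping(const struct extent *e, uint32_t kid)
{
	return kid >= e->lower_first && kid - e->lower_first < e->count;
}

/* Models the early return: suid/sgid are ignored unless both the file's
 * owner and group are mapped in the tables that get consulted. */
bool suid_bits_considered(const struct extent *uid_map,
			  const struct extent *gid_map,
			  uint32_t i_uid, uint32_t i_gid)
{
	return has_mapping(uid_map, i_uid) && has_mapping(gid_map, i_gid);
}

void check_suid_bits(void)
{
	const struct extent id_map = { 0, 100000, 100000 };
	const struct extent fsid_map = { 0, 300000, 100000 };

	/* Against the id mapping, a 300000:300000 file is never considered. */
	assert(!suid_bits_considered(&id_map, &id_map, 300000, 300000));
	/* Against the fsid mapping it is. */
	assert(suid_bits_considered(&fsid_map, &fsid_map, 300000, 300000));
}
```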

> 
> > A note on proc (and sys), the proc filesystem is special in sofar as it
> > only has a single superblock that is (currently but might be about to
> > change) visible in all user namespaces (same goes for sys). This means
> > it has special semantics in many ways, including how file ownership and
> > access works. The fsid mapping implementation does not alter how proc
> > (and sys) ownership works. proc and sys will both continue to lookup
> > filesystem access in id mapping tables.
> 
> In your example, a process with namespaced UID set (0, 0, 0, 0) will
> have kernel UIDs (100000, 100000, 100000, 300000), right? And then if

Yes.

> I want to open /proc/$pid/personality of another process with the same
> UIDs, may_open() will call inode_permission() -> do_inode_permission()
> -> generic_permission() -> acl_permission_check(), which will compare
> current_fsuid() (which is 300000) against inode->i_uid. But
> inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
> which set inode->i_uid to 100000, right?

Yes. That should be fixable by something like below, I think. (And we can
probably shortcut this by adding a helper that tells us whether any fsid
mapping has been set up for this user namespace.)
 static int acl_permission_check(struct inode *inode, int mask)
 {
+       kuid_t kuid;
        unsigned int mode = inode->i_mode;

-       if (likely(uid_eq(current_fsuid(), inode->i_uid)))
+       if (!is_userns_visible(inode->i_sb->s_iflags)) {
+               kuid = inode->i_uid;
+       } else {
+               kuid = make_kuid(current_user_ns(),
+                                from_kfsuid(current_user_ns(), inode->i_uid));
+       }
+
+       if (likely(uid_eq(current_fsuid(), kuid)))
                mode >>= 6;
        else {
                if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {

> 
> Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
> with another task's real/effective/saved UID.

Right, you even introduced this check in 2015 iirc.
Both of your points make me think that it'd be easiest to introduce
cred->{kfsuid,kfsgid} and whenever an access decision on a
is_userns_visible() filesystem has to be made those will be used. This avoids
having to do on-the-fly translations and ptrace_may_access() can just grow a
flag indicating what fscreds it's supposed to look at?

> 
> [...]
> > # Demos
> > [1]: Create a container with different id and fsid mappings.
> >      https://asciinema.org/a/300233
> > [2]: Create a container with id mappings but without fsid mappings.
> >      https://asciinema.org/a/300234
> > [3]: Share storage between multiple containers with non-overlapping id
> >      mappings.
> >      https://asciinema.org/a/300235
> 
> (I really dislike this asciinema thing; if you want to quickly glance
> through the output instead of reading at the same speed as it was
> typed, a simple pastebin works much better unless you absolutely have
> to show things that use stuff like ncurses UI.)

Hmkay, I went through the trouble of converting the asciinema output to
basic shell for all three demos. :) I made them available as GitHub gists.
So:
demo1: https://gist.github.com/brauner/8e1117720b3f9fab22e44c17f12184bf
demo2: https://gist.github.com/brauner/41a36026a9a1496af0095dce1545548e
demo3: https://gist.github.com/brauner/4586d6bc680a018bc8e1dd114a45592a


* Re: [PATCH 00/24] user_namespace: introduce fsid mappings
  2020-02-12 14:51   ` Christian Brauner
@ 2020-02-12 18:53     ` Jann Horn
  0 siblings, 0 replies; 29+ messages in thread
From: Jann Horn @ 2020-02-12 18:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-security-module, Kees Cook, Jonathan Corbet, kernel list,
	Linux Containers, smbarber, Alexander Viro, Linux API,
	linux-fsdevel, Alexey Dobriyan, Eric W. Biederman

On Wed, Feb 12, 2020 at 3:51 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
> On Tue, Feb 11, 2020 at 09:55:46PM +0100, Jann Horn via Containers wrote:
> > On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > This is the implementation of shiftfs which was cooked up during lunch at
> > > Linux Plumbers 2019 the day after the container's microconference. The
> > > idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then
> > > we all were quite busy with other work and couldn't really sit down and
> > > implement it. But I took a few days last week to do this work, including
> > > demos and performance testing.
> > > This implementation does not require us to touch the vfs substantially
> > > at all. Instead, we implement shiftfs via fsid mappings.
> > > With this patch, it took me 20 mins to port both LXD and LXC to support
> > > shiftfs via fsid mappings.
> > >
> > > For anyone wanting to play with this the branch can be pulled from:
> > > https://github.com/brauner/linux/tree/fsid_mappings
> > > https://gitlab.com/brauner/linux/-/tree/fsid_mappings
> > > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings
> > >
> > > The main use case for shiftfs for us is in allowing shared writable
> > > storage to multiple containers using non-overlapping id mappings.
> > > In such a scenario you want the fsids to be valid and identical in both
> > > containers for the shared mount. A demo for this exists in [3].
> > > If you don't want to read on, go straight to the other demos below in
> > > [1] and [2].
> >
> > I guess essentially this means that you want to have UID separation
> > between containers to prevent the containers - or their owners - from
> > interfering between each other, but for filesystem access, you don't
> > want to isolate them from each other using DAC controls on the files
> > and folders inside the containers' directory hierarchies, instead
> > relying on mode-0700 parent directories to restrict access to the
> > container owner? Or would you still have separate UIDs for e.g. the
> > container's UID range 0-65535, and then map the shared UID range at
> > 100000, or something like that?
>
> Yes.
> So if you look at the permissions right now for the directory under
> which the rootfs for the container and other stuff resides we have
> root@wittgenstein|/var/lib/lxd/storage-pools/zfs/containers
> > perms *
> d--x------ 100 alp1
> d--x------ 100 f1
> d--x------ 100 f2
>
> We don't really share the rootfs between containers right now since we
> treat them as standalone systems but with fsid mappings that's possible
> too. Layer-sharing-centric runtimes very much will want something like
> that.
[...]
> > > With this patch series we simply introduce the ability to create fsid
> > > mappings that are different from the id mappings of a user namespace.
> > >
> > > In the usual case of running an unprivileged container we will have
> > > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
> > > correspond to this id mapping, i.e. all files which we want to appear as
> > > 0:0 inside the user namespace will be chowned to 100000:100000 on the
> > > host. This works, because whenever the kernel needs to do a filesystem
> > > access it will lookup the corresponding uid and gid in the idmapping
> > > tables of the container.
> > > Now think about the case where we want to have an id mapping of 0 100000
> > > 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g.
> > > share a single on-disk mapping with multiple containers that all have
> > > different id mappings.
> > > This will be problematic. Whenever a filesystem access is requested, the
> > > kernel will now try to lookup a mapping for 300000 in the id mapping
> > > tables of the user namespace but since there is none the files will
> > > appear to be owned by the overflow id, i.e. usually 65534:65534 or
> > > nobody:nogroup.
> > >
> > > With fsid mappings we can solve this by writing an id mapping of 0
> > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
> > > access the kernel will now lookup the mapping for 300000 in the fsid
> > > mapping tables of the user namespace. And since such a mapping exists,
> > > the corresponding files will have correct ownership.
> >
> > Sorry to bring up something as disgusting as setuid execution, but:
>
> No that's exactly what this needs. :)
>
> > What happens when there's a setuid root file with ->i_uid==300000? I
> > guess the only way to make that work inside the containers would be
> > something like make_kuid(current_user_ns(),
> > from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve
> > path?
>
> What's the specific callpath you're thinking about?
>
> So if you look at patch
> https://lore.kernel.org/lkml/20200211165753.356508-16-christian.brauner@ubuntu.com/
> it does
> -       new->suid = new->fsuid = new->euid;
> -       new->sgid = new->fsgid = new->egid;
> +       fsuid = from_kuid_munged(new->user_ns, new->euid);
> +       kfsuid = make_kfsuid(new->user_ns, fsuid);
> +       new->suid = new->euid;
> +       new->fsuid = kfsuid;
> +
> +       fsgid = from_kgid_munged(new->user_ns, new->egid);
> +       kfsgid = make_kfsgid(new->user_ns, fsgid);
> +       new->sgid = new->egid;
> +       new->fsgid = kfsgid;

Aaah, okay, I missed that.

> One thing I definitely missed though in the setuid path is to adapt
> fs/exec.c:bprm_fill_uid():
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 74d88dab98dd..ad839934fdff 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1547,8 +1547,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
>         inode_unlock(inode);
>
>         /* We ignore suid/sgid if there are no mappings for them in the ns */
> -       if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
> -                !kgid_has_mapping(bprm->cred->user_ns, gid))
> +       if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) ||
> +                !kfsgid_has_mapping(bprm->cred->user_ns, gid))
>                 return;
>
>         if (mode & S_ISUID) {
[...]
> > I want to open /proc/$pid/personality of another process with the same
> > UIDs, may_open() will call inode_permission() -> do_inode_permission()
> > -> generic_permission() -> acl_permission_check(), which will compare
> > current_fsuid() (which is 300000) against inode->i_uid. But
> > inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(),
> > which set inode->i_uid to 100000, right?
>
> Yes. That should be fixable by something like below, I think. (And we can
> probably shortcut this by adding a helper that does tell us whether there's
> been any fsid mapping setup or not for this user namespace.)
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
> +       kuid_t kuid;
>         unsigned int mode = inode->i_mode;
>
> -       if (likely(uid_eq(current_fsuid(), inode->i_uid)))
> +       if (!is_userns_visible(inode->i_sb->s_iflags)) {
> +               kuid = inode->i_uid;
> +       } else {
> +               kuid = make_kuid(current_user_ns(),
> +                                from_kfsuid(current_user_ns(), inode->i_uid));
> +       }
> +
> +       if (likely(uid_eq(current_fsuid(), kuid)))
>                 mode >>= 6;
>         else {
>                 if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
>
> >
> > Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison
> > with another task's real/effective/saved UID.
>
> Right, you even introduced this check in 2015 iirc.
> Both of your points make me think that it'd be easiest to introduce
> cred->{kfsuid,kfsgid} and whenever an access decision on a
> is_userns_visible() filesystem has to be made those will be used. This avoids
> having to do on-the-fly translations

I guess that might be less ugly.

> and ptrace_may_access() can just grow a
> flag indicating what fscreds it's supposed to look at?

Wouldn't you always end up using the "real" fsuid there?



Thread overview: 29+ messages
2020-02-11 16:57 [PATCH 00/24] user_namespace: introduce fsid mappings Christian Brauner
2020-02-11 16:57 ` [PATCH 01/24] user_namespace: introduce fsid mappings infrastructure Christian Brauner
2020-02-11 17:26   ` Randy Dunlap
2020-02-11 16:57 ` [PATCH 02/24] proc: add /proc/<pid>/fsuid_map Christian Brauner
2020-02-11 16:57 ` [PATCH 03/24] proc: add /proc/<pid>/fsgid_map Christian Brauner
2020-02-11 16:57 ` [PATCH 04/24] fsuidgid: add fsid mapping helpers Christian Brauner
2020-02-11 16:57 ` [PATCH 05/24] proc: task_state(): use from_kfs{g,u}id_munged Christian Brauner
2020-02-11 16:57 ` [PATCH 06/24] fs: add is_userns_visible() helper Christian Brauner
2020-02-11 16:57 ` [PATCH 07/24] namei: may_{o_}create(): handle fsid mappings Christian Brauner
2020-02-11 16:57 ` [PATCH 08/24] inode: inode_owner_or_capable(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 09/24] capability: privileged_wrt_inode_uidgid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 10/24] stat: " Christian Brauner
2020-02-11 16:57 ` [PATCH 11/24] open: chown_common(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 12/24] posix_acl: " Christian Brauner
2020-02-11 16:57 ` [PATCH 13/24] attr: notify_change(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 14/24] commoncap: cap_task_fix_setuid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 15/24] commoncap:cap_bprm_set_creds(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 16/24] sys: __sys_setfsuid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 17/24] sys: __sys_setfsgid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 18/24] sys:__sys_setuid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 19/24] sys:__sys_setgid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 20/24] sys:__sys_setreuid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 21/24] sys:__sys_setregid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 22/24] sys:__sys_setresuid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 23/24] sys:__sys_setresgid(): " Christian Brauner
2020-02-11 16:57 ` [PATCH 24/24] devpts: " Christian Brauner
2020-02-11 20:55 ` [PATCH 00/24] user_namespace: introduce " Jann Horn
2020-02-12 14:51   ` Christian Brauner
2020-02-12 18:53     ` Jann Horn
