All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 0/1] ns: introduce binfmt_misc namespace
@ 2019-12-16  9:12 Laurent Vivier
  2019-12-16  9:12 ` [PATCH v8 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
  2019-12-16  9:46 ` [PATCH v8 0/1] ns: introduce binfmt_misc namespace Christian Brauner
  0 siblings, 2 replies; 19+ messages in thread
From: Laurent Vivier @ 2019-12-16  9:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kurz, Jann Horn, Andrei Vagin, linux-api, Dmitry Safonov,
	James Bottomley, Jan Kiszka, Christian Brauner, linux-fsdevel,
	containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater, Laurent Vivier

v8: s/file->f_path.dentry/file_dentry(file)/

v7: Use the new mount API

    Replace

      static struct dentry *bm_mount(struct file_system_type *fs_type,
                            int flags, const char *dev_name, void *data)
      {
               struct user_namespace *ns = current_user_ns();

               return mount_ns(fs_type, flags, data, ns, ns,
                               bm_fill_super);
      }

    by

      static void bm_free(struct fs_context *fc)
      {
             if (fc->s_fs_info)
                     put_user_ns(fc->s_fs_info);
      }

      static int bm_get_tree(struct fs_context *fc)
      {
              return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
      }

      static const struct fs_context_operations bm_context_ops = {
              .free           = bm_free,
              .get_tree       = bm_get_tree,
      };

      static int bm_init_fs_context(struct fs_context *fc)
      {
              fc->ops = &bm_context_ops;
              return 0;
      }

v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
    This should never happen, but to stay safe return a
    value we can use.
    change subject from "RFC" to "PATCH"

v5: Use READ_ONCE()/WRITE_ONCE()
    move mount pointer struct init to bm_fill_super() and add smp_wmb()
    remove useless NULL value init
    add WARN_ON_ONCE()

v4: first user namespace is initialized with &init_binfmt_ns,
    all new user namespaces are initialized with a NULL and use
    the one of the first parent that is not NULL. The pointer
    is initialized to a valid value the first time the binfmt_misc
    fs is mounted in the current user namespace.
    This allows to not change the way it was working before:
    new ns inherits values from its parent, and if parent value is modified
    (or parent creates its own binfmt entry by mounting the fs) child
    inherits it (unless it has itself mounted the fs).

v3: create a structure to store binfmt_misc data,
    add a pointer to this structure in the user_namespace structure,
    in init_user_ns structure this pointer points to an init_binfmt_ns
    structure. And all new user namespaces point to this init structure.
    A new binfmt namespace structure is allocated if the binfmt_misc
    filesystem is mounted in a user namespace that is not the initial
    one but its binfmt namespace pointer points to the initial one.
    add override_creds()/revert_creds() around open_exec() in
    bm_register_write()

v2: no new namespace, binfmt_misc data are now part of
    the mount namespace
    I put this in mount namespace instead of user namespace
    because the mount namespace is already needed and
    I don't want to force to have the user namespace for that.
    As this is a filesystem, it seems logic to have it here.

This allows to define a new interpreter for each new container.

But the main goal is to be able to chroot to a directory
using a binfmt_misc interpreter without being root.

I have a modified version of unshare at:

  https://github.com/vivier/util-linux.git branch unshare-chroot

with some new options to unshare binfmt_misc namespace and to chroot
to a directory.

If you have a directory /chroot/powerpc/jessie containing debian for powerpc
binaries and a qemu-ppc interpreter, you can do for instance:

 $ uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
 $ ./unshare --map-root-user --fork --pid \
   --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
   --root=/chroot/powerpc/jessie /bin/bash -l
 # uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
 # id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
 # ls -l
total 5940
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
-rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var

If you want to use the qemu binary provided by your distro, you can use

    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"

With the 'F' flag, qemu-ppc-static will be then loaded from the main root
filesystem before switching to the chroot.

Another example is to use the 'P' flag in one chroot and not in another one (useful in a test
environment to test different configurations of the same interpreter):

./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-noargv0:OCF" --root=/chroot/powerpc/jessie /bin/bash -l
root@localhost:/# sh -c 'echo $0'
/bin/sh

./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-argv0:OCFP" --root=/chroot/powerpc/jessie /bin/bash -l
root@localhost:/# sh -c 'echo $0'
sh

Laurent Vivier (1):
  ns: add binfmt_misc to the user namespace

 fs/binfmt_misc.c               | 115 +++++++++++++++++++++++++--------
 include/linux/user_namespace.h |  15 +++++
 kernel/user.c                  |  14 ++++
 kernel/user_namespace.c        |   3 +
 4 files changed, 119 insertions(+), 28 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2019-12-16  9:12 [PATCH v8 0/1] ns: introduce binfmt_misc namespace Laurent Vivier
@ 2019-12-16  9:12 ` Laurent Vivier
  2019-12-16 19:08   ` Jann Horn
  2021-01-08  8:22     ` Jan Kiszka
  2019-12-16  9:46 ` [PATCH v8 0/1] ns: introduce binfmt_misc namespace Christian Brauner
  1 sibling, 2 replies; 19+ messages in thread
From: Laurent Vivier @ 2019-12-16  9:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kurz, Jann Horn, Andrei Vagin, linux-api, Dmitry Safonov,
	James Bottomley, Jan Kiszka, Christian Brauner, linux-fsdevel,
	containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater, Laurent Vivier

This patch allows to have a different binfmt_misc configuration
for each new user namespace. By default, the binfmt_misc configuration
is the one of the previous level, but if the binfmt_misc filesystem is
mounted in the new namespace a new empty binfmt instance is created and
used in this namespace.

For instance, using "unshare" we can start a chroot of another
architecture and configure the binfmt_misc interpreter without being root
to run the binaries in this chroot.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Acked-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Henning Schild <henning.schild@siemens.com>
---
 fs/binfmt_misc.c               | 115 +++++++++++++++++++++++++--------
 include/linux/user_namespace.h |  15 +++++
 kernel/user.c                  |  14 ++++
 kernel/user_namespace.c        |   3 +
 4 files changed, 119 insertions(+), 28 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index cdb45829354d..17fa1f56ca2e 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -40,9 +40,6 @@ enum {
 	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
 };
 
-static LIST_HEAD(entries);
-static int enabled = 1;
-
 enum {Enabled, Magic};
 #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
 #define MISC_FMT_OPEN_BINARY (1 << 30)
@@ -62,10 +59,7 @@ typedef struct {
 	struct file *interp_file;
 } Node;
 
-static DEFINE_RWLOCK(entries_lock);
 static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;
 
 /*
  * Max length of the register string.  Determined by:
@@ -82,18 +76,37 @@ static int entry_count;
  */
 #define MAX_REGISTER_LENGTH 1920
 
+static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
+{
+	struct binfmt_namespace *b_ns;
+
+	while (ns) {
+		b_ns = READ_ONCE(ns->binfmt_ns);
+		if (b_ns)
+			return b_ns;
+		ns = ns->parent;
+	}
+	/* as the first user namespace is initialized with
+	 * &init_binfmt_ns we should never come here
+	 * but we try to stay safe by logging a warning
+	 * and returning a sane value
+	 */
+	WARN_ON_ONCE(1);
+	return &init_binfmt_ns;
+}
+
 /*
  * Check if we support the binfmt
  * if we do, return the node, else NULL
  * locking is done in load_misc_binary
  */
-static Node *check_file(struct linux_binprm *bprm)
+static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
 {
 	char *p = strrchr(bprm->interp, '.');
 	struct list_head *l;
 
 	/* Walk all the registered handlers. */
-	list_for_each(l, &entries) {
+	list_for_each(l, &ns->entries) {
 		Node *e = list_entry(l, Node, list);
 		char *s;
 		int j;
@@ -135,17 +148,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	struct file *interp_file = NULL;
 	int retval;
 	int fd_binary = -1;
+	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
 
 	retval = -ENOEXEC;
-	if (!enabled)
+	if (!ns->enabled)
 		return retval;
 
 	/* to keep locking time low, we copy the interpreter string */
-	read_lock(&entries_lock);
-	fmt = check_file(bprm);
+	read_lock(&ns->entries_lock);
+	fmt = check_file(ns, bprm);
 	if (fmt)
 		dget(fmt->dentry);
-	read_unlock(&entries_lock);
+	read_unlock(&ns->entries_lock);
 	if (!fmt)
 		return retval;
 
@@ -611,19 +625,19 @@ static void bm_evict_inode(struct inode *inode)
 	kfree(e);
 }
 
-static void kill_node(Node *e)
+static void kill_node(struct binfmt_namespace *ns, Node *e)
 {
 	struct dentry *dentry;
 
-	write_lock(&entries_lock);
+	write_lock(&ns->entries_lock);
 	list_del_init(&e->list);
-	write_unlock(&entries_lock);
+	write_unlock(&ns->entries_lock);
 
 	dentry = e->dentry;
 	drop_nlink(d_inode(dentry));
 	d_drop(dentry);
 	dput(dentry);
-	simple_release_fs(&bm_mnt, &entry_count);
+	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
 }
 
 /* /<entry> */
@@ -653,6 +667,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 	struct dentry *root;
 	Node *e = file_inode(file)->i_private;
 	int res = parse_command(buffer, count);
+	struct binfmt_namespace *ns;
+
+	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
 
 	switch (res) {
 	case 1:
@@ -669,7 +686,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 		inode_lock(d_inode(root));
 
 		if (!list_empty(&e->list))
-			kill_node(e);
+			kill_node(ns, e);
 
 		inode_unlock(d_inode(root));
 		break;
@@ -695,6 +712,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	struct inode *inode;
 	struct super_block *sb = file_inode(file)->i_sb;
 	struct dentry *root = sb->s_root, *dentry;
+	struct binfmt_namespace *ns;
 	int err = 0;
 
 	e = create_entry(buffer, count);
@@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	if (!inode)
 		goto out2;
 
-	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
+	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
+	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
+			    &ns->entry_count);
 	if (err) {
 		iput(inode);
 		inode = NULL;
@@ -727,12 +747,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 
 	if (e->flags & MISC_FMT_OPEN_FILE) {
 		struct file *f;
+		const struct cred *old_cred;
 
+		old_cred = override_creds(file->f_cred);
 		f = open_exec(e->interpreter);
+		revert_creds(old_cred);
 		if (IS_ERR(f)) {
 			err = PTR_ERR(f);
 			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
-			simple_release_fs(&bm_mnt, &entry_count);
+			simple_release_fs(&ns->bm_mnt,
+					  &ns->entry_count);
 			iput(inode);
 			inode = NULL;
 			goto out2;
@@ -745,9 +769,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	inode->i_fop = &bm_entry_operations;
 
 	d_instantiate(dentry, inode);
-	write_lock(&entries_lock);
-	list_add(&e->list, &entries);
-	write_unlock(&entries_lock);
+	write_lock(&ns->entries_lock);
+	list_add(&e->list, &ns->entries);
+	write_unlock(&ns->entries_lock);
 
 	err = 0;
 out2:
@@ -772,7 +796,9 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-	char *s = enabled ? "enabled\n" : "disabled\n";
+	struct binfmt_namespace *ns =
+				binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
+	char *s = ns->enabled ? "enabled\n" : "disabled\n";
 
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
 }
@@ -780,25 +806,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		size_t count, loff_t *ppos)
 {
+	struct binfmt_namespace *ns;
 	int res = parse_command(buffer, count);
 	struct dentry *root;
 
+	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
 	switch (res) {
 	case 1:
 		/* Disable all handlers. */
-		enabled = 0;
+		ns->enabled = 0;
 		break;
 	case 2:
 		/* Enable all handlers. */
-		enabled = 1;
+		ns->enabled = 1;
 		break;
 	case 3:
 		/* Delete all handlers. */
 		root = file_inode(file)->i_sb->s_root;
 		inode_lock(d_inode(root));
 
-		while (!list_empty(&entries))
-			kill_node(list_first_entry(&entries, Node, list));
+		while (!list_empty(&ns->entries))
+			kill_node(ns, list_first_entry(&ns->entries,
+						       Node, list));
 
 		inode_unlock(d_inode(root));
 		break;
@@ -825,24 +854,53 @@ static const struct super_operations s_ops = {
 static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	int err;
+	struct user_namespace *ns = sb->s_user_ns;
 	static const struct tree_descr bm_files[] = {
 		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
 		[3] = {"register", &bm_register_operations, S_IWUSR},
 		/* last one */ {""}
 	};
 
+	/* create a new binfmt namespace
+	 * if we are not in the first user namespace
+	 * but the binfmt namespace is the first one
+	 */
+	if (READ_ONCE(ns->binfmt_ns) == NULL) {
+		struct binfmt_namespace *new_ns;
+
+		new_ns = kmalloc(sizeof(struct binfmt_namespace),
+				 GFP_KERNEL);
+		if (new_ns == NULL)
+			return -ENOMEM;
+		INIT_LIST_HEAD(&new_ns->entries);
+		new_ns->enabled = 1;
+		rwlock_init(&new_ns->entries_lock);
+		new_ns->bm_mnt = NULL;
+		new_ns->entry_count = 0;
+		/* ensure new_ns is completely initialized before sharing it */
+		smp_wmb();
+		WRITE_ONCE(ns->binfmt_ns, new_ns);
+	}
+
 	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
 	if (!err)
 		sb->s_op = &s_ops;
 	return err;
 }
 
+static void bm_free(struct fs_context *fc)
+{
+	if (fc->s_fs_info)
+		put_user_ns(fc->s_fs_info);
+}
+
 static int bm_get_tree(struct fs_context *fc)
 {
-	return get_tree_single(fc, bm_fill_super);
+	return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
 }
 
 static const struct fs_context_operations bm_context_ops = {
+	.free		= bm_free,
 	.get_tree	= bm_get_tree,
 };
 
@@ -861,6 +919,7 @@ static struct file_system_type bm_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "binfmt_misc",
 	.init_fs_context = bm_init_fs_context,
+	.fs_flags	= FS_USERNS_MOUNT,
 	.kill_sb	= kill_litter_super,
 };
 MODULE_ALIAS_FS("binfmt_misc");
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index fb9f4f799554..16e6f3a97a01 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,18 @@ enum ucount_type {
 	UCOUNT_COUNTS,
 };
 
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+struct binfmt_namespace {
+	struct list_head entries;
+	rwlock_t entries_lock;
+	int enabled;
+	struct vfsmount *bm_mnt;
+	int entry_count;
+} __randomize_layout;
+
+extern struct binfmt_namespace init_binfmt_ns;
+#endif
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -86,6 +98,9 @@ struct user_namespace {
 #endif
 	struct ucounts		*ucounts;
 	int ucount_max[UCOUNT_COUNTS];
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+	struct binfmt_namespace *binfmt_ns;
+#endif
 } __randomize_layout;
 
 struct ucounts {
diff --git a/kernel/user.c b/kernel/user.c
index 5235d7f49982..092b2b4d47a6 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -20,6 +20,17 @@
 #include <linux/user_namespace.h>
 #include <linux/proc_ns.h>
 
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+struct binfmt_namespace init_binfmt_ns = {
+	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
+	.enabled = 1,
+	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
+	.bm_mnt = NULL,
+	.entry_count = 0,
+};
+EXPORT_SYMBOL_GPL(init_binfmt_ns);
+#endif
+
 /*
  * userns count is 1 for root user, 1 for init_uts_ns,
  * and 1 for... ?
@@ -67,6 +78,9 @@ struct user_namespace init_user_ns = {
 	.keyring_name_list = LIST_HEAD_INIT(init_user_ns.keyring_name_list),
 	.keyring_sem = __RWSEM_INITIALIZER(init_user_ns.keyring_sem),
 #endif
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+	.binfmt_ns = &init_binfmt_ns,
+#endif
 };
 EXPORT_SYMBOL_GPL(init_user_ns);
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 8eadadc478f9..f42c32269e20 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -191,6 +191,9 @@ static void free_user_ns(struct work_struct *work)
 			kfree(ns->projid_map.forward);
 			kfree(ns->projid_map.reverse);
 		}
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+		kfree(ns->binfmt_ns);
+#endif
 		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 0/1] ns: introduce binfmt_misc namespace
  2019-12-16  9:12 [PATCH v8 0/1] ns: introduce binfmt_misc namespace Laurent Vivier
  2019-12-16  9:12 ` [PATCH v8 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
@ 2019-12-16  9:46 ` Christian Brauner
  2019-12-16  9:53   ` Laurent Vivier
  1 sibling, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2019-12-16  9:46 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: linux-kernel, Greg Kurz, Jann Horn, Andrei Vagin, linux-api,
	Dmitry Safonov, James Bottomley, Jan Kiszka, linux-fsdevel,
	containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater, keescook

On Mon, Dec 16, 2019 at 10:12:19AM +0100, Laurent Vivier wrote:
> v8: s/file->f_path.dentry/file_dentry(file)/
> 
> v7: Use the new mount API
> 
>     Replace
> 
>       static struct dentry *bm_mount(struct file_system_type *fs_type,
>                             int flags, const char *dev_name, void *data)
>       {
>                struct user_namespace *ns = current_user_ns();
> 
>                return mount_ns(fs_type, flags, data, ns, ns,
>                                bm_fill_super);
>       }
> 
>     by
> 
>       static void bm_free(struct fs_context *fc)
>       {
>              if (fc->s_fs_info)
>                      put_user_ns(fc->s_fs_info);
>       }
> 
>       static int bm_get_tree(struct fs_context *fc)
>       {
>               return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>       }
> 
>       static const struct fs_context_operations bm_context_ops = {
>               .free           = bm_free,
>               .get_tree       = bm_get_tree,
>       };
> 
>       static int bm_init_fs_context(struct fs_context *fc)
>       {
>               fc->ops = &bm_context_ops;
>               return 0;
>       }
> 
> v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
>     This should never happen, but to stay safe return a
>     value we can use.
>     change subject from "RFC" to "PATCH"
> 
> v5: Use READ_ONCE()/WRITE_ONCE()
>     move mount pointer struct init to bm_fill_super() and add smp_wmb()
>     remove useless NULL value init
>     add WARN_ON_ONCE()
> 
> v4: first user namespace is initialized with &init_binfmt_ns,
>     all new user namespaces are initialized with a NULL and use
>     the one of the first parent that is not NULL. The pointer
>     is initialized to a valid value the first time the binfmt_misc
>     fs is mounted in the current user namespace.
>     This allows to not change the way it was working before:
>     new ns inherits values from its parent, and if parent value is modified
>     (or parent creates its own binfmt entry by mounting the fs) child
>     inherits it (unless it has itself mounted the fs).
> 
> v3: create a structure to store binfmt_misc data,
>     add a pointer to this structure in the user_namespace structure,
>     in init_user_ns structure this pointer points to an init_binfmt_ns
>     structure. And all new user namespaces point to this init structure.
>     A new binfmt namespace structure is allocated if the binfmt_misc
>     filesystem is mounted in a user namespace that is not the initial
>     one but its binfmt namespace pointer points to the initial one.
>     add override_creds()/revert_creds() around open_exec() in
>     bm_register_write()
> 
> v2: no new namespace, binfmt_misc data are now part of
>     the mount namespace
>     I put this in mount namespace instead of user namespace
>     because the mount namespace is already needed and
>     I don't want to force to have the user namespace for that.
>     As this is a filesystem, it seems logic to have it here.
> 
> This allows to define a new interpreter for each new container.
> 
> But the main goal is to be able to chroot to a directory
> using a binfmt_misc interpreter without being root.
> 
> I have a modified version of unshare at:
> 
>   https://github.com/vivier/util-linux.git branch unshare-chroot
> 
> with some new options to unshare binfmt_misc namespace and to chroot
> to a directory.
> 
> If you have a directory /chroot/powerpc/jessie containing debian for powerpc
> binaries and a qemu-ppc interpreter, you can do for instance:
> 
>  $ uname -a
>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>  $ ./unshare --map-root-user --fork --pid \
>    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
>    --root=/chroot/powerpc/jessie /bin/bash -l
>  # uname -a
>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
>  # id
> uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
>  # ls -l
> total 5940
> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
> drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
> drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
> drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
> drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
> drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
> dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
> -rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
> drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
> drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
> drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
> drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
> drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var
> 
> If you want to use the qemu binary provided by your distro, you can use
> 
>     --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"
> 
> With the 'F' flag, qemu-ppc-static will be then loaded from the main root
> filesystem before switching to the chroot.
> 
> Another example is to use the 'P' flag in one chroot and not in another one (useful in a test
> environment to test different configurations of the same interpreter):
> 
> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-noargv0:OCF" --root=/chroot/powerpc/jessie /bin/bash -l
> root@localhost:/# sh -c 'echo $0'
> /bin/sh
> 
> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-argv0:OCFP" --root=/chroot/powerpc/jessie /bin/bash -l
> root@localhost:/# sh -c 'echo $0'
> sh

Hey Laurent,

We have quite some time before the v5.6 merge window opens. So I would
really like for this new feature to come with proper testing!

I know you've been waiting on this patch quite some time but adding
tests will not endanger v5.6 inclusion - lest we find significant bugs. ;)

Christian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 0/1] ns: introduce binfmt_misc namespace
  2019-12-16  9:46 ` [PATCH v8 0/1] ns: introduce binfmt_misc namespace Christian Brauner
@ 2019-12-16  9:53   ` Laurent Vivier
  2019-12-16 10:06     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Laurent Vivier @ 2019-12-16  9:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Greg Kurz, Jann Horn, Andrei Vagin, linux-api,
	Dmitry Safonov, James Bottomley, Jan Kiszka, linux-fsdevel,
	containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater, keescook

Le 16/12/2019 à 10:46, Christian Brauner a écrit :
> On Mon, Dec 16, 2019 at 10:12:19AM +0100, Laurent Vivier wrote:
>> v8: s/file->f_path.dentry/file_dentry(file)/
>>
>> v7: Use the new mount API
>>
>>     Replace
>>
>>       static struct dentry *bm_mount(struct file_system_type *fs_type,
>>                             int flags, const char *dev_name, void *data)
>>       {
>>                struct user_namespace *ns = current_user_ns();
>>
>>                return mount_ns(fs_type, flags, data, ns, ns,
>>                                bm_fill_super);
>>       }
>>
>>     by
>>
>>       static void bm_free(struct fs_context *fc)
>>       {
>>              if (fc->s_fs_info)
>>                      put_user_ns(fc->s_fs_info);
>>       }
>>
>>       static int bm_get_tree(struct fs_context *fc)
>>       {
>>               return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>>       }
>>
>>       static const struct fs_context_operations bm_context_ops = {
>>               .free           = bm_free,
>>               .get_tree       = bm_get_tree,
>>       };
>>
>>       static int bm_init_fs_context(struct fs_context *fc)
>>       {
>>               fc->ops = &bm_context_ops;
>>               return 0;
>>       }
>>
>> v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
>>     This should never happen, but to stay safe return a
>>     value we can use.
>>     change subject from "RFC" to "PATCH"
>>
>> v5: Use READ_ONCE()/WRITE_ONCE()
>>     move mount pointer struct init to bm_fill_super() and add smp_wmb()
>>     remove useless NULL value init
>>     add WARN_ON_ONCE()
>>
>> v4: first user namespace is initialized with &init_binfmt_ns,
>>     all new user namespaces are initialized with a NULL and use
>>     the one of the first parent that is not NULL. The pointer
>>     is initialized to a valid value the first time the binfmt_misc
>>     fs is mounted in the current user namespace.
>>     This allows to not change the way it was working before:
>>     new ns inherits values from its parent, and if parent value is modified
>>     (or parent creates its own binfmt entry by mounting the fs) child
>>     inherits it (unless it has itself mounted the fs).
>>
>> v3: create a structure to store binfmt_misc data,
>>     add a pointer to this structure in the user_namespace structure,
>>     in init_user_ns structure this pointer points to an init_binfmt_ns
>>     structure. And all new user namespaces point to this init structure.
>>     A new binfmt namespace structure is allocated if the binfmt_misc
>>     filesystem is mounted in a user namespace that is not the initial
>>     one but its binfmt namespace pointer points to the initial one.
>>     add override_creds()/revert_creds() around open_exec() in
>>     bm_register_write()
>>
>> v2: no new namespace, binfmt_misc data are now part of
>>     the mount namespace
>>     I put this in mount namespace instead of user namespace
>>     because the mount namespace is already needed and
>>     I don't want to force to have the user namespace for that.
>>     As this is a filesystem, it seems logic to have it here.
>>
>> This allows to define a new interpreter for each new container.
>>
>> But the main goal is to be able to chroot to a directory
>> using a binfmt_misc interpreter without being root.
>>
>> I have a modified version of unshare at:
>>
>>   https://github.com/vivier/util-linux.git branch unshare-chroot
>>
>> with some new options to unshare binfmt_misc namespace and to chroot
>> to a directory.
>>
>> If you have a directory /chroot/powerpc/jessie containing debian for powerpc
>> binaries and a qemu-ppc interpreter, you can do for instance:
>>
>>  $ uname -a
>>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>  $ ./unshare --map-root-user --fork --pid \
>>    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
>>    --root=/chroot/powerpc/jessie /bin/bash -l
>>  # uname -a
>>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
>>  # id
>> uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
>>  # ls -l
>> total 5940
>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
>> drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
>> drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
>> drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
>> drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
>> drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
>> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
>> dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
>> -rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
>> drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
>> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
>> drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
>> drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
>> drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
>> drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var
>>
>> If you want to use the qemu binary provided by your distro, you can use
>>
>>     --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"
>>
>> With the 'F' flag, qemu-ppc-static will be then loaded from the main root
>> filesystem before switching to the chroot.
>>
>> Another example is to use the 'P' flag in one chroot and not in another one (useful in a test
>> environment to test different configurations of the same interpreter):
>>
>> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-noargv0:OCF" --root=/chroot/powerpc/jessie /bin/bash -l
>> root@localhost:/# sh -c 'echo $0'
>> /bin/sh
>>
>> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-argv0:OCFP" --root=/chroot/powerpc/jessie /bin/bash -l
>> root@localhost:/# sh -c 'echo $0'
>> sh
> 
> Hey Laurent,
> 
> We have quite some time before the v5.6 merge window opens. So I would
> really like for this new feature to come with proper testing!

Are there some already existing tests for binfmt_misc or namespace I can
update to test the new feature?

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 0/1] ns: introduce binfmt_misc namespace
  2019-12-16  9:53   ` Laurent Vivier
@ 2019-12-16 10:06     ` Christian Brauner
  2019-12-16 10:08       ` Laurent Vivier
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2019-12-16 10:06 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: linux-kernel, Greg Kurz, Jann Horn, Andrei Vagin, linux-api,
	Dmitry Safonov, James Bottomley, Jan Kiszka, linux-fsdevel,
	containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater, keescook

On Mon, Dec 16, 2019 at 10:53:28AM +0100, Laurent Vivier wrote:
> Le 16/12/2019 à 10:46, Christian Brauner a écrit :
> > On Mon, Dec 16, 2019 at 10:12:19AM +0100, Laurent Vivier wrote:
> >> v8: s/file->f_path.dentry/file_dentry(file)/
> >>
> >> v7: Use the new mount API
> >>
> >>     Replace
> >>
> >>       static struct dentry *bm_mount(struct file_system_type *fs_type,
> >>                             int flags, const char *dev_name, void *data)
> >>       {
> >>                struct user_namespace *ns = current_user_ns();
> >>
> >>                return mount_ns(fs_type, flags, data, ns, ns,
> >>                                bm_fill_super);
> >>       }
> >>
> >>     by
> >>
> >>       static void bm_free(struct fs_context *fc)
> >>       {
> >>              if (fc->s_fs_info)
> >>                      put_user_ns(fc->s_fs_info);
> >>       }
> >>
> >>       static int bm_get_tree(struct fs_context *fc)
> >>       {
> >>               return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
> >>       }
> >>
> >>       static const struct fs_context_operations bm_context_ops = {
> >>               .free           = bm_free,
> >>               .get_tree       = bm_get_tree,
> >>       };
> >>
> >>       static int bm_init_fs_context(struct fs_context *fc)
> >>       {
> >>               fc->ops = &bm_context_ops;
> >>               return 0;
> >>       }
> >>
> >> v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
> >>     This should never happen, but to stay safe return a
> >>     value we can use.
> >>     change subject from "RFC" to "PATCH"
> >>
> >> v5: Use READ_ONCE()/WRITE_ONCE()
> >>     move mount pointer struct init to bm_fill_super() and add smp_wmb()
> >>     remove useless NULL value init
> >>     add WARN_ON_ONCE()
> >>
> >> v4: first user namespace is initialized with &init_binfmt_ns,
> >>     all new user namespaces are initialized with a NULL and use
> >>     the one of the first parent that is not NULL. The pointer
> >>     is initialized to a valid value the first time the binfmt_misc
> >>     fs is mounted in the current user namespace.
> >>     This allows to not change the way it was working before:
> >>     new ns inherits values from its parent, and if parent value is modified
> >>     (or parent creates its own binfmt entry by mounting the fs) child
> >>     inherits it (unless it has itself mounted the fs).
> >>
> >> v3: create a structure to store binfmt_misc data,
> >>     add a pointer to this structure in the user_namespace structure,
> >>     in init_user_ns structure this pointer points to an init_binfmt_ns
> >>     structure. And all new user namespaces point to this init structure.
> >>     A new binfmt namespace structure is allocated if the binfmt_misc
> >>     filesystem is mounted in a user namespace that is not the initial
> >>     one but its binfmt namespace pointer points to the initial one.
> >>     add override_creds()/revert_creds() around open_exec() in
> >>     bm_register_write()
> >>
> >> v2: no new namespace, binfmt_misc data are now part of
> >>     the mount namespace
> >>     I put this in mount namespace instead of user namespace
> >>     because the mount namespace is already needed and
> >>     I don't want to force to have the user namespace for that.
> >>     As this is a filesystem, it seems logic to have it here.
> >>
> >> This allows to define a new interpreter for each new container.
> >>
> >> But the main goal is to be able to chroot to a directory
> >> using a binfmt_misc interpreter without being root.
> >>
> >> I have a modified version of unshare at:
> >>
> >>   https://github.com/vivier/util-linux.git branch unshare-chroot
> >>
> >> with some new options to unshare binfmt_misc namespace and to chroot
> >> to a directory.
> >>
> >> If you have a directory /chroot/powerpc/jessie containing debian for powerpc
> >> binaries and a qemu-ppc interpreter, you can do for instance:
> >>
> >>  $ uname -a
> >>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
> >>  $ ./unshare --map-root-user --fork --pid \
> >>    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
> >>    --root=/chroot/powerpc/jessie /bin/bash -l
> >>  # uname -a
> >>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
> >>  # id
> >> uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
> >>  # ls -l
> >> total 5940
> >> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
> >> drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
> >> drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
> >> drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
> >> drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
> >> drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
> >> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
> >> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
> >> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
> >> dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
> >> -rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
> >> drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
> >> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
> >> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
> >> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
> >> drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
> >> drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
> >> drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
> >> drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var
> >>
> >> If you want to use the qemu binary provided by your distro, you can use
> >>
> >>     --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"
> >>
> >> With the 'F' flag, qemu-ppc-static will be then loaded from the main root
> >> filesystem before switching to the chroot.
> >>
> >> Another example is to use the 'P' flag in one chroot and not in another one (useful in a test
> >> environment to test different configurations of the same interpreter):
> >>
> >> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-noargv0:OCF" --root=/chroot/powerpc/jessie /bin/bash -l
> >> root@localhost:/# sh -c 'echo $0'
> >> /bin/sh
> >>
> >> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-argv0:OCFP" --root=/chroot/powerpc/jessie /bin/bash -l
> >> root@localhost:/# sh -c 'echo $0'
> >> sh
> > 
> > Hey Laurent,
> > 
> > We have quite some time before the v5.6 merge window opens. So I would
> > really like for this new feature to come with proper testing!
> 
> Are there some already existing tests for binfmt_misc or namespace I can
> update to test the new feature?

I don't think so but there are tests for other namespace-aware
filesystem. For example, I've added basic tests for binderfs in
tools/testing/selftests/filesystems/binderfs/ and there are some devpts
tests in there (Though the devpts tests don't actually make use of the
kselftest framework so they aren't a great example. I'm not claiming
binderfs is either tbh. :))

You can just place the binfmt_misc tests in there. Helpers for setting
up user namespace and mappings are in there as well. I think you can
just place them in a separate file/header and include it for both
binderfs and binfmt_misc.
I'm happy to review this/answer questions.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 0/1] ns: introduce binfmt_misc namespace
  2019-12-16 10:06     ` Christian Brauner
@ 2019-12-16 10:08       ` Laurent Vivier
  0 siblings, 0 replies; 19+ messages in thread
From: Laurent Vivier @ 2019-12-16 10:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Greg Kurz, Jann Horn, Andrei Vagin, linux-api,
	Dmitry Safonov, James Bottomley, Jan Kiszka, linux-fsdevel,
	containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater, keescook

Le 16/12/2019 à 11:06, Christian Brauner a écrit :
> On Mon, Dec 16, 2019 at 10:53:28AM +0100, Laurent Vivier wrote:
>> Le 16/12/2019 à 10:46, Christian Brauner a écrit :
>>> On Mon, Dec 16, 2019 at 10:12:19AM +0100, Laurent Vivier wrote:
>>>> v8: s/file->f_path.dentry/file_dentry(file)/
>>>>
>>>> v7: Use the new mount API
>>>>
>>>>     Replace
>>>>
>>>>       static struct dentry *bm_mount(struct file_system_type *fs_type,
>>>>                             int flags, const char *dev_name, void *data)
>>>>       {
>>>>                struct user_namespace *ns = current_user_ns();
>>>>
>>>>                return mount_ns(fs_type, flags, data, ns, ns,
>>>>                                bm_fill_super);
>>>>       }
>>>>
>>>>     by
>>>>
>>>>       static void bm_free(struct fs_context *fc)
>>>>       {
>>>>              if (fc->s_fs_info)
>>>>                      put_user_ns(fc->s_fs_info);
>>>>       }
>>>>
>>>>       static int bm_get_tree(struct fs_context *fc)
>>>>       {
>>>>               return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>>>>       }
>>>>
>>>>       static const struct fs_context_operations bm_context_ops = {
>>>>               .free           = bm_free,
>>>>               .get_tree       = bm_get_tree,
>>>>       };
>>>>
>>>>       static int bm_init_fs_context(struct fs_context *fc)
>>>>       {
>>>>               fc->ops = &bm_context_ops;
>>>>               return 0;
>>>>       }
>>>>
>>>> v6: Return &init_binfmt_ns instead of NULL in binfmt_ns()
>>>>     This should never happen, but to stay safe return a
>>>>     value we can use.
>>>>     change subject from "RFC" to "PATCH"
>>>>
>>>> v5: Use READ_ONCE()/WRITE_ONCE()
>>>>     move mount pointer struct init to bm_fill_super() and add smp_wmb()
>>>>     remove useless NULL value init
>>>>     add WARN_ON_ONCE()
>>>>
>>>> v4: first user namespace is initialized with &init_binfmt_ns,
>>>>     all new user namespaces are initialized with a NULL and use
>>>>     the one of the first parent that is not NULL. The pointer
>>>>     is initialized to a valid value the first time the binfmt_misc
>>>>     fs is mounted in the current user namespace.
>>>>     This allows to not change the way it was working before:
>>>>     new ns inherits values from its parent, and if parent value is modified
>>>>     (or parent creates its own binfmt entry by mounting the fs) child
>>>>     inherits it (unless it has itself mounted the fs).
>>>>
>>>> v3: create a structure to store binfmt_misc data,
>>>>     add a pointer to this structure in the user_namespace structure,
>>>>     in init_user_ns structure this pointer points to an init_binfmt_ns
>>>>     structure. And all new user namespaces point to this init structure.
>>>>     A new binfmt namespace structure is allocated if the binfmt_misc
>>>>     filesystem is mounted in a user namespace that is not the initial
>>>>     one but its binfmt namespace pointer points to the initial one.
>>>>     add override_creds()/revert_creds() around open_exec() in
>>>>     bm_register_write()
>>>>
>>>> v2: no new namespace, binfmt_misc data are now part of
>>>>     the mount namespace
>>>>     I put this in mount namespace instead of user namespace
>>>>     because the mount namespace is already needed and
>>>>     I don't want to force to have the user namespace for that.
>>>>     As this is a filesystem, it seems logic to have it here.
>>>>
>>>> This allows to define a new interpreter for each new container.
>>>>
>>>> But the main goal is to be able to chroot to a directory
>>>> using a binfmt_misc interpreter without being root.
>>>>
>>>> I have a modified version of unshare at:
>>>>
>>>>   https://github.com/vivier/util-linux.git branch unshare-chroot
>>>>
>>>> with some new options to unshare binfmt_misc namespace and to chroot
>>>> to a directory.
>>>>
>>>> If you have a directory /chroot/powerpc/jessie containing debian for powerpc
>>>> binaries and a qemu-ppc interpreter, you can do for instance:
>>>>
>>>>  $ uname -a
>>>>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
>>>>  $ ./unshare --map-root-user --fork --pid \
>>>>    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
>>>>    --root=/chroot/powerpc/jessie /bin/bash -l
>>>>  # uname -a
>>>>  Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
>>>>  # id
>>>> uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
>>>>  # ls -l
>>>> total 5940
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
>>>> drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
>>>> drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
>>>> drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
>>>> drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
>>>> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
>>>> dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
>>>> -rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
>>>> drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
>>>> drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
>>>> drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
>>>> drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
>>>> drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
>>>> drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var
>>>>
>>>> If you want to use the qemu binary provided by your distro, you can use
>>>>
>>>>     --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"
>>>>
>>>> With the 'F' flag, qemu-ppc-static will be then loaded from the main root
>>>> filesystem before switching to the chroot.
>>>>
>>>> Another example is to use the 'P' flag in one chroot and not in another one (useful in a test
>>>> environment to test different configurations of the same interpreter):
>>>>
>>>> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-noargv0:OCF" --root=/chroot/powerpc/jessie /bin/bash -l
>>>> root@localhost:/# sh -c 'echo $0'
>>>> /bin/sh
>>>>
>>>> ./unshare --fork --pid --mount-proc --map-root-user --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff://usr/bin/qemu-ppc-argv0:OCFP" --root=/chroot/powerpc/jessie /bin/bash -l
>>>> root@localhost:/# sh -c 'echo $0'
>>>> sh
>>>
>>> Hey Laurent,
>>>
>>> We have quite some time before the v5.6 merge window opens. So I would
>>> really like for this new feature to come with proper testing!
>>
>> Are there some already existing tests for binfmt_misc or namespace I can
>> update to test the new feature?
> 
> I don't think so but there are tests for other namespace-aware
> filesystem. For example, I've added basic tests for binderfs in
> tools/testing/selftests/filesystems/binderfs/ and there are some devpts
> tests in there (Though the devpts tests don't actually make use of the
> kselftest framework so they aren't a great example. I'm not claiming
> binderfs is either tbh. :))
> 
> You can just place the binfmt_misc tests in there. Helpers for setting
> up user namespace and mappings are in there as well. I think you can
> just place them in a separate file/header and include it for both
> binderfs and binfmt_misc.
> I'm happy to review this/answer questions.
> 

OK, thank you, I will try to add some tests here.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2019-12-16  9:12 ` [PATCH v8 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
@ 2019-12-16 19:08   ` Jann Horn
  2019-12-16 20:05     ` Laurent Vivier
  2021-01-08  8:22     ` Jan Kiszka
  1 sibling, 1 reply; 19+ messages in thread
From: Jann Horn @ 2019-12-16 19:08 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: kernel list, Greg Kurz, Andrei Vagin, Linux API, Dmitry Safonov,
	James Bottomley, Jan Kiszka, Christian Brauner, linux-fsdevel,
	Linux Containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater

On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier <laurent@vivier.eu> wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
>
> For instance, using "unshare" we can start a chroot of another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.

How do you ensure that when userspace is no longer using the user
namespace and mount namespace, the entries and the binfmt_misc
superblock are deleted? As far as I can tell from looking at the code,
at the moment, if I create a user namespace+mount namespace, mount
binfmt_misc in there, register a file format and then let all
processes inside the namespaces exit, the binfmt_misc mount will be
kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries
will also stay in memory.

[...]
> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>         if (!inode)
>                 goto out2;
>
> -       err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> +       ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
> +       err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
> +                           &ns->entry_count);

When you call simple_pin_fs() here, the user namespace of `current`
and the user namespace of the superblock are not necessarily related.
So simple_pin_fs() may end up taking a reference on the mountpoint for
a user namespace that has nothing to do with the namespace for which
an entry is being created.

[...]
>  static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>  {
>         int err;
> +       struct user_namespace *ns = sb->s_user_ns;
[...]
> +       /* create a new binfmt namespace
> +        * if we are not in the first user namespace
> +        * but the binfmt namespace is the first one
> +        */
> +       if (READ_ONCE(ns->binfmt_ns) == NULL) {

The READ_ONCE() here is unnecessary, right? AFAIK the VFS layer is
going to ensure that bm_fill_super() can't run concurrently for the
same namespace?

> +               struct binfmt_namespace *new_ns;
> +
> +               new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +                                GFP_KERNEL);
> +               if (new_ns == NULL)
> +                       return -ENOMEM;
> +               INIT_LIST_HEAD(&new_ns->entries);
> +               new_ns->enabled = 1;
> +               rwlock_init(&new_ns->entries_lock);
> +               new_ns->bm_mnt = NULL;
> +               new_ns->entry_count = 0;
> +               /* ensure new_ns is completely initialized before sharing it */
> +               smp_wmb();
> +               WRITE_ONCE(ns->binfmt_ns, new_ns);

Nit: This would be a little bit semantically clearer if you used
smp_store_release() instead of smp_wmb()+WRITE_ONCE().

> +       }
> +
>         err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
[...]
> +static void bm_free(struct fs_context *fc)
> +{
> +       if (fc->s_fs_info)
> +               put_user_ns(fc->s_fs_info);
> +}

Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?

> +
>  static int bm_get_tree(struct fs_context *fc)
>  {
> -       return get_tree_single(fc, bm_fill_super);
> +       return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));

get_user_ns() increments the refcount of the namespace, but in the
case where a binfmt_misc mount already exists, that refcount is never
dropped, right? That would be a security bug, since an attacker could
overflow the refcount of the user namespace and then trigger a UAF.
(And the refcount hardening won't catch it because user namespaces
still use raw atomics instead of refcount_t.)

[...]
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)

Nit: Isn't this kind of check normally written as "#ifdef"?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2019-12-16 19:08   ` Jann Horn
@ 2019-12-16 20:05     ` Laurent Vivier
  2019-12-16 22:53       ` Jann Horn
  0 siblings, 1 reply; 19+ messages in thread
From: Laurent Vivier @ 2019-12-16 20:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Greg Kurz, Andrei Vagin, Linux API, Dmitry Safonov,
	James Bottomley, Jan Kiszka, Christian Brauner, linux-fsdevel,
	Linux Containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater

Le 16/12/2019 à 20:08, Jann Horn a écrit :
> On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier <laurent@vivier.eu> wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
> 
> How do you ensure that when userspace is no longer using the user
> namespace and mount namespace, the entries and the binfmt_misc
> superblock are deleted? As far as I can tell from looking at the code,
> at the moment, if I create a user namespace+mount namespace, mount
> binfmt_misc in there, register a file format and then let all
> processes inside the namespaces exit, the binfmt_misc mount will be
> kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries
> will also stay in memory.
> 
> [...]

Do you have any idea how I can fix this issue?

>> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>         if (!inode)
>>                 goto out2;
>>
>> -       err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
>> +       ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>> +       err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
>> +                           &ns->entry_count);
> 
> When you call simple_pin_fs() here, the user namespace of `current`
> and the user namespace of the superblock are not necessarily related.
> So simple_pin_fs() may end up taking a reference on the mountpoint for
> a user namespace that has nothing to do with the namespace for which
> an entry is being created.

Do you have any idea how I can fix this issue?

> 
> [...]
>>  static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>>  {
>>         int err;
>> +       struct user_namespace *ns = sb->s_user_ns;
> [...]
>> +       /* create a new binfmt namespace
>> +        * if we are not in the first user namespace
>> +        * but the binfmt namespace is the first one
>> +        */
>> +       if (READ_ONCE(ns->binfmt_ns) == NULL) {
> 
> The READ_ONCE() here is unnecessary, right? AFAIK the VFS layer is
> going to ensure that bm_fill_super() can't run concurrently for the
> same namespace?

So I understand the "READ_ONCE()" is unnecessary and I will remove it.

> 
>> +               struct binfmt_namespace *new_ns;
>> +
>> +               new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +                                GFP_KERNEL);
>> +               if (new_ns == NULL)
>> +                       return -ENOMEM;
>> +               INIT_LIST_HEAD(&new_ns->entries);
>> +               new_ns->enabled = 1;
>> +               rwlock_init(&new_ns->entries_lock);
>> +               new_ns->bm_mnt = NULL;
>> +               new_ns->entry_count = 0;
>> +               /* ensure new_ns is completely initialized before sharing it */
>> +               smp_wmb();
>> +               WRITE_ONCE(ns->binfmt_ns, new_ns);
> 
> Nit: This would be a little bit semantically clearer if you used
> smp_store_release() instead of smp_wmb()+WRITE_ONCE().

I will.

> 
>> +       }
>> +
>>         err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
> [...]
>> +static void bm_free(struct fs_context *fc)
>> +{
>> +       if (fc->s_fs_info)
>> +               put_user_ns(fc->s_fs_info);
>> +}
> 
> Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?

So I understand the if is unnecessary and I will remove it.

> 
>> +
>>  static int bm_get_tree(struct fs_context *fc)
>>  {
>> -       return get_tree_single(fc, bm_fill_super);
>> +       return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
> 
> get_user_ns() increments the refcount of the namespace, but in the
> case where a binfmt_misc mount already exists, that refcount is never
> dropped, right? That would be a security bug, since an attacker could
> overflow the refcount of the user namespace and then trigger a UAF.
> (And the refcount hardening won't catch it because user namespaces
> still use raw atomics instead of refcount_t.)

Do you have any idea how I can fix this issue?

> [...]
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> 
> Nit: Isn't this kind of check normally written as "#ifdef"?
> 

What is the difference?

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2019-12-16 20:05     ` Laurent Vivier
@ 2019-12-16 22:53       ` Jann Horn
  0 siblings, 0 replies; 19+ messages in thread
From: Jann Horn @ 2019-12-16 22:53 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: kernel list, Greg Kurz, Andrei Vagin, Linux API, Dmitry Safonov,
	James Bottomley, Jan Kiszka, Christian Brauner, linux-fsdevel,
	Linux Containers, Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater

On Mon, Dec 16, 2019 at 9:05 PM Laurent Vivier <laurent@vivier.eu> wrote:
> Le 16/12/2019 à 20:08, Jann Horn a écrit :
> > On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier <laurent@vivier.eu> wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> >
> > How do you ensure that when userspace is no longer using the user
> > namespace and mount namespace, the entries and the binfmt_misc
> > superblock are deleted? As far as I can tell from looking at the code,
> > at the moment, if I create a user namespace+mount namespace, mount
> > binfmt_misc in there, register a file format and then let all
> > processes inside the namespaces exit, the binfmt_misc mount will be
> > kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries
> > will also stay in memory.
> >
> > [...]
>
> Do you have any idea how I can fix this issue?

I think the easiest way (keeping in mind that we want to avoid having
to fiddle around with reference loops, where e.g. interpreter
executable files opened by binfmt_misc have references back to the
user namespace through ->f_cred) would be to add a new patch in front
of this one that changes the semantics such that when binfmt_misc is
unmounted, all the existing format registrations are deleted. That's
probably also nicer from the perspective of inspectability. It could
in theory break stuff, but I think that's probably somewhat unlikely.
Still, it'd be an API change, and therefore you should CC linux-api@
on such a change.

> >> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
> >>         if (!inode)
> >>                 goto out2;
> >>
> >> -       err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> >> +       ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
> >> +       err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
> >> +                           &ns->entry_count);
> >
> > When you call simple_pin_fs() here, the user namespace of `current`
> > and the user namespace of the superblock are not necessarily related.
> > So simple_pin_fs() may end up taking a reference on the mountpoint for
> > a user namespace that has nothing to do with the namespace for which
> > an entry is being created.
>
> Do you have any idea how I can fix this issue?

If you fix the memory leak the way I suggested, this wouldn't be a
problem anymore either.

> >> +static void bm_free(struct fs_context *fc)
> >> +{
> >> +       if (fc->s_fs_info)
> >> +               put_user_ns(fc->s_fs_info);
> >> +}
> >
> > Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL?
>
> So I understand the if is unnecessary and I will remove it.

Your code was actually exactly right, I didn't understand how
fc->s_fs_info works.

> >>  static int bm_get_tree(struct fs_context *fc)
> >>  {
> >> -       return get_tree_single(fc, bm_fill_super);
> >> +       return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
> >
> > get_user_ns() increments the refcount of the namespace, but in the
> > case where a binfmt_misc mount already exists, that refcount is never
> > dropped, right? That would be a security bug, since an attacker could
> > overflow the refcount of the user namespace and then trigger a UAF.
> > (And the refcount hardening won't catch it because user namespaces
> > still use raw atomics instead of refcount_t.)
>
> Do you have any idea how I can fix this issue?

Ah, this was actually fine. I missed that get_tree_keyed() stashes
that pointer in fc->s_fs_info.

> >> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> >
> > Nit: Isn't this kind of check normally written as "#ifdef"?
> >
>
> What is the difference?

As explained in Documentation/process/coding-style.rst and the
relevant header, IS_ENABLED() is for inline use in C expressions.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2019-12-16  9:12 ` [PATCH v8 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
@ 2021-01-08  8:22     ` Jan Kiszka
  2021-01-08  8:22     ` Jan Kiszka
  1 sibling, 0 replies; 19+ messages in thread
From: Jan Kiszka @ 2021-01-08  8:22 UTC (permalink / raw)
  To: Laurent Vivier, linux-kernel
  Cc: Henning Schild, Dmitry Safonov, linux-api, containers, Jann Horn,
	Greg Kurz, James Bottomley, Cédric Le Goater,
	Alexander Viro, linux-fsdevel, Eric Biederman

On 16.12.19 10:12, Laurent Vivier wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
> 
> For instance, using "unshare" we can start a chroot of another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
> 
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> Acked-by: Andrei Vagin <avagin@gmail.com>
> Tested-by: Henning Schild <henning.schild@siemens.com>
> ---
>  fs/binfmt_misc.c               | 115 +++++++++++++++++++++++++--------
>  include/linux/user_namespace.h |  15 +++++
>  kernel/user.c                  |  14 ++++
>  kernel/user_namespace.c        |   3 +
>  4 files changed, 119 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index cdb45829354d..17fa1f56ca2e 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -40,9 +40,6 @@ enum {
>  	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -62,10 +59,7 @@ typedef struct {
>  	struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -82,18 +76,37 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> +	struct binfmt_namespace *b_ns;
> +
> +	while (ns) {
> +		b_ns = READ_ONCE(ns->binfmt_ns);
> +		if (b_ns)
> +			return b_ns;
> +		ns = ns->parent;
> +	}
> +	/* as the first user namespace is initialized with
> +	 * &init_binfmt_ns we should never come here
> +	 * but we try to stay safe by logging a warning
> +	 * and returning a sane value
> +	 */
> +	WARN_ON_ONCE(1);
> +	return &init_binfmt_ns;
> +}
> +
>  /*
>   * Check if we support the binfmt
>   * if we do, return the node, else NULL
>   * locking is done in load_misc_binary
>   */
> -static Node *check_file(struct linux_binprm *bprm)
> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
>  {
>  	char *p = strrchr(bprm->interp, '.');
>  	struct list_head *l;
>  
>  	/* Walk all the registered handlers. */
> -	list_for_each(l, &entries) {
> +	list_for_each(l, &ns->entries) {
>  		Node *e = list_entry(l, Node, list);
>  		char *s;
>  		int j;
> @@ -135,17 +148,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  	struct file *interp_file = NULL;
>  	int retval;
>  	int fd_binary = -1;
> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>  	retval = -ENOEXEC;
> -	if (!enabled)
> +	if (!ns->enabled)
>  		return retval;
>  
>  	/* to keep locking time low, we copy the interpreter string */
> -	read_lock(&entries_lock);
> -	fmt = check_file(bprm);
> +	read_lock(&ns->entries_lock);
> +	fmt = check_file(ns, bprm);
>  	if (fmt)
>  		dget(fmt->dentry);
> -	read_unlock(&entries_lock);
> +	read_unlock(&ns->entries_lock);
>  	if (!fmt)
>  		return retval;
>  
> @@ -611,19 +625,19 @@ static void bm_evict_inode(struct inode *inode)
>  	kfree(e);
>  }
>  
> -static void kill_node(Node *e)
> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>  {
>  	struct dentry *dentry;
>  
> -	write_lock(&entries_lock);
> +	write_lock(&ns->entries_lock);
>  	list_del_init(&e->list);
> -	write_unlock(&entries_lock);
> +	write_unlock(&ns->entries_lock);
>  
>  	dentry = e->dentry;
>  	drop_nlink(d_inode(dentry));
>  	d_drop(dentry);
>  	dput(dentry);
> -	simple_release_fs(&bm_mnt, &entry_count);
> +	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
>  }
>  
>  /* /<entry> */
> @@ -653,6 +667,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  	struct dentry *root;
>  	Node *e = file_inode(file)->i_private;
>  	int res = parse_command(buffer, count);
> +	struct binfmt_namespace *ns;
> +
> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>  
>  	switch (res) {
>  	case 1:
> @@ -669,7 +686,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  		inode_lock(d_inode(root));
>  
>  		if (!list_empty(&e->list))
> -			kill_node(e);
> +			kill_node(ns, e);
>  
>  		inode_unlock(d_inode(root));
>  		break;
> @@ -695,6 +712,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	struct inode *inode;
>  	struct super_block *sb = file_inode(file)->i_sb;
>  	struct dentry *root = sb->s_root, *dentry;
> +	struct binfmt_namespace *ns;
>  	int err = 0;
>  
>  	e = create_entry(buffer, count);
> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	if (!inode)
>  		goto out2;
>  
> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
> +	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
> +			    &ns->entry_count);
>  	if (err) {
>  		iput(inode);
>  		inode = NULL;
> @@ -727,12 +747,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  
>  	if (e->flags & MISC_FMT_OPEN_FILE) {
>  		struct file *f;
> +		const struct cred *old_cred;
>  
> +		old_cred = override_creds(file->f_cred);
>  		f = open_exec(e->interpreter);
> +		revert_creds(old_cred);
>  		if (IS_ERR(f)) {
>  			err = PTR_ERR(f);
>  			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
> -			simple_release_fs(&bm_mnt, &entry_count);
> +			simple_release_fs(&ns->bm_mnt,
> +					  &ns->entry_count);
>  			iput(inode);
>  			inode = NULL;
>  			goto out2;
> @@ -745,9 +769,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	inode->i_fop = &bm_entry_operations;
>  
>  	d_instantiate(dentry, inode);
> -	write_lock(&entries_lock);
> -	list_add(&e->list, &entries);
> -	write_unlock(&entries_lock);
> +	write_lock(&ns->entries_lock);
> +	list_add(&e->list, &ns->entries);
> +	write_unlock(&ns->entries_lock);
>  
>  	err = 0;
>  out2:
> @@ -772,7 +796,9 @@ static const struct file_operations bm_register_operations = {
>  static ssize_t
>  bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  {
> -	char *s = enabled ? "enabled\n" : "disabled\n";
> +	struct binfmt_namespace *ns =
> +				binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
> +	char *s = ns->enabled ? "enabled\n" : "disabled\n";
>  
>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>  }
> @@ -780,25 +806,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>  		size_t count, loff_t *ppos)
>  {
> +	struct binfmt_namespace *ns;
>  	int res = parse_command(buffer, count);
>  	struct dentry *root;
>  
> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>  	switch (res) {
>  	case 1:
>  		/* Disable all handlers. */
> -		enabled = 0;
> +		ns->enabled = 0;
>  		break;
>  	case 2:
>  		/* Enable all handlers. */
> -		enabled = 1;
> +		ns->enabled = 1;
>  		break;
>  	case 3:
>  		/* Delete all handlers. */
>  		root = file_inode(file)->i_sb->s_root;
>  		inode_lock(d_inode(root));
>  
> -		while (!list_empty(&entries))
> -			kill_node(list_first_entry(&entries, Node, list));
> +		while (!list_empty(&ns->entries))
> +			kill_node(ns, list_first_entry(&ns->entries,
> +						       Node, list));
>  
>  		inode_unlock(d_inode(root));
>  		break;
> @@ -825,24 +854,53 @@ static const struct super_operations s_ops = {
>  static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>  {
>  	int err;
> +	struct user_namespace *ns = sb->s_user_ns;
>  	static const struct tree_descr bm_files[] = {
>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>  		/* last one */ {""}
>  	};
>  
> +	/* create a new binfmt namespace
> +	 * if we are not in the first user namespace
> +	 * but the binfmt namespace is the first one
> +	 */
> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
> +		struct binfmt_namespace *new_ns;
> +
> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +				 GFP_KERNEL);
> +		if (new_ns == NULL)
> +			return -ENOMEM;
> +		INIT_LIST_HEAD(&new_ns->entries);
> +		new_ns->enabled = 1;
> +		rwlock_init(&new_ns->entries_lock);
> +		new_ns->bm_mnt = NULL;
> +		new_ns->entry_count = 0;
> +		/* ensure new_ns is completely initialized before sharing it */
> +		smp_wmb();
> +		WRITE_ONCE(ns->binfmt_ns, new_ns);
> +	}
> +
>  	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
>  	if (!err)
>  		sb->s_op = &s_ops;
>  	return err;
>  }
>  
> +static void bm_free(struct fs_context *fc)
> +{
> +	if (fc->s_fs_info)
> +		put_user_ns(fc->s_fs_info);
> +}
> +
>  static int bm_get_tree(struct fs_context *fc)
>  {
> -	return get_tree_single(fc, bm_fill_super);
> +	return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>  }
>  
>  static const struct fs_context_operations bm_context_ops = {
> +	.free		= bm_free,
>  	.get_tree	= bm_get_tree,
>  };
>  
> @@ -861,6 +919,7 @@ static struct file_system_type bm_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "binfmt_misc",
>  	.init_fs_context = bm_init_fs_context,
> +	.fs_flags	= FS_USERNS_MOUNT,
>  	.kill_sb	= kill_litter_super,
>  };
>  MODULE_ALIAS_FS("binfmt_misc");
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index fb9f4f799554..16e6f3a97a01 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -52,6 +52,18 @@ enum ucount_type {
>  	UCOUNT_COUNTS,
>  };
>  
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +struct binfmt_namespace {
> +	struct list_head entries;
> +	rwlock_t entries_lock;
> +	int enabled;
> +	struct vfsmount *bm_mnt;
> +	int entry_count;
> +} __randomize_layout;
> +
> +extern struct binfmt_namespace init_binfmt_ns;
> +#endif
> +
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> @@ -86,6 +98,9 @@ struct user_namespace {
>  #endif
>  	struct ucounts		*ucounts;
>  	int ucount_max[UCOUNT_COUNTS];
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +	struct binfmt_namespace *binfmt_ns;
> +#endif
>  } __randomize_layout;
>  
>  struct ucounts {
> diff --git a/kernel/user.c b/kernel/user.c
> index 5235d7f49982..092b2b4d47a6 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -20,6 +20,17 @@
>  #include <linux/user_namespace.h>
>  #include <linux/proc_ns.h>
>  
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +struct binfmt_namespace init_binfmt_ns = {
> +	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
> +	.enabled = 1,
> +	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
> +	.bm_mnt = NULL,
> +	.entry_count = 0,
> +};
> +EXPORT_SYMBOL_GPL(init_binfmt_ns);
> +#endif
> +
>  /*
>   * userns count is 1 for root user, 1 for init_uts_ns,
>   * and 1 for... ?
> @@ -67,6 +78,9 @@ struct user_namespace init_user_ns = {
>  	.keyring_name_list = LIST_HEAD_INIT(init_user_ns.keyring_name_list),
>  	.keyring_sem = __RWSEM_INITIALIZER(init_user_ns.keyring_sem),
>  #endif
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +	.binfmt_ns = &init_binfmt_ns,
> +#endif
>  };
>  EXPORT_SYMBOL_GPL(init_user_ns);
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 8eadadc478f9..f42c32269e20 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -191,6 +191,9 @@ static void free_user_ns(struct work_struct *work)
>  			kfree(ns->projid_map.forward);
>  			kfree(ns->projid_map.reverse);
>  		}
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +		kfree(ns->binfmt_ns);
> +#endif
>  		retire_userns_sysctls(ns);
>  		key_free_user_ns(ns);
>  		ns_free_inum(&ns->ns);
> 

What happened with this proposal since then?

As there is quite some delay between the feature finally hitting
upstream and a random distro-based containter host providing it to
unprivileged users, the longer we wait, the longer the pain persists
(e.g. when building cross-arch containers or when dealing with different
qemu-user-static versions...). Is there anything we can do to help with it?

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
@ 2021-01-08  8:22     ` Jan Kiszka
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kiszka @ 2021-01-08  8:22 UTC (permalink / raw)
  To: Laurent Vivier, linux-kernel
  Cc: Greg Kurz, Jann Horn, Andrei Vagin, linux-api, Dmitry Safonov,
	James Bottomley, Christian Brauner, linux-fsdevel, containers,
	Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater

On 16.12.19 10:12, Laurent Vivier wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
> 
> For instance, using "unshare" we can start a chroot of another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
> 
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> Acked-by: Andrei Vagin <avagin@gmail.com>
> Tested-by: Henning Schild <henning.schild@siemens.com>
> ---
>  fs/binfmt_misc.c               | 115 +++++++++++++++++++++++++--------
>  include/linux/user_namespace.h |  15 +++++
>  kernel/user.c                  |  14 ++++
>  kernel/user_namespace.c        |   3 +
>  4 files changed, 119 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index cdb45829354d..17fa1f56ca2e 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -40,9 +40,6 @@ enum {
>  	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -62,10 +59,7 @@ typedef struct {
>  	struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -82,18 +76,37 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> +	struct binfmt_namespace *b_ns;
> +
> +	while (ns) {
> +		b_ns = READ_ONCE(ns->binfmt_ns);
> +		if (b_ns)
> +			return b_ns;
> +		ns = ns->parent;
> +	}
> +	/* as the first user namespace is initialized with
> +	 * &init_binfmt_ns we should never come here
> +	 * but we try to stay safe by logging a warning
> +	 * and returning a sane value
> +	 */
> +	WARN_ON_ONCE(1);
> +	return &init_binfmt_ns;
> +}
> +
>  /*
>   * Check if we support the binfmt
>   * if we do, return the node, else NULL
>   * locking is done in load_misc_binary
>   */
> -static Node *check_file(struct linux_binprm *bprm)
> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
>  {
>  	char *p = strrchr(bprm->interp, '.');
>  	struct list_head *l;
>  
>  	/* Walk all the registered handlers. */
> -	list_for_each(l, &entries) {
> +	list_for_each(l, &ns->entries) {
>  		Node *e = list_entry(l, Node, list);
>  		char *s;
>  		int j;
> @@ -135,17 +148,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  	struct file *interp_file = NULL;
>  	int retval;
>  	int fd_binary = -1;
> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>  	retval = -ENOEXEC;
> -	if (!enabled)
> +	if (!ns->enabled)
>  		return retval;
>  
>  	/* to keep locking time low, we copy the interpreter string */
> -	read_lock(&entries_lock);
> -	fmt = check_file(bprm);
> +	read_lock(&ns->entries_lock);
> +	fmt = check_file(ns, bprm);
>  	if (fmt)
>  		dget(fmt->dentry);
> -	read_unlock(&entries_lock);
> +	read_unlock(&ns->entries_lock);
>  	if (!fmt)
>  		return retval;
>  
> @@ -611,19 +625,19 @@ static void bm_evict_inode(struct inode *inode)
>  	kfree(e);
>  }
>  
> -static void kill_node(Node *e)
> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>  {
>  	struct dentry *dentry;
>  
> -	write_lock(&entries_lock);
> +	write_lock(&ns->entries_lock);
>  	list_del_init(&e->list);
> -	write_unlock(&entries_lock);
> +	write_unlock(&ns->entries_lock);
>  
>  	dentry = e->dentry;
>  	drop_nlink(d_inode(dentry));
>  	d_drop(dentry);
>  	dput(dentry);
> -	simple_release_fs(&bm_mnt, &entry_count);
> +	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
>  }
>  
>  /* /<entry> */
> @@ -653,6 +667,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  	struct dentry *root;
>  	Node *e = file_inode(file)->i_private;
>  	int res = parse_command(buffer, count);
> +	struct binfmt_namespace *ns;
> +
> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>  
>  	switch (res) {
>  	case 1:
> @@ -669,7 +686,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  		inode_lock(d_inode(root));
>  
>  		if (!list_empty(&e->list))
> -			kill_node(e);
> +			kill_node(ns, e);
>  
>  		inode_unlock(d_inode(root));
>  		break;
> @@ -695,6 +712,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	struct inode *inode;
>  	struct super_block *sb = file_inode(file)->i_sb;
>  	struct dentry *root = sb->s_root, *dentry;
> +	struct binfmt_namespace *ns;
>  	int err = 0;
>  
>  	e = create_entry(buffer, count);
> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	if (!inode)
>  		goto out2;
>  
> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
> +	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
> +			    &ns->entry_count);
>  	if (err) {
>  		iput(inode);
>  		inode = NULL;
> @@ -727,12 +747,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  
>  	if (e->flags & MISC_FMT_OPEN_FILE) {
>  		struct file *f;
> +		const struct cred *old_cred;
>  
> +		old_cred = override_creds(file->f_cred);
>  		f = open_exec(e->interpreter);
> +		revert_creds(old_cred);
>  		if (IS_ERR(f)) {
>  			err = PTR_ERR(f);
>  			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
> -			simple_release_fs(&bm_mnt, &entry_count);
> +			simple_release_fs(&ns->bm_mnt,
> +					  &ns->entry_count);
>  			iput(inode);
>  			inode = NULL;
>  			goto out2;
> @@ -745,9 +769,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	inode->i_fop = &bm_entry_operations;
>  
>  	d_instantiate(dentry, inode);
> -	write_lock(&entries_lock);
> -	list_add(&e->list, &entries);
> -	write_unlock(&entries_lock);
> +	write_lock(&ns->entries_lock);
> +	list_add(&e->list, &ns->entries);
> +	write_unlock(&ns->entries_lock);
>  
>  	err = 0;
>  out2:
> @@ -772,7 +796,9 @@ static const struct file_operations bm_register_operations = {
>  static ssize_t
>  bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  {
> -	char *s = enabled ? "enabled\n" : "disabled\n";
> +	struct binfmt_namespace *ns =
> +				binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
> +	char *s = ns->enabled ? "enabled\n" : "disabled\n";
>  
>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>  }
> @@ -780,25 +806,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>  		size_t count, loff_t *ppos)
>  {
> +	struct binfmt_namespace *ns;
>  	int res = parse_command(buffer, count);
>  	struct dentry *root;
>  
> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>  	switch (res) {
>  	case 1:
>  		/* Disable all handlers. */
> -		enabled = 0;
> +		ns->enabled = 0;
>  		break;
>  	case 2:
>  		/* Enable all handlers. */
> -		enabled = 1;
> +		ns->enabled = 1;
>  		break;
>  	case 3:
>  		/* Delete all handlers. */
>  		root = file_inode(file)->i_sb->s_root;
>  		inode_lock(d_inode(root));
>  
> -		while (!list_empty(&entries))
> -			kill_node(list_first_entry(&entries, Node, list));
> +		while (!list_empty(&ns->entries))
> +			kill_node(ns, list_first_entry(&ns->entries,
> +						       Node, list));
>  
>  		inode_unlock(d_inode(root));
>  		break;
> @@ -825,24 +854,53 @@ static const struct super_operations s_ops = {
>  static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>  {
>  	int err;
> +	struct user_namespace *ns = sb->s_user_ns;
>  	static const struct tree_descr bm_files[] = {
>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>  		/* last one */ {""}
>  	};
>  
> +	/* create a new binfmt namespace
> +	 * if we are not in the first user namespace
> +	 * but the binfmt namespace is the first one
> +	 */
> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
> +		struct binfmt_namespace *new_ns;
> +
> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +				 GFP_KERNEL);
> +		if (new_ns == NULL)
> +			return -ENOMEM;
> +		INIT_LIST_HEAD(&new_ns->entries);
> +		new_ns->enabled = 1;
> +		rwlock_init(&new_ns->entries_lock);
> +		new_ns->bm_mnt = NULL;
> +		new_ns->entry_count = 0;
> +		/* ensure new_ns is completely initialized before sharing it */
> +		smp_wmb();
> +		WRITE_ONCE(ns->binfmt_ns, new_ns);
> +	}
> +
>  	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
>  	if (!err)
>  		sb->s_op = &s_ops;
>  	return err;
>  }
>  
> +static void bm_free(struct fs_context *fc)
> +{
> +	if (fc->s_fs_info)
> +		put_user_ns(fc->s_fs_info);
> +}
> +
>  static int bm_get_tree(struct fs_context *fc)
>  {
> -	return get_tree_single(fc, bm_fill_super);
> +	return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>  }
>  
>  static const struct fs_context_operations bm_context_ops = {
> +	.free		= bm_free,
>  	.get_tree	= bm_get_tree,
>  };
>  
> @@ -861,6 +919,7 @@ static struct file_system_type bm_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "binfmt_misc",
>  	.init_fs_context = bm_init_fs_context,
> +	.fs_flags	= FS_USERNS_MOUNT,
>  	.kill_sb	= kill_litter_super,
>  };
>  MODULE_ALIAS_FS("binfmt_misc");
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index fb9f4f799554..16e6f3a97a01 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -52,6 +52,18 @@ enum ucount_type {
>  	UCOUNT_COUNTS,
>  };
>  
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +struct binfmt_namespace {
> +	struct list_head entries;
> +	rwlock_t entries_lock;
> +	int enabled;
> +	struct vfsmount *bm_mnt;
> +	int entry_count;
> +} __randomize_layout;
> +
> +extern struct binfmt_namespace init_binfmt_ns;
> +#endif
> +
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> @@ -86,6 +98,9 @@ struct user_namespace {
>  #endif
>  	struct ucounts		*ucounts;
>  	int ucount_max[UCOUNT_COUNTS];
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +	struct binfmt_namespace *binfmt_ns;
> +#endif
>  } __randomize_layout;
>  
>  struct ucounts {
> diff --git a/kernel/user.c b/kernel/user.c
> index 5235d7f49982..092b2b4d47a6 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -20,6 +20,17 @@
>  #include <linux/user_namespace.h>
>  #include <linux/proc_ns.h>
>  
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +struct binfmt_namespace init_binfmt_ns = {
> +	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
> +	.enabled = 1,
> +	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
> +	.bm_mnt = NULL,
> +	.entry_count = 0,
> +};
> +EXPORT_SYMBOL_GPL(init_binfmt_ns);
> +#endif
> +
>  /*
>   * userns count is 1 for root user, 1 for init_uts_ns,
>   * and 1 for... ?
> @@ -67,6 +78,9 @@ struct user_namespace init_user_ns = {
>  	.keyring_name_list = LIST_HEAD_INIT(init_user_ns.keyring_name_list),
>  	.keyring_sem = __RWSEM_INITIALIZER(init_user_ns.keyring_sem),
>  #endif
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +	.binfmt_ns = &init_binfmt_ns,
> +#endif
>  };
>  EXPORT_SYMBOL_GPL(init_user_ns);
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 8eadadc478f9..f42c32269e20 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -191,6 +191,9 @@ static void free_user_ns(struct work_struct *work)
>  			kfree(ns->projid_map.forward);
>  			kfree(ns->projid_map.reverse);
>  		}
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +		kfree(ns->binfmt_ns);
> +#endif
>  		retire_userns_sysctls(ns);
>  		key_free_user_ns(ns);
>  		ns_free_inum(&ns->ns);
> 

What happened with this proposal since then?

As there is quite some delay between the feature finally hitting
upstream and a random distro-based containter host providing it to
unprivileged users, the longer we wait, the longer the pain persists
(e.g. when building cross-arch containers or when dealing with different
qemu-user-static versions...). Is there anything we can do to help with it?

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2021-01-08  8:22     ` Jan Kiszka
@ 2021-01-18 19:51       ` Laurent Vivier
  -1 siblings, 0 replies; 19+ messages in thread
From: Laurent Vivier @ 2021-01-18 19:51 UTC (permalink / raw)
  To: Jan Kiszka, linux-kernel
  Cc: Henning Schild, Dmitry Safonov, linux-api, containers, Jann Horn,
	Greg Kurz, James Bottomley, Cédric Le Goater,
	Alexander Viro, linux-fsdevel, Eric Biederman

Le 08/01/2021 à 09:22, Jan Kiszka a écrit :
> On 16.12.19 10:12, Laurent Vivier wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
>>
>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>> Acked-by: Andrei Vagin <avagin@gmail.com>
>> Tested-by: Henning Schild <henning.schild@siemens.com>
>> ---
>>  fs/binfmt_misc.c               | 115 +++++++++++++++++++++++++--------
>>  include/linux/user_namespace.h |  15 +++++
>>  kernel/user.c                  |  14 ++++
>>  kernel/user_namespace.c        |   3 +
>>  4 files changed, 119 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>> index cdb45829354d..17fa1f56ca2e 100644
>> --- a/fs/binfmt_misc.c
>> +++ b/fs/binfmt_misc.c
>> @@ -40,9 +40,6 @@ enum {
>>  	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>>  };
>>  
>> -static LIST_HEAD(entries);
>> -static int enabled = 1;
>> -
>>  enum {Enabled, Magic};
>>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>>  #define MISC_FMT_OPEN_BINARY (1 << 30)
>> @@ -62,10 +59,7 @@ typedef struct {
>>  	struct file *interp_file;
>>  } Node;
>>  
>> -static DEFINE_RWLOCK(entries_lock);
>>  static struct file_system_type bm_fs_type;
>> -static struct vfsmount *bm_mnt;
>> -static int entry_count;
>>  
>>  /*
>>   * Max length of the register string.  Determined by:
>> @@ -82,18 +76,37 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +	struct binfmt_namespace *b_ns;
>> +
>> +	while (ns) {
>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>> +		if (b_ns)
>> +			return b_ns;
>> +		ns = ns->parent;
>> +	}
>> +	/* as the first user namespace is initialized with
>> +	 * &init_binfmt_ns we should never come here
>> +	 * but we try to stay safe by logging a warning
>> +	 * and returning a sane value
>> +	 */
>> +	WARN_ON_ONCE(1);
>> +	return &init_binfmt_ns;
>> +}
>> +
>>  /*
>>   * Check if we support the binfmt
>>   * if we do, return the node, else NULL
>>   * locking is done in load_misc_binary
>>   */
>> -static Node *check_file(struct linux_binprm *bprm)
>> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
>>  {
>>  	char *p = strrchr(bprm->interp, '.');
>>  	struct list_head *l;
>>  
>>  	/* Walk all the registered handlers. */
>> -	list_for_each(l, &entries) {
>> +	list_for_each(l, &ns->entries) {
>>  		Node *e = list_entry(l, Node, list);
>>  		char *s;
>>  		int j;
>> @@ -135,17 +148,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>  	struct file *interp_file = NULL;
>>  	int retval;
>>  	int fd_binary = -1;
>> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>  
>>  	retval = -ENOEXEC;
>> -	if (!enabled)
>> +	if (!ns->enabled)
>>  		return retval;
>>  
>>  	/* to keep locking time low, we copy the interpreter string */
>> -	read_lock(&entries_lock);
>> -	fmt = check_file(bprm);
>> +	read_lock(&ns->entries_lock);
>> +	fmt = check_file(ns, bprm);
>>  	if (fmt)
>>  		dget(fmt->dentry);
>> -	read_unlock(&entries_lock);
>> +	read_unlock(&ns->entries_lock);
>>  	if (!fmt)
>>  		return retval;
>>  
>> @@ -611,19 +625,19 @@ static void bm_evict_inode(struct inode *inode)
>>  	kfree(e);
>>  }
>>  
>> -static void kill_node(Node *e)
>> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>>  {
>>  	struct dentry *dentry;
>>  
>> -	write_lock(&entries_lock);
>> +	write_lock(&ns->entries_lock);
>>  	list_del_init(&e->list);
>> -	write_unlock(&entries_lock);
>> +	write_unlock(&ns->entries_lock);
>>  
>>  	dentry = e->dentry;
>>  	drop_nlink(d_inode(dentry));
>>  	d_drop(dentry);
>>  	dput(dentry);
>> -	simple_release_fs(&bm_mnt, &entry_count);
>> +	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
>>  }
>>  
>>  /* /<entry> */
>> @@ -653,6 +667,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>>  	struct dentry *root;
>>  	Node *e = file_inode(file)->i_private;
>>  	int res = parse_command(buffer, count);
>> +	struct binfmt_namespace *ns;
>> +
>> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>>  
>>  	switch (res) {
>>  	case 1:
>> @@ -669,7 +686,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>>  		inode_lock(d_inode(root));
>>  
>>  		if (!list_empty(&e->list))
>> -			kill_node(e);
>> +			kill_node(ns, e);
>>  
>>  		inode_unlock(d_inode(root));
>>  		break;
>> @@ -695,6 +712,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  	struct inode *inode;
>>  	struct super_block *sb = file_inode(file)->i_sb;
>>  	struct dentry *root = sb->s_root, *dentry;
>> +	struct binfmt_namespace *ns;
>>  	int err = 0;
>>  
>>  	e = create_entry(buffer, count);
>> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  	if (!inode)
>>  		goto out2;
>>  
>> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
>> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>> +	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
>> +			    &ns->entry_count);
>>  	if (err) {
>>  		iput(inode);
>>  		inode = NULL;
>> @@ -727,12 +747,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  
>>  	if (e->flags & MISC_FMT_OPEN_FILE) {
>>  		struct file *f;
>> +		const struct cred *old_cred;
>>  
>> +		old_cred = override_creds(file->f_cred);
>>  		f = open_exec(e->interpreter);
>> +		revert_creds(old_cred);
>>  		if (IS_ERR(f)) {
>>  			err = PTR_ERR(f);
>>  			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
>> -			simple_release_fs(&bm_mnt, &entry_count);
>> +			simple_release_fs(&ns->bm_mnt,
>> +					  &ns->entry_count);
>>  			iput(inode);
>>  			inode = NULL;
>>  			goto out2;
>> @@ -745,9 +769,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  	inode->i_fop = &bm_entry_operations;
>>  
>>  	d_instantiate(dentry, inode);
>> -	write_lock(&entries_lock);
>> -	list_add(&e->list, &entries);
>> -	write_unlock(&entries_lock);
>> +	write_lock(&ns->entries_lock);
>> +	list_add(&e->list, &ns->entries);
>> +	write_unlock(&ns->entries_lock);
>>  
>>  	err = 0;
>>  out2:
>> @@ -772,7 +796,9 @@ static const struct file_operations bm_register_operations = {
>>  static ssize_t
>>  bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>>  {
>> -	char *s = enabled ? "enabled\n" : "disabled\n";
>> +	struct binfmt_namespace *ns =
>> +				binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>> +	char *s = ns->enabled ? "enabled\n" : "disabled\n";
>>  
>>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>>  }
>> @@ -780,25 +806,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>>  static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>>  		size_t count, loff_t *ppos)
>>  {
>> +	struct binfmt_namespace *ns;
>>  	int res = parse_command(buffer, count);
>>  	struct dentry *root;
>>  
>> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>>  	switch (res) {
>>  	case 1:
>>  		/* Disable all handlers. */
>> -		enabled = 0;
>> +		ns->enabled = 0;
>>  		break;
>>  	case 2:
>>  		/* Enable all handlers. */
>> -		enabled = 1;
>> +		ns->enabled = 1;
>>  		break;
>>  	case 3:
>>  		/* Delete all handlers. */
>>  		root = file_inode(file)->i_sb->s_root;
>>  		inode_lock(d_inode(root));
>>  
>> -		while (!list_empty(&entries))
>> -			kill_node(list_first_entry(&entries, Node, list));
>> +		while (!list_empty(&ns->entries))
>> +			kill_node(ns, list_first_entry(&ns->entries,
>> +						       Node, list));
>>  
>>  		inode_unlock(d_inode(root));
>>  		break;
>> @@ -825,24 +854,53 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>>  {
>>  	int err;
>> +	struct user_namespace *ns = sb->s_user_ns;
>>  	static const struct tree_descr bm_files[] = {
>>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>>  		/* last one */ {""}
>>  	};
>>  
>> +	/* create a new binfmt namespace
>> +	 * if we are not in the first user namespace
>> +	 * but the binfmt namespace is the first one
>> +	 */
>> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +		struct binfmt_namespace *new_ns;
>> +
>> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +				 GFP_KERNEL);
>> +		if (new_ns == NULL)
>> +			return -ENOMEM;
>> +		INIT_LIST_HEAD(&new_ns->entries);
>> +		new_ns->enabled = 1;
>> +		rwlock_init(&new_ns->entries_lock);
>> +		new_ns->bm_mnt = NULL;
>> +		new_ns->entry_count = 0;
>> +		/* ensure new_ns is completely initialized before sharing it */
>> +		smp_wmb();
>> +		WRITE_ONCE(ns->binfmt_ns, new_ns);
>> +	}
>> +
>>  	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
>>  	if (!err)
>>  		sb->s_op = &s_ops;
>>  	return err;
>>  }
>>  
>> +static void bm_free(struct fs_context *fc)
>> +{
>> +	if (fc->s_fs_info)
>> +		put_user_ns(fc->s_fs_info);
>> +}
>> +
>>  static int bm_get_tree(struct fs_context *fc)
>>  {
>> -	return get_tree_single(fc, bm_fill_super);
>> +	return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>>  }
>>  
>>  static const struct fs_context_operations bm_context_ops = {
>> +	.free		= bm_free,
>>  	.get_tree	= bm_get_tree,
>>  };
>>  
>> @@ -861,6 +919,7 @@ static struct file_system_type bm_fs_type = {
>>  	.owner		= THIS_MODULE,
>>  	.name		= "binfmt_misc",
>>  	.init_fs_context = bm_init_fs_context,
>> +	.fs_flags	= FS_USERNS_MOUNT,
>>  	.kill_sb	= kill_litter_super,
>>  };
>>  MODULE_ALIAS_FS("binfmt_misc");
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index fb9f4f799554..16e6f3a97a01 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -52,6 +52,18 @@ enum ucount_type {
>>  	UCOUNT_COUNTS,
>>  };
>>  
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +struct binfmt_namespace {
>> +	struct list_head entries;
>> +	rwlock_t entries_lock;
>> +	int enabled;
>> +	struct vfsmount *bm_mnt;
>> +	int entry_count;
>> +} __randomize_layout;
>> +
>> +extern struct binfmt_namespace init_binfmt_ns;
>> +#endif
>> +
>>  struct user_namespace {
>>  	struct uid_gid_map	uid_map;
>>  	struct uid_gid_map	gid_map;
>> @@ -86,6 +98,9 @@ struct user_namespace {
>>  #endif
>>  	struct ucounts		*ucounts;
>>  	int ucount_max[UCOUNT_COUNTS];
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +	struct binfmt_namespace *binfmt_ns;
>> +#endif
>>  } __randomize_layout;
>>  
>>  struct ucounts {
>> diff --git a/kernel/user.c b/kernel/user.c
>> index 5235d7f49982..092b2b4d47a6 100644
>> --- a/kernel/user.c
>> +++ b/kernel/user.c
>> @@ -20,6 +20,17 @@
>>  #include <linux/user_namespace.h>
>>  #include <linux/proc_ns.h>
>>  
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +struct binfmt_namespace init_binfmt_ns = {
>> +	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
>> +	.enabled = 1,
>> +	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
>> +	.bm_mnt = NULL,
>> +	.entry_count = 0,
>> +};
>> +EXPORT_SYMBOL_GPL(init_binfmt_ns);
>> +#endif
>> +
>>  /*
>>   * userns count is 1 for root user, 1 for init_uts_ns,
>>   * and 1 for... ?
>> @@ -67,6 +78,9 @@ struct user_namespace init_user_ns = {
>>  	.keyring_name_list = LIST_HEAD_INIT(init_user_ns.keyring_name_list),
>>  	.keyring_sem = __RWSEM_INITIALIZER(init_user_ns.keyring_sem),
>>  #endif
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +	.binfmt_ns = &init_binfmt_ns,
>> +#endif
>>  };
>>  EXPORT_SYMBOL_GPL(init_user_ns);
>>  
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 8eadadc478f9..f42c32269e20 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -191,6 +191,9 @@ static void free_user_ns(struct work_struct *work)
>>  			kfree(ns->projid_map.forward);
>>  			kfree(ns->projid_map.reverse);
>>  		}
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +		kfree(ns->binfmt_ns);
>> +#endif
>>  		retire_userns_sysctls(ns);
>>  		key_free_user_ns(ns);
>>  		ns_free_inum(&ns->ns);
>>
> 
> What happened with this proposal since then?
> 
> As there is quite some delay between the feature finally hitting
> upstream and a random distro-based containter host providing it to
> unprivileged users, the longer we wait, the longer the pain persists
> (e.g. when building cross-arch containers or when dealing with different
> qemu-user-static versions...). Is there anything we can do to help with it?

Hi Jan,

I'm sorry, but I have no time to work on this for the moment.

So, if someone else wants to address problems of this current version I will be happy (I can help a
little).

Thanks,
Laurent
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
@ 2021-01-18 19:51       ` Laurent Vivier
  0 siblings, 0 replies; 19+ messages in thread
From: Laurent Vivier @ 2021-01-18 19:51 UTC (permalink / raw)
  To: Jan Kiszka, linux-kernel
  Cc: Greg Kurz, Jann Horn, Andrei Vagin, linux-api, Dmitry Safonov,
	James Bottomley, Christian Brauner, linux-fsdevel, containers,
	Alexander Viro, Eric Biederman, Henning Schild,
	Cédric Le Goater

Le 08/01/2021 à 09:22, Jan Kiszka a écrit :
> On 16.12.19 10:12, Laurent Vivier wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
>>
>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>> Acked-by: Andrei Vagin <avagin@gmail.com>
>> Tested-by: Henning Schild <henning.schild@siemens.com>
>> ---
>>  fs/binfmt_misc.c               | 115 +++++++++++++++++++++++++--------
>>  include/linux/user_namespace.h |  15 +++++
>>  kernel/user.c                  |  14 ++++
>>  kernel/user_namespace.c        |   3 +
>>  4 files changed, 119 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>> index cdb45829354d..17fa1f56ca2e 100644
>> --- a/fs/binfmt_misc.c
>> +++ b/fs/binfmt_misc.c
>> @@ -40,9 +40,6 @@ enum {
>>  	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>>  };
>>  
>> -static LIST_HEAD(entries);
>> -static int enabled = 1;
>> -
>>  enum {Enabled, Magic};
>>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>>  #define MISC_FMT_OPEN_BINARY (1 << 30)
>> @@ -62,10 +59,7 @@ typedef struct {
>>  	struct file *interp_file;
>>  } Node;
>>  
>> -static DEFINE_RWLOCK(entries_lock);
>>  static struct file_system_type bm_fs_type;
>> -static struct vfsmount *bm_mnt;
>> -static int entry_count;
>>  
>>  /*
>>   * Max length of the register string.  Determined by:
>> @@ -82,18 +76,37 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +	struct binfmt_namespace *b_ns;
>> +
>> +	while (ns) {
>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>> +		if (b_ns)
>> +			return b_ns;
>> +		ns = ns->parent;
>> +	}
>> +	/* as the first user namespace is initialized with
>> +	 * &init_binfmt_ns we should never come here
>> +	 * but we try to stay safe by logging a warning
>> +	 * and returning a sane value
>> +	 */
>> +	WARN_ON_ONCE(1);
>> +	return &init_binfmt_ns;
>> +}
>> +
>>  /*
>>   * Check if we support the binfmt
>>   * if we do, return the node, else NULL
>>   * locking is done in load_misc_binary
>>   */
>> -static Node *check_file(struct linux_binprm *bprm)
>> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
>>  {
>>  	char *p = strrchr(bprm->interp, '.');
>>  	struct list_head *l;
>>  
>>  	/* Walk all the registered handlers. */
>> -	list_for_each(l, &entries) {
>> +	list_for_each(l, &ns->entries) {
>>  		Node *e = list_entry(l, Node, list);
>>  		char *s;
>>  		int j;
>> @@ -135,17 +148,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>  	struct file *interp_file = NULL;
>>  	int retval;
>>  	int fd_binary = -1;
>> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>  
>>  	retval = -ENOEXEC;
>> -	if (!enabled)
>> +	if (!ns->enabled)
>>  		return retval;
>>  
>>  	/* to keep locking time low, we copy the interpreter string */
>> -	read_lock(&entries_lock);
>> -	fmt = check_file(bprm);
>> +	read_lock(&ns->entries_lock);
>> +	fmt = check_file(ns, bprm);
>>  	if (fmt)
>>  		dget(fmt->dentry);
>> -	read_unlock(&entries_lock);
>> +	read_unlock(&ns->entries_lock);
>>  	if (!fmt)
>>  		return retval;
>>  
>> @@ -611,19 +625,19 @@ static void bm_evict_inode(struct inode *inode)
>>  	kfree(e);
>>  }
>>  
>> -static void kill_node(Node *e)
>> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>>  {
>>  	struct dentry *dentry;
>>  
>> -	write_lock(&entries_lock);
>> +	write_lock(&ns->entries_lock);
>>  	list_del_init(&e->list);
>> -	write_unlock(&entries_lock);
>> +	write_unlock(&ns->entries_lock);
>>  
>>  	dentry = e->dentry;
>>  	drop_nlink(d_inode(dentry));
>>  	d_drop(dentry);
>>  	dput(dentry);
>> -	simple_release_fs(&bm_mnt, &entry_count);
>> +	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
>>  }
>>  
>>  /* /<entry> */
>> @@ -653,6 +667,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>>  	struct dentry *root;
>>  	Node *e = file_inode(file)->i_private;
>>  	int res = parse_command(buffer, count);
>> +	struct binfmt_namespace *ns;
>> +
>> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>>  
>>  	switch (res) {
>>  	case 1:
>> @@ -669,7 +686,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>>  		inode_lock(d_inode(root));
>>  
>>  		if (!list_empty(&e->list))
>> -			kill_node(e);
>> +			kill_node(ns, e);
>>  
>>  		inode_unlock(d_inode(root));
>>  		break;
>> @@ -695,6 +712,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  	struct inode *inode;
>>  	struct super_block *sb = file_inode(file)->i_sb;
>>  	struct dentry *root = sb->s_root, *dentry;
>> +	struct binfmt_namespace *ns;
>>  	int err = 0;
>>  
>>  	e = create_entry(buffer, count);
>> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  	if (!inode)
>>  		goto out2;
>>  
>> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
>> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>> +	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
>> +			    &ns->entry_count);
>>  	if (err) {
>>  		iput(inode);
>>  		inode = NULL;
>> @@ -727,12 +747,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  
>>  	if (e->flags & MISC_FMT_OPEN_FILE) {
>>  		struct file *f;
>> +		const struct cred *old_cred;
>>  
>> +		old_cred = override_creds(file->f_cred);
>>  		f = open_exec(e->interpreter);
>> +		revert_creds(old_cred);
>>  		if (IS_ERR(f)) {
>>  			err = PTR_ERR(f);
>>  			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
>> -			simple_release_fs(&bm_mnt, &entry_count);
>> +			simple_release_fs(&ns->bm_mnt,
>> +					  &ns->entry_count);
>>  			iput(inode);
>>  			inode = NULL;
>>  			goto out2;
>> @@ -745,9 +769,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>>  	inode->i_fop = &bm_entry_operations;
>>  
>>  	d_instantiate(dentry, inode);
>> -	write_lock(&entries_lock);
>> -	list_add(&e->list, &entries);
>> -	write_unlock(&entries_lock);
>> +	write_lock(&ns->entries_lock);
>> +	list_add(&e->list, &ns->entries);
>> +	write_unlock(&ns->entries_lock);
>>  
>>  	err = 0;
>>  out2:
>> @@ -772,7 +796,9 @@ static const struct file_operations bm_register_operations = {
>>  static ssize_t
>>  bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>>  {
>> -	char *s = enabled ? "enabled\n" : "disabled\n";
>> +	struct binfmt_namespace *ns =
>> +				binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>> +	char *s = ns->enabled ? "enabled\n" : "disabled\n";
>>  
>>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>>  }
>> @@ -780,25 +806,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>>  static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>>  		size_t count, loff_t *ppos)
>>  {
>> +	struct binfmt_namespace *ns;
>>  	int res = parse_command(buffer, count);
>>  	struct dentry *root;
>>  
>> +	ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns);
>>  	switch (res) {
>>  	case 1:
>>  		/* Disable all handlers. */
>> -		enabled = 0;
>> +		ns->enabled = 0;
>>  		break;
>>  	case 2:
>>  		/* Enable all handlers. */
>> -		enabled = 1;
>> +		ns->enabled = 1;
>>  		break;
>>  	case 3:
>>  		/* Delete all handlers. */
>>  		root = file_inode(file)->i_sb->s_root;
>>  		inode_lock(d_inode(root));
>>  
>> -		while (!list_empty(&entries))
>> -			kill_node(list_first_entry(&entries, Node, list));
>> +		while (!list_empty(&ns->entries))
>> +			kill_node(ns, list_first_entry(&ns->entries,
>> +						       Node, list));
>>  
>>  		inode_unlock(d_inode(root));
>>  		break;
>> @@ -825,24 +854,53 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, struct fs_context *fc)
>>  {
>>  	int err;
>> +	struct user_namespace *ns = sb->s_user_ns;
>>  	static const struct tree_descr bm_files[] = {
>>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>>  		/* last one */ {""}
>>  	};
>>  
>> +	/* create a new binfmt namespace
>> +	 * if we are not in the first user namespace
>> +	 * but the binfmt namespace is the first one
>> +	 */
>> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +		struct binfmt_namespace *new_ns;
>> +
>> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +				 GFP_KERNEL);
>> +		if (new_ns == NULL)
>> +			return -ENOMEM;
>> +		INIT_LIST_HEAD(&new_ns->entries);
>> +		new_ns->enabled = 1;
>> +		rwlock_init(&new_ns->entries_lock);
>> +		new_ns->bm_mnt = NULL;
>> +		new_ns->entry_count = 0;
>> +		/* ensure new_ns is completely initialized before sharing it */
>> +		smp_wmb();
>> +		WRITE_ONCE(ns->binfmt_ns, new_ns);
>> +	}
>> +
>>  	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
>>  	if (!err)
>>  		sb->s_op = &s_ops;
>>  	return err;
>>  }
>>  
>> +static void bm_free(struct fs_context *fc)
>> +{
>> +	if (fc->s_fs_info)
>> +		put_user_ns(fc->s_fs_info);
>> +}
>> +
>>  static int bm_get_tree(struct fs_context *fc)
>>  {
>> -	return get_tree_single(fc, bm_fill_super);
>> +	return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns));
>>  }
>>  
>>  static const struct fs_context_operations bm_context_ops = {
>> +	.free		= bm_free,
>>  	.get_tree	= bm_get_tree,
>>  };
>>  
>> @@ -861,6 +919,7 @@ static struct file_system_type bm_fs_type = {
>>  	.owner		= THIS_MODULE,
>>  	.name		= "binfmt_misc",
>>  	.init_fs_context = bm_init_fs_context,
>> +	.fs_flags	= FS_USERNS_MOUNT,
>>  	.kill_sb	= kill_litter_super,
>>  };
>>  MODULE_ALIAS_FS("binfmt_misc");
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index fb9f4f799554..16e6f3a97a01 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -52,6 +52,18 @@ enum ucount_type {
>>  	UCOUNT_COUNTS,
>>  };
>>  
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +struct binfmt_namespace {
>> +	struct list_head entries;
>> +	rwlock_t entries_lock;
>> +	int enabled;
>> +	struct vfsmount *bm_mnt;
>> +	int entry_count;
>> +} __randomize_layout;
>> +
>> +extern struct binfmt_namespace init_binfmt_ns;
>> +#endif
>> +
>>  struct user_namespace {
>>  	struct uid_gid_map	uid_map;
>>  	struct uid_gid_map	gid_map;
>> @@ -86,6 +98,9 @@ struct user_namespace {
>>  #endif
>>  	struct ucounts		*ucounts;
>>  	int ucount_max[UCOUNT_COUNTS];
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +	struct binfmt_namespace *binfmt_ns;
>> +#endif
>>  } __randomize_layout;
>>  
>>  struct ucounts {
>> diff --git a/kernel/user.c b/kernel/user.c
>> index 5235d7f49982..092b2b4d47a6 100644
>> --- a/kernel/user.c
>> +++ b/kernel/user.c
>> @@ -20,6 +20,17 @@
>>  #include <linux/user_namespace.h>
>>  #include <linux/proc_ns.h>
>>  
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +struct binfmt_namespace init_binfmt_ns = {
>> +	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
>> +	.enabled = 1,
>> +	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
>> +	.bm_mnt = NULL,
>> +	.entry_count = 0,
>> +};
>> +EXPORT_SYMBOL_GPL(init_binfmt_ns);
>> +#endif
>> +
>>  /*
>>   * userns count is 1 for root user, 1 for init_uts_ns,
>>   * and 1 for... ?
>> @@ -67,6 +78,9 @@ struct user_namespace init_user_ns = {
>>  	.keyring_name_list = LIST_HEAD_INIT(init_user_ns.keyring_name_list),
>>  	.keyring_sem = __RWSEM_INITIALIZER(init_user_ns.keyring_sem),
>>  #endif
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +	.binfmt_ns = &init_binfmt_ns,
>> +#endif
>>  };
>>  EXPORT_SYMBOL_GPL(init_user_ns);
>>  
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 8eadadc478f9..f42c32269e20 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -191,6 +191,9 @@ static void free_user_ns(struct work_struct *work)
>>  			kfree(ns->projid_map.forward);
>>  			kfree(ns->projid_map.reverse);
>>  		}
>> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
>> +		kfree(ns->binfmt_ns);
>> +#endif
>>  		retire_userns_sysctls(ns);
>>  		key_free_user_ns(ns);
>>  		ns_free_inum(&ns->ns);
>>
> 
> What happened with this proposal since then?
> 
> As there is quite some delay between the feature finally hitting
> upstream and a random distro-based containter host providing it to
> unprivileged users, the longer we wait, the longer the pain persists
> (e.g. when building cross-arch containers or when dealing with different
> qemu-user-static versions...). Is there anything we can do to help with it?

Hi Jan,

I'm sorry, but I have no time to work on this for the moment.

So, if someone else wants to address problems of this current version I will be happy (I can help a
little).

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2021-01-18 19:51       ` Laurent Vivier
  (?)
@ 2023-06-30  8:38       ` Norbert Lange
  2023-06-30  8:52         ` Laurent Vivier
  -1 siblings, 1 reply; 19+ messages in thread
From: Norbert Lange @ 2023-06-30  8:38 UTC (permalink / raw)
  To: Laurent Vivier, linux-kernel
  Cc: linux-api, linux-fsdevel, containers, jan.kiszka, jannh, avagin,
	dima, James.Bottomley, christian.brauner

Any news on this? What remains to be done, who needs to be harrassed?

Regards, Norbert

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2023-06-30  8:38       ` Norbert Lange
@ 2023-06-30  8:52         ` Laurent Vivier
  2023-06-30  9:06           ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Laurent Vivier @ 2023-06-30  8:52 UTC (permalink / raw)
  To: Norbert Lange, linux-kernel, Christian Brauner
  Cc: linux-api, linux-fsdevel, containers, jan.kiszka, jannh, avagin,
	dima, James.Bottomley

Hi Norbert,

Le 30/06/2023 à 10:38, Norbert Lange a écrit :
> Any news on this? What remains to be done, who needs to be harrassed?
> 
> Regards, Norbert

Christian was working on a new version but there is no update for 1 year.

[PATCH v2 1/2] binfmt_misc: cleanup on filesystem umount
https://lkml.org/lkml/2021/12/16/406
[PATCH v2 2/2] binfmt_misc: enable sandboxed mounts
https://lkml.org/lkml/2021/12/16/407

And personally I don't have the time to work on this.

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2023-06-30  8:52         ` Laurent Vivier
@ 2023-06-30  9:06           ` Christian Brauner
  2023-07-12 19:40             ` Kees Cook
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2023-06-30  9:06 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: Norbert Lange, linux-kernel, linux-api, linux-fsdevel,
	containers, jan.kiszka, jannh, avagin, dima, James.Bottomley

On Fri, Jun 30, 2023 at 10:52:22AM +0200, Laurent Vivier wrote:
> Hi Norbert,
> 
> Le 30/06/2023 à 10:38, Norbert Lange a écrit :
> > Any news on this? What remains to be done, who needs to be harrassed?
> > 
> > Regards, Norbert
> 
> Christian was working on a new version but there is no update for 1 year.
> 
> [PATCH v2 1/2] binfmt_misc: cleanup on filesystem umount
> https://lkml.org/lkml/2021/12/16/406
> [PATCH v2 2/2] binfmt_misc: enable sandboxed mounts
> https://lkml.org/lkml/2021/12/16/407
> 
> And personally I don't have the time to work on this.

I've actually rebased this a few weeks ago:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.binfmt_misc
It has Acks, it's done. The only thing back then was Kees had wanted to
take this but never did. I'll ping him.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2023-06-30  9:06           ` Christian Brauner
@ 2023-07-12 19:40             ` Kees Cook
  2023-09-06 10:28               ` Norbert Lange
  0 siblings, 1 reply; 19+ messages in thread
From: Kees Cook @ 2023-07-12 19:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Laurent Vivier, Norbert Lange, linux-kernel, linux-api,
	linux-fsdevel, containers, jan.kiszka, jannh, avagin, dima,
	James.Bottomley

On Fri, Jun 30, 2023 at 11:06:59AM +0200, Christian Brauner wrote:
> On Fri, Jun 30, 2023 at 10:52:22AM +0200, Laurent Vivier wrote:
> > Hi Norbert,
> > 
> > Le 30/06/2023 à 10:38, Norbert Lange a écrit :
> > > Any news on this? What remains to be done, who needs to be harrassed?
> > > 
> > > Regards, Norbert
> > 
> > Christian was working on a new version but there is no update for 1 year.
> > 
> > [PATCH v2 1/2] binfmt_misc: cleanup on filesystem umount
> > https://lkml.org/lkml/2021/12/16/406
> > [PATCH v2 2/2] binfmt_misc: enable sandboxed mounts
> > https://lkml.org/lkml/2021/12/16/407
> > 
> > And personally I don't have the time to work on this.
> 
> I've actually rebased this a few weeks ago:
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.binfmt_misc
> It has Acks, it's done. The only thing back then was Kees had wanted to
> take this but never did. I'll ping him.

Hi! Can you resend this now that the merge window is closed? I looked at
it in your tree and it seems okay. I remain a bit nervous about exposing
it to unpriv access, but I'd like to give it a try. It'd be very useful!

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2023-07-12 19:40             ` Kees Cook
@ 2023-09-06 10:28               ` Norbert Lange
  2023-10-11  0:36                 ` Kees Cook
  0 siblings, 1 reply; 19+ messages in thread
From: Norbert Lange @ 2023-09-06 10:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christian Brauner, Laurent Vivier, linux-kernel, linux-api,
	linux-fsdevel, containers, jan.kiszka, jannh, avagin, dima,
	James.Bottomley

Am Mi., 12. Juli 2023 um 21:40 Uhr schrieb Kees Cook <keescook@chromium.org>:
>
> On Fri, Jun 30, 2023 at 11:06:59AM +0200, Christian Brauner wrote:
> > On Fri, Jun 30, 2023 at 10:52:22AM +0200, Laurent Vivier wrote:
> > > Hi Norbert,
> > >
> > > Le 30/06/2023 à 10:38, Norbert Lange a écrit :
> > > > Any news on this? What remains to be done, who needs to be harrassed?
> > > >
> > > > Regards, Norbert
> > >
> > > Christian was working on a new version but there is no update for 1 year.
> > >
> > > [PATCH v2 1/2] binfmt_misc: cleanup on filesystem umount
> > > https://lkml.org/lkml/2021/12/16/406
> > > [PATCH v2 2/2] binfmt_misc: enable sandboxed mounts
> > > https://lkml.org/lkml/2021/12/16/407
> > >
> > > And personally I don't have the time to work on this.
> >
> > I've actually rebased this a few weeks ago:
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.binfmt_misc
> > It has Acks, it's done. The only thing back then was Kees had wanted to
> > take this but never did. I'll ping him.
>
> Hi! Can you resend this now that the merge window is closed? I looked at
> it in your tree and it seems okay. I remain a bit nervous about exposing
> it to unpriv access, but I'd like to give it a try. It'd be very useful!
>
> -Kees
>
> --
> Kees Cook

Hate to be that guy, but did anything move closer towards upstream
since that post?

Norbert

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v8 1/1] ns: add binfmt_misc to the user namespace
  2023-09-06 10:28               ` Norbert Lange
@ 2023-10-11  0:36                 ` Kees Cook
  0 siblings, 0 replies; 19+ messages in thread
From: Kees Cook @ 2023-10-11  0:36 UTC (permalink / raw)
  To: Norbert Lange
  Cc: Christian Brauner, Laurent Vivier, linux-kernel, linux-api,
	linux-fsdevel, containers, jan.kiszka, jannh, avagin, dima,
	James.Bottomley

On Wed, Sep 06, 2023 at 12:28:27PM +0200, Norbert Lange wrote:
> Am Mi., 12. Juli 2023 um 21:40 Uhr schrieb Kees Cook <keescook@chromium.org>:
> >
> > On Fri, Jun 30, 2023 at 11:06:59AM +0200, Christian Brauner wrote:
> > > On Fri, Jun 30, 2023 at 10:52:22AM +0200, Laurent Vivier wrote:
> > > > Hi Norbert,
> > > >
> > > > Le 30/06/2023 à 10:38, Norbert Lange a écrit :
> > > > > Any news on this? What remains to be done, who needs to be harrassed?
> > > > >
> > > > > Regards, Norbert
> > > >
> > > > Christian was working on a new version but there is no update for 1 year.
> > > >
> > > > [PATCH v2 1/2] binfmt_misc: cleanup on filesystem umount
> > > > https://lkml.org/lkml/2021/12/16/406
> > > > [PATCH v2 2/2] binfmt_misc: enable sandboxed mounts
> > > > https://lkml.org/lkml/2021/12/16/407
> > > >
> > > > And personally I don't have the time to work on this.
> > >
> > > I've actually rebased this a few weeks ago:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.binfmt_misc
> > > It has Acks, it's done. The only thing back then was Kees had wanted to
> > > take this but never did. I'll ping him.
> >
> > Hi! Can you resend this now that the merge window is closed? I looked at
> > it in your tree and it seems okay. I remain a bit nervous about exposing
> > it to unpriv access, but I'd like to give it a try. It'd be very useful!
> >
> > -Kees
> >
> > --
> > Kees Cook
> 
> Hate to be that guy, but did anything move closer towards upstream
> since that post?

No rebase was needed -- I've dropped this in -next now. Let's see how it
goes!

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-10-11  0:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16  9:12 [PATCH v8 0/1] ns: introduce binfmt_misc namespace Laurent Vivier
2019-12-16  9:12 ` [PATCH v8 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
2019-12-16 19:08   ` Jann Horn
2019-12-16 20:05     ` Laurent Vivier
2019-12-16 22:53       ` Jann Horn
2021-01-08  8:22   ` Jan Kiszka
2021-01-08  8:22     ` Jan Kiszka
2021-01-18 19:51     ` Laurent Vivier
2021-01-18 19:51       ` Laurent Vivier
2023-06-30  8:38       ` Norbert Lange
2023-06-30  8:52         ` Laurent Vivier
2023-06-30  9:06           ` Christian Brauner
2023-07-12 19:40             ` Kees Cook
2023-09-06 10:28               ` Norbert Lange
2023-10-11  0:36                 ` Kees Cook
2019-12-16  9:46 ` [PATCH v8 0/1] ns: introduce binfmt_misc namespace Christian Brauner
2019-12-16  9:53   ` Laurent Vivier
2019-12-16 10:06     ` Christian Brauner
2019-12-16 10:08       ` Laurent Vivier

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.