linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC v5 0/1] ns: introduce binfmt_misc namespace
@ 2018-10-09 10:37 Laurent Vivier
  2018-10-09 10:37 ` [RFC v5 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
  0 siblings, 1 reply; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 10:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric Biederman, Dmitry Safonov, linux-api, James Bottomley,
	Alexander Viro, linux-fsdevel, Andrei Vagin, containers,
	Jann Horn, Laurent Vivier

v5: Use READ_ONCE()/WRITE_ONCE()
    move mount pointer struct init to bm_fill_super() and add smp_wmb()
    remove useless NULL value init
    add WARN_ON_ONCE()

v4: first user namespace is initialized with &init_binfmt_ns,
    all new user namespaces are initialized with a NULL and use
    the one of the first parent that is not NULL. The pointer
    is initialized to a valid value the first time the binfmt_misc
    fs is mounted in the current user namespace.
    This allows to not change the way it was working before:
    new ns inherits values from its parent, and if parent value is modified
    (or parent creates its own binfmt entry by mounting the fs) child
    inherits it (unless it has itself mounted the fs).

v3: create a structure to store binfmt_misc data,
    add a pointer to this structure in the user_namespace structure,
    in init_user_ns structure this pointer points to an init_binfmt_ns
    structure. And all new user namespaces point to this init structure.
    A new binfmt namespace structure is allocated if the binfmt_misc
    filesystem is mounted in a user namespace that is not the initial
    one but its binfmt namespace pointer points to the initial one.
    add override_creds()/revert_creds() around open_exec() in
    bm_register_write()

v2: no new namespace, binfmt_misc data are now part of
    the mount namespace
    I put this in mount namespace instead of user namespace
    because the mount namespace is already needed and
    I don't want to force to have the user namespace for that.
    As this is a filesystem, it seems logic to have it here.

This allows to define a new interpreter for each new container.

But the main goal is to be able to chroot to a directory
using a binfmt_misc interpreter without being root.

I have a modified version of unshare at:

  git@github.com:vivier/util-linux.git branch unshare-chroot

with some new options to unshare binfmt_misc namespace and to chroot
to a directory.

If you have a directory /chroot/powerpc/jessie containing debian for powerpc
binaries and a qemu-ppc interpreter, you can do for instance:

 $ uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
 $ ./unshare --map-root-user --fork --pid \
   --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
   --root=/chroot/powerpc/jessie /bin/bash -l
 # uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux
 # id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
 # ls -l
total 5940
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
-rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var

If you want to use the qemu binary provided by your distro, you can use

    --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"

With the 'F' flag, qemu-ppc-static will be then loaded from the main root
filesystem before switching to the chroot.

Laurent Vivier (1):
  ns: add binfmt_misc to the user namespace

 fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
 include/linux/user_namespace.h |  13 ++++
 kernel/user.c                  |  13 ++++
 kernel/user_namespace.c        |   3 +
 4 files changed, 107 insertions(+), 28 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 10:37 [RFC v5 0/1] ns: introduce binfmt_misc namespace Laurent Vivier
@ 2018-10-09 10:37 ` Laurent Vivier
  2018-10-09 12:43   ` Jann Horn
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 10:37 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric Biederman, Dmitry Safonov, linux-api, James Bottomley,
	Alexander Viro, linux-fsdevel, Andrei Vagin, containers,
	Jann Horn, Laurent Vivier

This patch allows to have a different binfmt_misc configuration
for each new user namespace. By default, the binfmt_misc configuration
is the one of the previous level, but if the binfmt_misc filesystem is
mounted in the new namespace a new empty binfmt instance is created and
used in this namespace.

For instance, using "unshare" we can start a chroot of an another
architecture and configure the binfmt_misc interpreter without being root
to run the binaries in this chroot.

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
---
 fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
 include/linux/user_namespace.h |  13 ++++
 kernel/user.c                  |  13 ++++
 kernel/user_namespace.c        |   3 +
 4 files changed, 107 insertions(+), 28 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index aa4a7a23ff99..1e0029d097d9 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -38,9 +38,6 @@ enum {
 	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
 };
 
-static LIST_HEAD(entries);
-static int enabled = 1;
-
 enum {Enabled, Magic};
 #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
 #define MISC_FMT_OPEN_BINARY (1 << 30)
@@ -60,10 +57,7 @@ typedef struct {
 	struct file *interp_file;
 } Node;
 
-static DEFINE_RWLOCK(entries_lock);
 static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;
 
 /*
  * Max length of the register string.  Determined by:
@@ -80,18 +74,32 @@ static int entry_count;
  */
 #define MAX_REGISTER_LENGTH 1920
 
+static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
+{
+	struct binfmt_namespace *b_ns;
+
+	while (ns) {
+		b_ns = READ_ONCE(ns->binfmt_ns);
+		if (b_ns)
+			return b_ns;
+		ns = ns->parent;
+	}
+	WARN_ON_ONCE(1);
+	return NULL;
+}
+
 /*
  * Check if we support the binfmt
  * if we do, return the node, else NULL
  * locking is done in load_misc_binary
  */
-static Node *check_file(struct linux_binprm *bprm)
+static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
 {
 	char *p = strrchr(bprm->interp, '.');
 	struct list_head *l;
 
 	/* Walk all the registered handlers. */
-	list_for_each(l, &entries) {
+	list_for_each(l, &ns->entries) {
 		Node *e = list_entry(l, Node, list);
 		char *s;
 		int j;
@@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	struct file *interp_file = NULL;
 	int retval;
 	int fd_binary = -1;
+	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
 
 	retval = -ENOEXEC;
-	if (!enabled)
+	if (!ns->enabled)
 		return retval;
 
 	/* to keep locking time low, we copy the interpreter string */
-	read_lock(&entries_lock);
-	fmt = check_file(bprm);
+	read_lock(&ns->entries_lock);
+	fmt = check_file(ns, bprm);
 	if (fmt)
 		dget(fmt->dentry);
-	read_unlock(&entries_lock);
+	read_unlock(&ns->entries_lock);
 	if (!fmt)
 		return retval;
 
@@ -609,19 +618,19 @@ static void bm_evict_inode(struct inode *inode)
 	kfree(e);
 }
 
-static void kill_node(Node *e)
+static void kill_node(struct binfmt_namespace *ns, Node *e)
 {
 	struct dentry *dentry;
 
-	write_lock(&entries_lock);
+	write_lock(&ns->entries_lock);
 	list_del_init(&e->list);
-	write_unlock(&entries_lock);
+	write_unlock(&ns->entries_lock);
 
 	dentry = e->dentry;
 	drop_nlink(d_inode(dentry));
 	d_drop(dentry);
 	dput(dentry);
-	simple_release_fs(&bm_mnt, &entry_count);
+	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
 }
 
 /* /<entry> */
@@ -651,6 +660,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 	struct dentry *root;
 	Node *e = file_inode(file)->i_private;
 	int res = parse_command(buffer, count);
+	struct binfmt_namespace *ns;
+
+	ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
 
 	switch (res) {
 	case 1:
@@ -667,7 +679,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
 		inode_lock(d_inode(root));
 
 		if (!list_empty(&e->list))
-			kill_node(e);
+			kill_node(ns, e);
 
 		inode_unlock(d_inode(root));
 		break;
@@ -693,6 +705,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	struct inode *inode;
 	struct super_block *sb = file_inode(file)->i_sb;
 	struct dentry *root = sb->s_root, *dentry;
+	struct binfmt_namespace *ns;
 	int err = 0;
 
 	e = create_entry(buffer, count);
@@ -716,7 +729,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	if (!inode)
 		goto out2;
 
-	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
+	ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
+	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
+			    &ns->entry_count);
 	if (err) {
 		iput(inode);
 		inode = NULL;
@@ -725,12 +740,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 
 	if (e->flags & MISC_FMT_OPEN_FILE) {
 		struct file *f;
+		const struct cred *old_cred;
 
+		old_cred = override_creds(file->f_cred);
 		f = open_exec(e->interpreter);
+		revert_creds(old_cred);
 		if (IS_ERR(f)) {
 			err = PTR_ERR(f);
 			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
-			simple_release_fs(&bm_mnt, &entry_count);
+			simple_release_fs(&ns->bm_mnt,
+					  &ns->entry_count);
 			iput(inode);
 			inode = NULL;
 			goto out2;
@@ -743,9 +762,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	inode->i_fop = &bm_entry_operations;
 
 	d_instantiate(dentry, inode);
-	write_lock(&entries_lock);
-	list_add(&e->list, &entries);
-	write_unlock(&entries_lock);
+	write_lock(&ns->entries_lock);
+	list_add(&e->list, &ns->entries);
+	write_unlock(&ns->entries_lock);
 
 	err = 0;
 out2:
@@ -770,7 +789,9 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-	char *s = enabled ? "enabled\n" : "disabled\n";
+	struct binfmt_namespace *ns =
+				binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
+	char *s = ns->enabled ? "enabled\n" : "disabled\n";
 
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
 }
@@ -778,25 +799,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 		size_t count, loff_t *ppos)
 {
+	struct binfmt_namespace *ns;
 	int res = parse_command(buffer, count);
 	struct dentry *root;
 
+	ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
 	switch (res) {
 	case 1:
 		/* Disable all handlers. */
-		enabled = 0;
+		ns->enabled = 0;
 		break;
 	case 2:
 		/* Enable all handlers. */
-		enabled = 1;
+		ns->enabled = 1;
 		break;
 	case 3:
 		/* Delete all handlers. */
 		root = file_inode(file)->i_sb->s_root;
 		inode_lock(d_inode(root));
 
-		while (!list_empty(&entries))
-			kill_node(list_first_entry(&entries, Node, list));
+		while (!list_empty(&ns->entries))
+			kill_node(ns, list_first_entry(&ns->entries,
+						       Node, list));
 
 		inode_unlock(d_inode(root));
 		break;
@@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
 static int bm_fill_super(struct super_block *sb, void *data, int silent)
 {
 	int err;
+	struct user_namespace *ns = sb->s_user_ns;
 	static const struct tree_descr bm_files[] = {
 		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
 		[3] = {"register", &bm_register_operations, S_IWUSR},
 		/* last one */ {""}
 	};
 
+	/* create a new binfmt namespace
+	 * if we are not in the first user namespace
+	 * but the binfmt namespace is the first one
+	 */
+	if (READ_ONCE(ns->binfmt_ns) == NULL) {
+		struct binfmt_namespace *new_ns;
+
+		new_ns = kmalloc(sizeof(struct binfmt_namespace),
+				 GFP_KERNEL);
+		if (new_ns == NULL)
+			return -ENOMEM;
+		INIT_LIST_HEAD(&new_ns->entries);
+		new_ns->enabled = 1;
+		rwlock_init(&new_ns->entries_lock);
+		new_ns->bm_mnt = NULL;
+		new_ns->entry_count = 0;
+		/* ensure new_ns is completely initialized before sharing it */
+		smp_wmb();
+		WRITE_ONCE(ns->binfmt_ns, new_ns);
+	}
+
 	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
 	if (!err)
 		sb->s_op = &s_ops;
@@ -838,7 +884,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent)
 static struct dentry *bm_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	return mount_single(fs_type, flags, data, bm_fill_super);
+	struct user_namespace *ns = current_user_ns();
+
+	return mount_ns(fs_type, flags, data, ns, ns,
+			bm_fill_super);
 }
 
 static struct linux_binfmt misc_format = {
@@ -849,6 +898,7 @@ static struct linux_binfmt misc_format = {
 static struct file_system_type bm_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "binfmt_misc",
+	.fs_flags	= FS_USERNS_MOUNT,
 	.mount		= bm_mount,
 	.kill_sb	= kill_litter_super,
 };
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d6b74b91096b..5c6e7e63b97e 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -52,6 +52,16 @@ enum ucount_type {
 	UCOUNT_COUNTS,
 };
 
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+struct binfmt_namespace {
+	struct list_head entries;
+	rwlock_t entries_lock;
+	int enabled;
+	struct vfsmount *bm_mnt;
+	int entry_count;
+} __randomize_layout;
+#endif
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -76,6 +86,9 @@ struct user_namespace {
 #endif
 	struct ucounts		*ucounts;
 	int ucount_max[UCOUNT_COUNTS];
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+	struct binfmt_namespace *binfmt_ns;
+#endif
 } __randomize_layout;
 
 struct ucounts {
diff --git a/kernel/user.c b/kernel/user.c
index 0df9b1640b2a..912916d435aa 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -19,6 +19,16 @@
 #include <linux/user_namespace.h>
 #include <linux/proc_ns.h>
 
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+static struct binfmt_namespace init_binfmt_ns = {
+	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
+	.enabled = 1,
+	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
+	.bm_mnt = NULL,
+	.entry_count = 0,
+};
+#endif
+
 /*
  * userns count is 1 for root user, 1 for init_uts_ns,
  * and 1 for... ?
@@ -66,6 +76,9 @@ struct user_namespace init_user_ns = {
 	.persistent_keyring_register_sem =
 	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
 #endif
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+	.binfmt_ns = &init_binfmt_ns,
+#endif
 };
 EXPORT_SYMBOL_GPL(init_user_ns);
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e5222b5fb4fe..990cf5950a89 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -195,6 +195,9 @@ static void free_user_ns(struct work_struct *work)
 			kfree(ns->projid_map.forward);
 			kfree(ns->projid_map.reverse);
 		}
+#if IS_ENABLED(CONFIG_BINFMT_MISC)
+		kfree(ns->binfmt_ns);
+#endif
 		retire_userns_sysctls(ns);
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 		key_put(ns->persistent_keyring_register);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 10:37 ` [RFC v5 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
@ 2018-10-09 12:43   ` Jann Horn
  2018-10-09 13:06     ` Laurent Vivier
  2018-10-09 15:16   ` Tycho Andersen
  2018-10-09 16:15   ` Kirill Tkhai
  2 siblings, 1 reply; 16+ messages in thread
From: Jann Horn @ 2018-10-09 12:43 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: kernel list, Eric W. Biederman, dima, Linux API, James Bottomley,
	Al Viro, linux-fsdevel, avagin, containers

On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier <laurent@vivier.eu> wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
>
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
[...]
> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>  {
>         int err;
> +       struct user_namespace *ns = sb->s_user_ns;
>         static const struct tree_descr bm_files[] = {
>                 [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>                 [3] = {"register", &bm_register_operations, S_IWUSR},
>                 /* last one */ {""}
>         };
>
> +       /* create a new binfmt namespace
> +        * if we are not in the first user namespace
> +        * but the binfmt namespace is the first one
> +        */
> +       if (READ_ONCE(ns->binfmt_ns) == NULL) {
> +               struct binfmt_namespace *new_ns;
> +
> +               new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +                                GFP_KERNEL);
> +               if (new_ns == NULL)
> +                       return -ENOMEM;
> +               INIT_LIST_HEAD(&new_ns->entries);
> +               new_ns->enabled = 1;
> +               rwlock_init(&new_ns->entries_lock);
> +               new_ns->bm_mnt = NULL;
> +               new_ns->entry_count = 0;
> +               /* ensure new_ns is completely initialized before sharing it */
> +               smp_wmb();
> +               WRITE_ONCE(ns->binfmt_ns, new_ns);
> +       }

You're still not preventing a concurrent race of two mount() calls,
right? What prevents two instances of this code block from running
concurrently in two different namespaces? I think you want to take
some sort of global lock around this.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 12:43   ` Jann Horn
@ 2018-10-09 13:06     ` Laurent Vivier
  2018-10-09 13:15       ` Jann Horn
  0 siblings, 1 reply; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 13:06 UTC (permalink / raw)
  To: Jann Horn
  Cc: kernel list, Eric W. Biederman, dima, Linux API, James Bottomley,
	Al Viro, linux-fsdevel, avagin, containers

Le 09/10/2018 à 14:43, Jann Horn a écrit :
> On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier <laurent@vivier.eu> wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of an another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
> [...]
>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>  {
>>         int err;
>> +       struct user_namespace *ns = sb->s_user_ns;
>>         static const struct tree_descr bm_files[] = {
>>                 [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>                 [3] = {"register", &bm_register_operations, S_IWUSR},
>>                 /* last one */ {""}
>>         };
>>
>> +       /* create a new binfmt namespace
>> +        * if we are not in the first user namespace
>> +        * but the binfmt namespace is the first one
>> +        */
>> +       if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +               struct binfmt_namespace *new_ns;
>> +
>> +               new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +                                GFP_KERNEL);
>> +               if (new_ns == NULL)
>> +                       return -ENOMEM;
>> +               INIT_LIST_HEAD(&new_ns->entries);
>> +               new_ns->enabled = 1;
>> +               rwlock_init(&new_ns->entries_lock);
>> +               new_ns->bm_mnt = NULL;
>> +               new_ns->entry_count = 0;
>> +               /* ensure new_ns is completely initialized before sharing it */
>> +               smp_wmb();
>> +               WRITE_ONCE(ns->binfmt_ns, new_ns);
>> +       }
> 
> You're still not preventing a concurrent race of two mount() calls,
> right? What prevents two instances of this code block from running
> concurrently in two different namespaces? I think you want to take
> some sort of global lock around this.
> 

My guess was we have only one binfmt superblock by user namespace, so as
we can't have duplicate superblock, we will not have duplicate binfmt_ns
structure. This function is only called once in the namespace and I
think the superblock creation is already protected by some kind of lock.

But I'm not a VFS expert, if someone wants to clarify the situation,
please go ahead.

Thanks,
Laurent




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 13:06     ` Laurent Vivier
@ 2018-10-09 13:15       ` Jann Horn
  0 siblings, 0 replies; 16+ messages in thread
From: Jann Horn @ 2018-10-09 13:15 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: kernel list, Eric W. Biederman, dima, Linux API, James Bottomley,
	Al Viro, linux-fsdevel, avagin, containers

On Tue, Oct 9, 2018 at 3:06 PM Laurent Vivier <laurent@vivier.eu> wrote:
>
> Le 09/10/2018 à 14:43, Jann Horn a écrit :
> > On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier <laurent@vivier.eu> wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of an another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> > [...]
> >> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>  {
> >>         int err;
> >> +       struct user_namespace *ns = sb->s_user_ns;
> >>         static const struct tree_descr bm_files[] = {
> >>                 [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
> >>                 [3] = {"register", &bm_register_operations, S_IWUSR},
> >>                 /* last one */ {""}
> >>         };
> >>
> >> +       /* create a new binfmt namespace
> >> +        * if we are not in the first user namespace
> >> +        * but the binfmt namespace is the first one
> >> +        */
> >> +       if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >> +               struct binfmt_namespace *new_ns;
> >> +
> >> +               new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >> +                                GFP_KERNEL);
> >> +               if (new_ns == NULL)
> >> +                       return -ENOMEM;
> >> +               INIT_LIST_HEAD(&new_ns->entries);
> >> +               new_ns->enabled = 1;
> >> +               rwlock_init(&new_ns->entries_lock);
> >> +               new_ns->bm_mnt = NULL;
> >> +               new_ns->entry_count = 0;
> >> +               /* ensure new_ns is completely initialized before sharing it */
> >> +               smp_wmb();
> >> +               WRITE_ONCE(ns->binfmt_ns, new_ns);
> >> +       }
> >
> > You're still not preventing a concurrent race of two mount() calls,
> > right? What prevents two instances of this code block from running
> > concurrently in two different namespaces? I think you want to take
> > some sort of global lock around this.
> >
>
> My guess was we have only one binfmt superblock by user namespace, so as
> we can't have duplicate superblock, we will not have duplicate binfmt_ns
> structure. This function is only called once in the namespace and I
> think the superblock creation is already protected by some kind of lock.

Ah! Nevermind, I missed the mount_ns().

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 10:37 ` [RFC v5 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
  2018-10-09 12:43   ` Jann Horn
@ 2018-10-09 15:16   ` Tycho Andersen
  2018-10-09 15:19     ` Laurent Vivier
  2018-10-09 16:15   ` Kirill Tkhai
  2 siblings, 1 reply; 16+ messages in thread
From: Tycho Andersen @ 2018-10-09 15:16 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: linux-kernel, Dmitry Safonov, linux-api, containers, Jann Horn,
	James Bottomley, Eric Biederman, linux-fsdevel, Alexander Viro

On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
> @@ -80,18 +74,32 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> +	struct binfmt_namespace *b_ns;
> +
> +	while (ns) {
> +		b_ns = READ_ONCE(ns->binfmt_ns);
> +		if (b_ns)
> +			return b_ns;
> +		ns = ns->parent;
> +	}
> +	WARN_ON_ONCE(1);

It looks like we warn here,

> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  	struct file *interp_file = NULL;
>  	int retval;
>  	int fd_binary = -1;
> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>  	retval = -ENOEXEC;
> -	if (!enabled)
> +	if (!ns->enabled)

...but then in cases like this we immediately dereference the pointer
anyways and crash. Can we return some other error code here in the !ns
case so we don't crash?

Tycho

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 15:16   ` Tycho Andersen
@ 2018-10-09 15:19     ` Laurent Vivier
  2018-10-10  7:14       ` Aleksa Sarai
  2018-10-10 10:11       ` Laurent Vivier
  0 siblings, 2 replies; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 15:19 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, Dmitry Safonov, linux-api, containers, Jann Horn,
	James Bottomley, Eric Biederman, linux-fsdevel, Alexander Viro

Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
> On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
>> @@ -80,18 +74,32 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +	struct binfmt_namespace *b_ns;
>> +
>> +	while (ns) {
>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>> +		if (b_ns)
>> +			return b_ns;
>> +		ns = ns->parent;
>> +	}
>> +	WARN_ON_ONCE(1);
> 
> It looks like we warn here,
> 
>> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>  	struct file *interp_file = NULL;
>>  	int retval;
>>  	int fd_binary = -1;
>> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>  
>>  	retval = -ENOEXEC;
>> -	if (!enabled)
>> +	if (!ns->enabled)
> 
> ...but then in cases like this we immediately dereference the pointer
> anyways and crash. Can we return some other error code here in the !ns
> case so we don't crash?

My concern here is I don't want to add code to check an error case that
cannot happen. The first namespace binfmt_ns pointer is initialized with
&init_binfmt_ns, so the return value cannot be NULL.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 10:37 ` [RFC v5 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
  2018-10-09 12:43   ` Jann Horn
  2018-10-09 15:16   ` Tycho Andersen
@ 2018-10-09 16:15   ` Kirill Tkhai
  2018-10-09 16:45     ` Laurent Vivier
  2 siblings, 1 reply; 16+ messages in thread
From: Kirill Tkhai @ 2018-10-09 16:15 UTC (permalink / raw)
  To: Laurent Vivier, linux-kernel
  Cc: Eric Biederman, Dmitry Safonov, linux-api, James Bottomley,
	Alexander Viro, linux-fsdevel, Andrei Vagin (C),
	containers, Jann Horn

On 09.10.2018 13:37, Laurent Vivier wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
> 
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
> 
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> ---
>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
>  include/linux/user_namespace.h |  13 ++++
>  kernel/user.c                  |  13 ++++
>  kernel/user_namespace.c        |   3 +
>  4 files changed, 107 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index aa4a7a23ff99..1e0029d097d9 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -38,9 +38,6 @@ enum {
>  	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -60,10 +57,7 @@ typedef struct {
>  	struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -80,18 +74,32 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> +	struct binfmt_namespace *b_ns;
> +
> +	while (ns) {
> +		b_ns = READ_ONCE(ns->binfmt_ns);
> +		if (b_ns)
> +			return b_ns;
> +		ns = ns->parent;
> +	}
> +	WARN_ON_ONCE(1);
> +	return NULL;
> +}
> +
>  /*
>   * Check if we support the binfmt
>   * if we do, return the node, else NULL
>   * locking is done in load_misc_binary
>   */
> -static Node *check_file(struct linux_binprm *bprm)
> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm)
>  {
>  	char *p = strrchr(bprm->interp, '.');
>  	struct list_head *l;
>  
>  	/* Walk all the registered handlers. */
> -	list_for_each(l, &entries) {
> +	list_for_each(l, &ns->entries) {
>  		Node *e = list_entry(l, Node, list);
>  		char *s;
>  		int j;
> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>  	struct file *interp_file = NULL;
>  	int retval;
>  	int fd_binary = -1;
> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>  	retval = -ENOEXEC;
> -	if (!enabled)
> +	if (!ns->enabled)
>  		return retval;
>  
>  	/* to keep locking time low, we copy the interpreter string */
> -	read_lock(&entries_lock);
> -	fmt = check_file(bprm);
> +	read_lock(&ns->entries_lock);
> +	fmt = check_file(ns, bprm);
>  	if (fmt)
>  		dget(fmt->dentry);
> -	read_unlock(&entries_lock);
> +	read_unlock(&ns->entries_lock);
>  	if (!fmt)
>  		return retval;
>  
> @@ -609,19 +618,19 @@ static void bm_evict_inode(struct inode *inode)
>  	kfree(e);
>  }
>  
> -static void kill_node(Node *e)
> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>  {
>  	struct dentry *dentry;
>  
> -	write_lock(&entries_lock);
> +	write_lock(&ns->entries_lock);
>  	list_del_init(&e->list);
> -	write_unlock(&entries_lock);
> +	write_unlock(&ns->entries_lock);
>  
>  	dentry = e->dentry;
>  	drop_nlink(d_inode(dentry));
>  	d_drop(dentry);
>  	dput(dentry);
> -	simple_release_fs(&bm_mnt, &entry_count);
> +	simple_release_fs(&ns->bm_mnt, &ns->entry_count);
>  }
>  
>  /* /<entry> */
> @@ -651,6 +660,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  	struct dentry *root;
>  	Node *e = file_inode(file)->i_private;
>  	int res = parse_command(buffer, count);
> +	struct binfmt_namespace *ns;
> +
> +	ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
>  
>  	switch (res) {
>  	case 1:
> @@ -667,7 +679,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer,
>  		inode_lock(d_inode(root));
>  
>  		if (!list_empty(&e->list))
> -			kill_node(e);
> +			kill_node(ns, e);
>  
>  		inode_unlock(d_inode(root));
>  		break;
> @@ -693,6 +705,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	struct inode *inode;
>  	struct super_block *sb = file_inode(file)->i_sb;
>  	struct dentry *root = sb->s_root, *dentry;
> +	struct binfmt_namespace *ns;
>  	int err = 0;
>  
>  	e = create_entry(buffer, count);
> @@ -716,7 +729,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	if (!inode)
>  		goto out2;
>  
> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> +	ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
> +	err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt,
> +			    &ns->entry_count);
>  	if (err) {
>  		iput(inode);
>  		inode = NULL;
> @@ -725,12 +740,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  
>  	if (e->flags & MISC_FMT_OPEN_FILE) {
>  		struct file *f;
> +		const struct cred *old_cred;
>  
> +		old_cred = override_creds(file->f_cred);
>  		f = open_exec(e->interpreter);
> +		revert_creds(old_cred);
>  		if (IS_ERR(f)) {
>  			err = PTR_ERR(f);
>  			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
> -			simple_release_fs(&bm_mnt, &entry_count);
> +			simple_release_fs(&ns->bm_mnt,
> +					  &ns->entry_count);
>  			iput(inode);
>  			inode = NULL;
>  			goto out2;
> @@ -743,9 +762,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	inode->i_fop = &bm_entry_operations;
>  
>  	d_instantiate(dentry, inode);
> -	write_lock(&entries_lock);
> -	list_add(&e->list, &entries);
> -	write_unlock(&entries_lock);
> +	write_lock(&ns->entries_lock);
> +	list_add(&e->list, &ns->entries);
> +	write_unlock(&ns->entries_lock);
>  
>  	err = 0;
>  out2:
> @@ -770,7 +789,9 @@ static const struct file_operations bm_register_operations = {
>  static ssize_t
>  bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  {
> -	char *s = enabled ? "enabled\n" : "disabled\n";
> +	struct binfmt_namespace *ns =
> +				binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
> +	char *s = ns->enabled ? "enabled\n" : "disabled\n";
>  
>  	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
>  }
> @@ -778,25 +799,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
>  static ssize_t bm_status_write(struct file *file, const char __user *buffer,
>  		size_t count, loff_t *ppos)
>  {
> +	struct binfmt_namespace *ns;
>  	int res = parse_command(buffer, count);
>  	struct dentry *root;
>  
> +	ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
>  	switch (res) {
>  	case 1:
>  		/* Disable all handlers. */
> -		enabled = 0;
> +		ns->enabled = 0;
>  		break;
>  	case 2:
>  		/* Enable all handlers. */
> -		enabled = 1;
> +		ns->enabled = 1;
>  		break;
>  	case 3:
>  		/* Delete all handlers. */
>  		root = file_inode(file)->i_sb->s_root;
>  		inode_lock(d_inode(root));
>  
> -		while (!list_empty(&entries))
> -			kill_node(list_first_entry(&entries, Node, list));
> +		while (!list_empty(&ns->entries))
> +			kill_node(ns, list_first_entry(&ns->entries,
> +						       Node, list));
>  
>  		inode_unlock(d_inode(root));
>  		break;
> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>  {
>  	int err;
> +	struct user_namespace *ns = sb->s_user_ns;
>  	static const struct tree_descr bm_files[] = {
>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>  		/* last one */ {""}
>  	};
>  
> +	/* create a new binfmt namespace
> +	 * if we are not in the first user namespace
> +	 * but the binfmt namespace is the first one
> +	 */
> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
> +		struct binfmt_namespace *new_ns;
> +
> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +				 GFP_KERNEL);
> +		if (new_ns == NULL)
> +			return -ENOMEM;
> +		INIT_LIST_HEAD(&new_ns->entries);
> +		new_ns->enabled = 1;
> +		rwlock_init(&new_ns->entries_lock);
> +		new_ns->bm_mnt = NULL;
> +		new_ns->entry_count = 0;
> +		/* ensure new_ns is completely initialized before sharing it */
> +		smp_wmb();

(I haven't dived into patch logic, here just small barrier remark from quick sight).
smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
you want something like below in read hunk:

+		b_ns = READ_ONCE(ns->binfmt_ns);
+		if (b_ns) {
+			smp_rmb();
+			return b_ns;
+		}


> +		WRITE_ONCE(ns->binfmt_ns, new_ns);
> +	}
> +
>  	err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files);
>  	if (!err)
>  		sb->s_op = &s_ops;
> @@ -838,7 +884,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent)
>  static struct dentry *bm_mount(struct file_system_type *fs_type,
>  	int flags, const char *dev_name, void *data)
>  {
> -	return mount_single(fs_type, flags, data, bm_fill_super);
> +	struct user_namespace *ns = current_user_ns();
> +
> +	return mount_ns(fs_type, flags, data, ns, ns,
> +			bm_fill_super);
>  }
>  
>  static struct linux_binfmt misc_format = {
> @@ -849,6 +898,7 @@ static struct linux_binfmt misc_format = {
>  static struct file_system_type bm_fs_type = {
>  	.owner		= THIS_MODULE,
>  	.name		= "binfmt_misc",
> +	.fs_flags	= FS_USERNS_MOUNT,
>  	.mount		= bm_mount,
>  	.kill_sb	= kill_litter_super,
>  };
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index d6b74b91096b..5c6e7e63b97e 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -52,6 +52,16 @@ enum ucount_type {
>  	UCOUNT_COUNTS,
>  };
>  
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +struct binfmt_namespace {
> +	struct list_head entries;
> +	rwlock_t entries_lock;
> +	int enabled;
> +	struct vfsmount *bm_mnt;
> +	int entry_count;
> +} __randomize_layout;
> +#endif
> +
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> @@ -76,6 +86,9 @@ struct user_namespace {
>  #endif
>  	struct ucounts		*ucounts;
>  	int ucount_max[UCOUNT_COUNTS];
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +	struct binfmt_namespace *binfmt_ns;
> +#endif
>  } __randomize_layout;
>  
>  struct ucounts {
> diff --git a/kernel/user.c b/kernel/user.c
> index 0df9b1640b2a..912916d435aa 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -19,6 +19,16 @@
>  #include <linux/user_namespace.h>
>  #include <linux/proc_ns.h>
>  
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +static struct binfmt_namespace init_binfmt_ns = {
> +	.entries = LIST_HEAD_INIT(init_binfmt_ns.entries),
> +	.enabled = 1,
> +	.entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock),
> +	.bm_mnt = NULL,
> +	.entry_count = 0,
> +};
> +#endif
> +
>  /*
>   * userns count is 1 for root user, 1 for init_uts_ns,
>   * and 1 for... ?
> @@ -66,6 +76,9 @@ struct user_namespace init_user_ns = {
>  	.persistent_keyring_register_sem =
>  	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
>  #endif
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +	.binfmt_ns = &init_binfmt_ns,
> +#endif
>  };
>  EXPORT_SYMBOL_GPL(init_user_ns);
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index e5222b5fb4fe..990cf5950a89 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -195,6 +195,9 @@ static void free_user_ns(struct work_struct *work)
>  			kfree(ns->projid_map.forward);
>  			kfree(ns->projid_map.reverse);
>  		}
> +#if IS_ENABLED(CONFIG_BINFMT_MISC)
> +		kfree(ns->binfmt_ns);
> +#endif
>  		retire_userns_sysctls(ns);
>  #ifdef CONFIG_PERSISTENT_KEYRINGS
>  		key_put(ns->persistent_keyring_register);
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 16:15   ` Kirill Tkhai
@ 2018-10-09 16:45     ` Laurent Vivier
  2018-10-09 16:53       ` Jann Horn
  2018-10-09 17:01       ` Kirill Tkhai
  0 siblings, 2 replies; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 16:45 UTC (permalink / raw)
  To: Kirill Tkhai, linux-kernel
  Cc: Eric Biederman, Dmitry Safonov, linux-api, James Bottomley,
	Alexander Viro, linux-fsdevel, Andrei Vagin (C),
	containers, Jann Horn

Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> On 09.10.2018 13:37, Laurent Vivier wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of an another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
>>
>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>> ---
>>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
>>  include/linux/user_namespace.h |  13 ++++
>>  kernel/user.c                  |  13 ++++
>>  kernel/user_namespace.c        |   3 +
>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>> index aa4a7a23ff99..1e0029d097d9 100644
>> --- a/fs/binfmt_misc.c
>> +++ b/fs/binfmt_misc.c
...
>> @@ -80,18 +74,32 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +	struct binfmt_namespace *b_ns;
>> +
>> +	while (ns) {
>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>> +		if (b_ns)
>> +			return b_ns;
>> +		ns = ns->parent;
>> +	}
>> +	WARN_ON_ONCE(1);
>> +	return NULL;
>> +}
>> +
...
>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>  {
>>  	int err;
>> +	struct user_namespace *ns = sb->s_user_ns;
>>  	static const struct tree_descr bm_files[] = {
>>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>>  		/* last one */ {""}
>>  	};
>>  
>> +	/* create a new binfmt namespace
>> +	 * if we are not in the first user namespace
>> +	 * but the binfmt namespace is the first one
>> +	 */
>> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +		struct binfmt_namespace *new_ns;
>> +
>> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +				 GFP_KERNEL);
>> +		if (new_ns == NULL)
>> +			return -ENOMEM;
>> +		INIT_LIST_HEAD(&new_ns->entries);
>> +		new_ns->enabled = 1;
>> +		rwlock_init(&new_ns->entries_lock);
>> +		new_ns->bm_mnt = NULL;
>> +		new_ns->entry_count = 0;
>> +		/* ensure new_ns is completely initialized before sharing it */
>> +		smp_wmb();
> 
> (I haven't dived into patch logic, here just small barrier remark from quick sight).
> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> you want something like below in read hunk:
> 
> +		b_ns = READ_ONCE(ns->binfmt_ns);
> +		if (b_ns) {
> +			smp_rmb();
> +			return b_ns;
> +		}
> 
> 

The write barrier is here to ensure the structure is fully written
before we set the pointer.

I don't understand how read barrier can change something at this level,
IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
have correctly initialized the pointer and the structure when we read
the pointer back.

I think the pointer itself is the "barrier" to access the memory
modified before.

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 16:45     ` Laurent Vivier
@ 2018-10-09 16:53       ` Jann Horn
  2018-10-09 16:57         ` Laurent Vivier
  2018-10-09 17:01       ` Kirill Tkhai
  1 sibling, 1 reply; 16+ messages in thread
From: Jann Horn @ 2018-10-09 16:53 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: ktkhai, kernel list, Eric W. Biederman, dima, Linux API,
	James Bottomley, Al Viro, linux-fsdevel, Andrei Vagin,
	containers

On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier <laurent@vivier.eu> wrote:
> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> > On 09.10.2018 13:37, Laurent Vivier wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of an another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> >>
> >> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> >> ---
> >>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
> >>  include/linux/user_namespace.h |  13 ++++
> >>  kernel/user.c                  |  13 ++++
> >>  kernel/user_namespace.c        |   3 +
> >>  4 files changed, 107 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> >> index aa4a7a23ff99..1e0029d097d9 100644
> >> --- a/fs/binfmt_misc.c
> >> +++ b/fs/binfmt_misc.c
> ...
> >> @@ -80,18 +74,32 @@ static int entry_count;
> >>   */
> >>  #define MAX_REGISTER_LENGTH 1920
> >>
> >> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >> +{
> >> +    struct binfmt_namespace *b_ns;
> >> +
> >> +    while (ns) {
> >> +            b_ns = READ_ONCE(ns->binfmt_ns);
> >> +            if (b_ns)
> >> +                    return b_ns;
> >> +            ns = ns->parent;
> >> +    }
> >> +    WARN_ON_ONCE(1);
> >> +    return NULL;
> >> +}
> >> +
> ...
> >> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>  {
> >>      int err;
> >> +    struct user_namespace *ns = sb->s_user_ns;
> >>      static const struct tree_descr bm_files[] = {
> >>              [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
> >>              [3] = {"register", &bm_register_operations, S_IWUSR},
> >>              /* last one */ {""}
> >>      };
> >>
> >> +    /* create a new binfmt namespace
> >> +     * if we are not in the first user namespace
> >> +     * but the binfmt namespace is the first one
> >> +     */
> >> +    if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >> +            struct binfmt_namespace *new_ns;
> >> +
> >> +            new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >> +                             GFP_KERNEL);
> >> +            if (new_ns == NULL)
> >> +                    return -ENOMEM;
> >> +            INIT_LIST_HEAD(&new_ns->entries);
> >> +            new_ns->enabled = 1;
> >> +            rwlock_init(&new_ns->entries_lock);
> >> +            new_ns->bm_mnt = NULL;
> >> +            new_ns->entry_count = 0;
> >> +            /* ensure new_ns is completely initialized before sharing it */
> >> +            smp_wmb();
> >
> > (I haven't dived into patch logic, here just small barrier remark from quick sight).
> > smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> > you want something like below in read hunk:
> >
> > +             b_ns = READ_ONCE(ns->binfmt_ns);
> > +             if (b_ns) {
> > +                     smp_rmb();
> > +                     return b_ns;
> > +             }
> >
> >
>
> The write barrier is here to ensure the structure is fully written
> before we set the pointer.
>
> I don't understand how read barrier can change something at this level,
> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> have correctly initialized the pointer and the structure when we read
> the pointer back.
>
> I think the pointer itself is the "barrier" to access the memory
> modified before.

Things don't work that way on alpha, but that's why READ_ONCE()
includes an smp_read_barrier_depends():

#define __READ_ONCE(x, check)                                           \
({                                                                      \
        union { typeof(x) __val; char __c[1]; } __u;                    \
        if (check)                                                      \
                __read_once_size(&(x), __u.__c, sizeof(x));             \
        else                                                            \
                __read_once_size_nocheck(&(x), __u.__c, sizeof(x));     \
        smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
        __u.__val;                                                      \
})
#define READ_ONCE(x) __READ_ONCE(x, 1)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 16:53       ` Jann Horn
@ 2018-10-09 16:57         ` Laurent Vivier
  2018-10-09 17:01           ` Jann Horn
  0 siblings, 1 reply; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 16:57 UTC (permalink / raw)
  To: Jann Horn
  Cc: ktkhai, kernel list, Eric W. Biederman, dima, Linux API,
	James Bottomley, Al Viro, linux-fsdevel, Andrei Vagin,
	containers

Le 09/10/2018 à 18:53, Jann Horn a écrit :
> On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier <laurent@vivier.eu> wrote:
>> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>>> On 09.10.2018 13:37, Laurent Vivier wrote:
>>>> This patch allows to have a different binfmt_misc configuration
>>>> for each new user namespace. By default, the binfmt_misc configuration
>>>> is the one of the previous level, but if the binfmt_misc filesystem is
>>>> mounted in the new namespace a new empty binfmt instance is created and
>>>> used in this namespace.
>>>>
>>>> For instance, using "unshare" we can start a chroot of an another
>>>> architecture and configure the binfmt_misc interpreter without being root
>>>> to run the binaries in this chroot.
>>>>
>>>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>>>> ---
>>>>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
>>>>  include/linux/user_namespace.h |  13 ++++
>>>>  kernel/user.c                  |  13 ++++
>>>>  kernel/user_namespace.c        |   3 +
>>>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>>>> index aa4a7a23ff99..1e0029d097d9 100644
>>>> --- a/fs/binfmt_misc.c
>>>> +++ b/fs/binfmt_misc.c
>> ...
>>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>>   */
>>>>  #define MAX_REGISTER_LENGTH 1920
>>>>
>>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>>> +{
>>>> +    struct binfmt_namespace *b_ns;
>>>> +
>>>> +    while (ns) {
>>>> +            b_ns = READ_ONCE(ns->binfmt_ns);
>>>> +            if (b_ns)
>>>> +                    return b_ns;
>>>> +            ns = ns->parent;
>>>> +    }
>>>> +    WARN_ON_ONCE(1);
>>>> +    return NULL;
>>>> +}
>>>> +
>> ...
>>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>>>  {
>>>>      int err;
>>>> +    struct user_namespace *ns = sb->s_user_ns;
>>>>      static const struct tree_descr bm_files[] = {
>>>>              [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>>>              [3] = {"register", &bm_register_operations, S_IWUSR},
>>>>              /* last one */ {""}
>>>>      };
>>>>
>>>> +    /* create a new binfmt namespace
>>>> +     * if we are not in the first user namespace
>>>> +     * but the binfmt namespace is the first one
>>>> +     */
>>>> +    if (READ_ONCE(ns->binfmt_ns) == NULL) {
>>>> +            struct binfmt_namespace *new_ns;
>>>> +
>>>> +            new_ns = kmalloc(sizeof(struct binfmt_namespace),
>>>> +                             GFP_KERNEL);
>>>> +            if (new_ns == NULL)
>>>> +                    return -ENOMEM;
>>>> +            INIT_LIST_HEAD(&new_ns->entries);
>>>> +            new_ns->enabled = 1;
>>>> +            rwlock_init(&new_ns->entries_lock);
>>>> +            new_ns->bm_mnt = NULL;
>>>> +            new_ns->entry_count = 0;
>>>> +            /* ensure new_ns is completely initialized before sharing it */
>>>> +            smp_wmb();
>>>
>>> (I haven't dived into patch logic, here just small barrier remark from quick sight).
>>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>>> you want something like below in read hunk:
>>>
>>> +             b_ns = READ_ONCE(ns->binfmt_ns);
>>> +             if (b_ns) {
>>> +                     smp_rmb();
>>> +                     return b_ns;
>>> +             }
>>>
>>>
>>
>> The write barrier is here to ensure the structure is fully written
>> before we set the pointer.
>>
>> I don't understand how read barrier can change something at this level,
>> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
>> have correctly initialized the pointer and the structure when we read
>> the pointer back.
>>
>> I think the pointer itself is the "barrier" to access the memory
>> modified before.
> 
> Things don't work that way on alpha, but that's why READ_ONCE()
> includes an smp_read_barrier_depends():
> 
> #define __READ_ONCE(x, check)                                           \
> ({                                                                      \
>         union { typeof(x) __val; char __c[1]; } __u;                    \
>         if (check)                                                      \
>                 __read_once_size(&(x), __u.__c, sizeof(x));             \
>         else                                                            \
>                 __read_once_size_nocheck(&(x), __u.__c, sizeof(x));     \
>         smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
>         __u.__val;                                                      \
> })
> #define READ_ONCE(x) __READ_ONCE(x, 1)
> 

So my questions are:

- do we need a smp_wmb() barrier if we use READ_ONCE() and WRITE_ONCE()?

- if we need an smp_wmb() barrier, do we need an smp_rmb() barrier as
the data we want to "protect" are behind an access to the pointer?

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 16:45     ` Laurent Vivier
  2018-10-09 16:53       ` Jann Horn
@ 2018-10-09 17:01       ` Kirill Tkhai
  2018-10-09 17:22         ` Laurent Vivier
  1 sibling, 1 reply; 16+ messages in thread
From: Kirill Tkhai @ 2018-10-09 17:01 UTC (permalink / raw)
  To: Laurent Vivier, linux-kernel
  Cc: Eric Biederman, Dmitry Safonov, linux-api, James Bottomley,
	Alexander Viro, linux-fsdevel, Andrei Vagin (C),
	containers, Jann Horn

On 09.10.2018 19:45, Laurent Vivier wrote:
> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>> On 09.10.2018 13:37, Laurent Vivier wrote:
>>> This patch allows to have a different binfmt_misc configuration
>>> for each new user namespace. By default, the binfmt_misc configuration
>>> is the one of the previous level, but if the binfmt_misc filesystem is
>>> mounted in the new namespace a new empty binfmt instance is created and
>>> used in this namespace.
>>>
>>> For instance, using "unshare" we can start a chroot of an another
>>> architecture and configure the binfmt_misc interpreter without being root
>>> to run the binaries in this chroot.
>>>
>>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>>> ---
>>>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
>>>  include/linux/user_namespace.h |  13 ++++
>>>  kernel/user.c                  |  13 ++++
>>>  kernel/user_namespace.c        |   3 +
>>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>>> index aa4a7a23ff99..1e0029d097d9 100644
>>> --- a/fs/binfmt_misc.c
>>> +++ b/fs/binfmt_misc.c
> ...
>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>   */
>>>  #define MAX_REGISTER_LENGTH 1920
>>>  
>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>> +{
>>> +	struct binfmt_namespace *b_ns;
>>> +
>>> +	while (ns) {
>>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>>> +		if (b_ns)
>>> +			return b_ns;
>>> +		ns = ns->parent;
>>> +	}
>>> +	WARN_ON_ONCE(1);
>>> +	return NULL;
>>> +}
>>> +
> ...
>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>>  {
>>>  	int err;
>>> +	struct user_namespace *ns = sb->s_user_ns;
>>>  	static const struct tree_descr bm_files[] = {
>>>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>>>  		/* last one */ {""}
>>>  	};
>>>  
>>> +	/* create a new binfmt namespace
>>> +	 * if we are not in the first user namespace
>>> +	 * but the binfmt namespace is the first one
>>> +	 */
>>> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
>>> +		struct binfmt_namespace *new_ns;
>>> +
>>> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
>>> +				 GFP_KERNEL);
>>> +		if (new_ns == NULL)
>>> +			return -ENOMEM;
>>> +		INIT_LIST_HEAD(&new_ns->entries);
>>> +		new_ns->enabled = 1;
>>> +		rwlock_init(&new_ns->entries_lock);
>>> +		new_ns->bm_mnt = NULL;
>>> +		new_ns->entry_count = 0;
>>> +		/* ensure new_ns is completely initialized before sharing it */
>>> +		smp_wmb();
>>
>> (I haven't dived into patch logic, here just small barrier remark from quick sight).
>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>> you want something like below in read hunk:
>>
>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>> +		if (b_ns) {
>> +			smp_rmb();
>> +			return b_ns;
>> +		}
>>
>>
> 
> The write barrier is here to ensure the structure is fully written
> before we set the pointer.
> 
> I don't understand how read barrier can change something at this level,
> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> have correctly initialized the pointer and the structure when we read
> the pointer back.
> 
> I think the pointer itself is the "barrier" to access the memory
> modified before.

smp_rmb() guarantees you see stores in the order you want. If you have:

[cpu0]					[cpu1]
new_ns->entry_count = 0; 
smp_wmb();
WRITE_ONCE(ns->binfmt_ns, new_ns); 	b_ns = READ_ONCE(ns->binfmt_ns);
					smp_rmb();
					<access b_ns->entry_count>

smp_rmb() guarantees you see true entry_count on the cpu1. Without
smp_rmb() you may see old value of new_ns->entry_count.
					
See Documentation/memory-barriers.txt

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 16:57         ` Laurent Vivier
@ 2018-10-09 17:01           ` Jann Horn
  0 siblings, 0 replies; 16+ messages in thread
From: Jann Horn @ 2018-10-09 17:01 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: ktkhai, kernel list, Eric W. Biederman, dima, Linux API,
	James Bottomley, Al Viro, linux-fsdevel, Andrei Vagin,
	containers

On Tue, Oct 9, 2018 at 6:57 PM Laurent Vivier <laurent@vivier.eu> wrote:
> Le 09/10/2018 à 18:53, Jann Horn a écrit :
> > On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier <laurent@vivier.eu> wrote:
> >> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> >>> On 09.10.2018 13:37, Laurent Vivier wrote:
> >>>> This patch allows to have a different binfmt_misc configuration
> >>>> for each new user namespace. By default, the binfmt_misc configuration
> >>>> is the one of the previous level, but if the binfmt_misc filesystem is
> >>>> mounted in the new namespace a new empty binfmt instance is created and
> >>>> used in this namespace.
> >>>>
> >>>> For instance, using "unshare" we can start a chroot of an another
> >>>> architecture and configure the binfmt_misc interpreter without being root
> >>>> to run the binaries in this chroot.
> >>>>
> >>>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> >>>> ---
> >>>>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
> >>>>  include/linux/user_namespace.h |  13 ++++
> >>>>  kernel/user.c                  |  13 ++++
> >>>>  kernel/user_namespace.c        |   3 +
> >>>>  4 files changed, 107 insertions(+), 28 deletions(-)
> >>>>
> >>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> >>>> index aa4a7a23ff99..1e0029d097d9 100644
> >>>> --- a/fs/binfmt_misc.c
> >>>> +++ b/fs/binfmt_misc.c
> >> ...
> >>>> @@ -80,18 +74,32 @@ static int entry_count;
> >>>>   */
> >>>>  #define MAX_REGISTER_LENGTH 1920
> >>>>
> >>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >>>> +{
> >>>> +    struct binfmt_namespace *b_ns;
> >>>> +
> >>>> +    while (ns) {
> >>>> +            b_ns = READ_ONCE(ns->binfmt_ns);
> >>>> +            if (b_ns)
> >>>> +                    return b_ns;
> >>>> +            ns = ns->parent;
> >>>> +    }
> >>>> +    WARN_ON_ONCE(1);
> >>>> +    return NULL;
> >>>> +}
> >>>> +
> >> ...
> >>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>>>  {
> >>>>      int err;
> >>>> +    struct user_namespace *ns = sb->s_user_ns;
> >>>>      static const struct tree_descr bm_files[] = {
> >>>>              [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
> >>>>              [3] = {"register", &bm_register_operations, S_IWUSR},
> >>>>              /* last one */ {""}
> >>>>      };
> >>>>
> >>>> +    /* create a new binfmt namespace
> >>>> +     * if we are not in the first user namespace
> >>>> +     * but the binfmt namespace is the first one
> >>>> +     */
> >>>> +    if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >>>> +            struct binfmt_namespace *new_ns;
> >>>> +
> >>>> +            new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >>>> +                             GFP_KERNEL);
> >>>> +            if (new_ns == NULL)
> >>>> +                    return -ENOMEM;
> >>>> +            INIT_LIST_HEAD(&new_ns->entries);
> >>>> +            new_ns->enabled = 1;
> >>>> +            rwlock_init(&new_ns->entries_lock);
> >>>> +            new_ns->bm_mnt = NULL;
> >>>> +            new_ns->entry_count = 0;
> >>>> +            /* ensure new_ns is completely initialized before sharing it */
> >>>> +            smp_wmb();
> >>>
> >>> (I haven't dived into patch logic, here just small barrier remark from quick sight).
> >>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> >>> you want something like below in read hunk:
> >>>
> >>> +             b_ns = READ_ONCE(ns->binfmt_ns);
> >>> +             if (b_ns) {
> >>> +                     smp_rmb();
> >>> +                     return b_ns;
> >>> +             }
> >>>
> >>>
> >>
> >> The write barrier is here to ensure the structure is fully written
> >> before we set the pointer.
> >>
> >> I don't understand how read barrier can change something at this level,
> >> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> >> have correctly initialized the pointer and the structure when we read
> >> the pointer back.
> >>
> >> I think the pointer itself is the "barrier" to access the memory
> >> modified before.
> >
> > Things don't work that way on alpha, but that's why READ_ONCE()
> > includes an smp_read_barrier_depends():
> >
> > #define __READ_ONCE(x, check)                                           \
> > ({                                                                      \
> >         union { typeof(x) __val; char __c[1]; } __u;                    \
> >         if (check)                                                      \
> >                 __read_once_size(&(x), __u.__c, sizeof(x));             \
> >         else                                                            \
> >                 __read_once_size_nocheck(&(x), __u.__c, sizeof(x));     \
> >         smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
> >         __u.__val;                                                      \
> > })
> > #define READ_ONCE(x) __READ_ONCE(x, 1)
> >
>
> So my questions are:
>
> - do we need a smp_wmb() barrier if we use READ_ONCE() and WRITE_ONCE()?

Yes.

> - if we need an smp_wmb() barrier, do we need an smp_rmb() barrier as
> the data we want to "protect" are behind an access to the pointer?

No. You only need an smp_read_barrier_depends() to access things
behind the pointer you're reading (documented in a big comment block
in arch/alpha/include/asm/barrier.h); and that barrier is implied by
READ_ONCE(), so you don't have to write it yourself.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 17:01       ` Kirill Tkhai
@ 2018-10-09 17:22         ` Laurent Vivier
  0 siblings, 0 replies; 16+ messages in thread
From: Laurent Vivier @ 2018-10-09 17:22 UTC (permalink / raw)
  To: Kirill Tkhai, linux-kernel
  Cc: Eric Biederman, Dmitry Safonov, linux-api, James Bottomley,
	Alexander Viro, linux-fsdevel, Andrei Vagin (C),
	containers, Jann Horn

Le 09/10/2018 à 19:01, Kirill Tkhai a écrit :
> On 09.10.2018 19:45, Laurent Vivier wrote:
>> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>>> On 09.10.2018 13:37, Laurent Vivier wrote:
>>>> This patch allows to have a different binfmt_misc configuration
>>>> for each new user namespace. By default, the binfmt_misc configuration
>>>> is the one of the previous level, but if the binfmt_misc filesystem is
>>>> mounted in the new namespace a new empty binfmt instance is created and
>>>> used in this namespace.
>>>>
>>>> For instance, using "unshare" we can start a chroot of an another
>>>> architecture and configure the binfmt_misc interpreter without being root
>>>> to run the binaries in this chroot.
>>>>
>>>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>>>> ---
>>>>  fs/binfmt_misc.c               | 106 ++++++++++++++++++++++++---------
>>>>  include/linux/user_namespace.h |  13 ++++
>>>>  kernel/user.c                  |  13 ++++
>>>>  kernel/user_namespace.c        |   3 +
>>>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>>>> index aa4a7a23ff99..1e0029d097d9 100644
>>>> --- a/fs/binfmt_misc.c
>>>> +++ b/fs/binfmt_misc.c
>> ...
>>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>>   */
>>>>  #define MAX_REGISTER_LENGTH 1920
>>>>  
>>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>>> +{
>>>> +	struct binfmt_namespace *b_ns;
>>>> +
>>>> +	while (ns) {
>>>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>>>> +		if (b_ns)
>>>> +			return b_ns;
>>>> +		ns = ns->parent;
>>>> +	}
>>>> +	WARN_ON_ONCE(1);
>>>> +	return NULL;
>>>> +}
>>>> +
>> ...
>>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>>>  {
>>>>  	int err;
>>>> +	struct user_namespace *ns = sb->s_user_ns;
>>>>  	static const struct tree_descr bm_files[] = {
>>>>  		[2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO},
>>>>  		[3] = {"register", &bm_register_operations, S_IWUSR},
>>>>  		/* last one */ {""}
>>>>  	};
>>>>  
>>>> +	/* create a new binfmt namespace
>>>> +	 * if we are not in the first user namespace
>>>> +	 * but the binfmt namespace is the first one
>>>> +	 */
>>>> +	if (READ_ONCE(ns->binfmt_ns) == NULL) {
>>>> +		struct binfmt_namespace *new_ns;
>>>> +
>>>> +		new_ns = kmalloc(sizeof(struct binfmt_namespace),
>>>> +				 GFP_KERNEL);
>>>> +		if (new_ns == NULL)
>>>> +			return -ENOMEM;
>>>> +		INIT_LIST_HEAD(&new_ns->entries);
>>>> +		new_ns->enabled = 1;
>>>> +		rwlock_init(&new_ns->entries_lock);
>>>> +		new_ns->bm_mnt = NULL;
>>>> +		new_ns->entry_count = 0;
>>>> +		/* ensure new_ns is completely initialized before sharing it */
>>>> +		smp_wmb();
>>>
>>> (I haven't dived into patch logic, here just small barrier remark from quick sight).
>>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>>> you want something like below in read hunk:
>>>
>>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>>> +		if (b_ns) {
>>> +			smp_rmb();
>>> +			return b_ns;
>>> +		}
>>>
>>>
>>
>> The write barrier is here to ensure the structure is fully written
>> before we set the pointer.
>>
>> I don't understand how read barrier can change something at this level,
>> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
>> have correctly initialized the pointer and the structure when we read
>> the pointer back.
>>
>> I think the pointer itself is the "barrier" to access the memory
>> modified before.
> 
> smp_rmb() guarantees you see stores in the order you want. If you have:
> 
> [cpu0]					[cpu1]
> new_ns->entry_count = 0; 
> smp_wmb();
> WRITE_ONCE(ns->binfmt_ns, new_ns); 	b_ns = READ_ONCE(ns->binfmt_ns);
> 					smp_rmb();
> 					<access b_ns->entry_count>
> 
> smp_rmb() guarantees you see true entry_count on the cpu1. Without
> smp_rmb() you may see old value of new_ns->entry_count.
> 					
> See Documentation/memory-barriers.txt

Yes, I tried to read this document several times...

What I understand from example line 1077 (7696f9910a9a
Documentation/memory-barriers.txt) is we only need a data dependency
barrier, and as explained by Jann it comes with the READ_ONCE() (and is
only needed for alpha).

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 15:19     ` Laurent Vivier
@ 2018-10-10  7:14       ` Aleksa Sarai
  2018-10-10 10:11       ` Laurent Vivier
  1 sibling, 0 replies; 16+ messages in thread
From: Aleksa Sarai @ 2018-10-10  7:14 UTC (permalink / raw)
  Cc: Tycho Andersen, Jann Horn, linux-api, containers, Dmitry Safonov,
	linux-kernel, James Bottomley, Eric Biederman, linux-fsdevel,
	Alexander Viro

[-- Attachment #1: Type: text/plain, Size: 1610 bytes --]

On 2018-10-09, Laurent Vivier <laurent@vivier.eu> wrote:
> Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
> > On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
> >> @@ -80,18 +74,32 @@ static int entry_count;
> >>   */
> >>  #define MAX_REGISTER_LENGTH 1920
> >>  
> >> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >> +{
> >> +	struct binfmt_namespace *b_ns;
> >> +
> >> +	while (ns) {
> >> +		b_ns = READ_ONCE(ns->binfmt_ns);
> >> +		if (b_ns)
> >> +			return b_ns;
> >> +		ns = ns->parent;
> >> +	}
> >> +	WARN_ON_ONCE(1);
> > 
> > It looks like we warn here,
> > 
> >> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
> >>  	struct file *interp_file = NULL;
> >>  	int retval;
> >>  	int fd_binary = -1;
> >> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
> >>  
> >>  	retval = -ENOEXEC;
> >> -	if (!enabled)
> >> +	if (!ns->enabled)
> > 
> > ...but then in cases like this we immediately dereference the pointer
> > anyways and crash. Can we return some other error code here in the !ns
> > case so we don't crash?
> 
> My concern here is I don't want to add code to check an error case that
> cannot happen. The first namespace binfmt_ns pointer is initialized with
> &init_binfmt_ns, so the return value cannot be NULL.

I'd argue that BUG() is a better thing to do then -- if doing a dummy
error path makes no sense. Though IIRC BUG() is no longer a popular
thing to do.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace
  2018-10-09 15:19     ` Laurent Vivier
  2018-10-10  7:14       ` Aleksa Sarai
@ 2018-10-10 10:11       ` Laurent Vivier
  1 sibling, 0 replies; 16+ messages in thread
From: Laurent Vivier @ 2018-10-10 10:11 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: linux-kernel, Dmitry Safonov, linux-api, containers, Jann Horn,
	James Bottomley, Eric Biederman, linux-fsdevel, Alexander Viro

On 09/10/2018 17:19, Laurent Vivier wrote:
> Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
>> On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>   */
>>>  #define MAX_REGISTER_LENGTH 1920
>>>  
>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>> +{
>>> +	struct binfmt_namespace *b_ns;
>>> +
>>> +	while (ns) {
>>> +		b_ns = READ_ONCE(ns->binfmt_ns);
>>> +		if (b_ns)
>>> +			return b_ns;
>>> +		ns = ns->parent;
>>> +	}
>>> +	WARN_ON_ONCE(1);
>>
>> It looks like we warn here,
>>
>>> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>>  	struct file *interp_file = NULL;
>>>  	int retval;
>>>  	int fd_binary = -1;
>>> +	struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>>  
>>>  	retval = -ENOEXEC;
>>> -	if (!enabled)
>>> +	if (!ns->enabled)
>>
>> ...but then in cases like this we immediately dereference the pointer
>> anyways and crash. Can we return some other error code here in the !ns
>> case so we don't crash?
> 
> My concern here is I don't want to add code to check an error case that
> cannot happen. The first namespace binfmt_ns pointer is initialized with
> &init_binfmt_ns, so the return value cannot be NULL.

Perhaps it could be reasonable to return &init_binfmt_ns rather than
NULL in this case?

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-10-10 10:12 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-09 10:37 [RFC v5 0/1] ns: introduce binfmt_misc namespace Laurent Vivier
2018-10-09 10:37 ` [RFC v5 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier
2018-10-09 12:43   ` Jann Horn
2018-10-09 13:06     ` Laurent Vivier
2018-10-09 13:15       ` Jann Horn
2018-10-09 15:16   ` Tycho Andersen
2018-10-09 15:19     ` Laurent Vivier
2018-10-10  7:14       ` Aleksa Sarai
2018-10-10 10:11       ` Laurent Vivier
2018-10-09 16:15   ` Kirill Tkhai
2018-10-09 16:45     ` Laurent Vivier
2018-10-09 16:53       ` Jann Horn
2018-10-09 16:57         ` Laurent Vivier
2018-10-09 17:01           ` Jann Horn
2018-10-09 17:01       ` Kirill Tkhai
2018-10-09 17:22         ` Laurent Vivier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).