* [PATCH v6 0/1] ns: introduce binfmt_misc namespace @ 2018-10-10 16:14 Laurent Vivier 2018-10-10 16:14 ` [PATCH v6 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier 2018-10-16 9:52 ` [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 0 siblings, 2 replies; 17+ messages in thread From: Laurent Vivier @ 2018-10-10 16:14 UTC (permalink / raw) To: linux-kernel Cc: Jann Horn, James Bottomley, linux-api, linux-fsdevel, Andrei Vagin, Alexander Viro, Eric Biederman, containers, Dmitry Safonov, Laurent Vivier v6: Return &init_binfmt_ns instead of NULL in binfmt_ns() This should never happen, but to stay safe return a value we can use. change subject from "RFC" to "PATCH" v5: Use READ_ONCE()/WRITE_ONCE() move mount pointer struct init to bm_fill_super() and add smp_wmb() remove useless NULL value init add WARN_ON_ONCE() v4: first user namespace is initialized with &init_binfmt_ns, all new user namespaces are initialized with a NULL and use the one of the first parent that is not NULL. The pointer is initialized to a valid value the first time the binfmt_misc fs is mounted in the current user namespace. This allows to not change the way it was working before: new ns inherits values from its parent, and if parent value is modified (or parent creates its own binfmt entry by mounting the fs) child inherits it (unless it has itself mounted the fs). v3: create a structure to store binfmt_misc data, add a pointer to this structure in the user_namespace structure, in init_user_ns structure this pointer points to an init_binfmt_ns structure. And all new user namespaces point to this init structure. A new binfmt namespace structure is allocated if the binfmt_misc filesystem is mounted in a user namespace that is not the initial one but its binfmt namespace pointer points to the initial one. add override_creds()/revert_creds() around open_exec() in bm_register_write() v2: no new namespace, binfmt_misc data are now part of the mount namespace I put this in mount namespace instead of user namespace because the mount namespace is already needed and I don't want to force to have the user namespace for that. As this is a filesystem, it seems logic to have it here. This allows to define a new interpreter for each new container. But the main goal is to be able to chroot to a directory using a binfmt_misc interpreter without being root. I have a modified version of unshare at: git@github.com:vivier/util-linux.git branch unshare-chroot with some new options to unshare binfmt_misc namespace and to chroot to a directory. If you have a directory /chroot/powerpc/jessie containing debian for powerpc binaries and a qemu-ppc interpreter, you can do for instance: $ uname -a Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux $ ./unshare --map-root-user --fork --pid \ --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \ --root=/chroot/powerpc/jessie /bin/bash -l # uname -a Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux # id uid=0(root) gid=0(root) groups=0(root),65534(nogroup) # ls -l total 5940 drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:58 bin drwxr-xr-x. 2 nobody nogroup 4096 Jun 17 20:26 boot drwxr-xr-x. 4 nobody nogroup 4096 Aug 12 00:08 dev drwxr-xr-x. 42 nobody nogroup 4096 Sep 28 07:25 etc drwxr-xr-x. 3 nobody nogroup 4096 Sep 28 07:25 home drwxr-xr-x. 9 nobody nogroup 4096 Aug 12 00:58 lib drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 media drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 mnt drwxr-xr-x. 3 nobody nogroup 4096 Aug 12 13:09 opt dr-xr-xr-x. 143 nobody nogroup 0 Sep 30 23:02 proc -rwxr-xr-x. 1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc drwx------. 3 nobody nogroup 4096 Aug 12 12:54 root drwxr-xr-x. 3 nobody nogroup 4096 Aug 12 00:08 run drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:58 sbin drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 srv drwxr-xr-x. 2 nobody nogroup 4096 Apr 6 2015 sys drwxrwxrwt. 2 nobody nogroup 4096 Sep 28 10:31 tmp drwxr-xr-x. 10 nobody nogroup 4096 Aug 12 00:08 usr drwxr-xr-x. 11 nobody nogroup 4096 Aug 12 00:08 var If you want to use the qemu binary provided by your distro, you can use --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF" With the 'F' flag, qemu-ppc-static will be then loaded from the main root filesystem before switching to the chroot. Laurent Vivier (1): ns: add binfmt_misc to the user namespace fs/binfmt_misc.c | 111 ++++++++++++++++++++++++--------- include/linux/user_namespace.h | 15 +++++ kernel/user.c | 14 +++++ kernel/user_namespace.c | 3 + 4 files changed, 115 insertions(+), 28 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-10 16:14 [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier @ 2018-10-10 16:14 ` Laurent Vivier 2018-10-16 10:13 ` Rasmus Villemoes 2018-10-16 16:22 ` Andrei Vagin 2018-10-16 9:52 ` [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 1 sibling, 2 replies; 17+ messages in thread From: Laurent Vivier @ 2018-10-10 16:14 UTC (permalink / raw) To: linux-kernel Cc: Jann Horn, James Bottomley, linux-api, linux-fsdevel, Andrei Vagin, Alexander Viro, Eric Biederman, containers, Dmitry Safonov, Laurent Vivier This patch allows to have a different binfmt_misc configuration for each new user namespace. By default, the binfmt_misc configuration is the one of the previous level, but if the binfmt_misc filesystem is mounted in the new namespace a new empty binfmt instance is created and used in this namespace. For instance, using "unshare" we can start a chroot of another architecture and configure the binfmt_misc interpreter without being root to run the binaries in this chroot. Signed-off-by: Laurent Vivier <laurent@vivier.eu> --- fs/binfmt_misc.c | 111 ++++++++++++++++++++++++--------- include/linux/user_namespace.h | 15 +++++ kernel/user.c | 14 +++++ kernel/user_namespace.c | 3 + 4 files changed, 115 insertions(+), 28 deletions(-) diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index aa4a7a23ff99..df9dc3248b7b 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -38,9 +38,6 @@ enum { VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */ }; -static LIST_HEAD(entries); -static int enabled = 1; - enum {Enabled, Magic}; #define MISC_FMT_PRESERVE_ARGV0 (1 << 31) #define MISC_FMT_OPEN_BINARY (1 << 30) @@ -60,10 +57,7 @@ typedef struct { struct file *interp_file; } Node; -static DEFINE_RWLOCK(entries_lock); static struct file_system_type bm_fs_type; -static struct vfsmount *bm_mnt; -static int entry_count; /* * Max length of the register string. Determined by: @@ -80,18 +74,37 @@ static int entry_count; */ #define MAX_REGISTER_LENGTH 1920 +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns) +{ + struct binfmt_namespace *b_ns; + + while (ns) { + b_ns = READ_ONCE(ns->binfmt_ns); + if (b_ns) + return b_ns; + ns = ns->parent; + } + /* as the first user namespace is initialized with + * &init_binfmt_ns we should never come here + * but we try to stay safe by logging a warning + * and returning a sane value + */ + WARN_ON_ONCE(1); + return &init_binfmt_ns; +} + /* * Check if we support the binfmt * if we do, return the node, else NULL * locking is done in load_misc_binary */ -static Node *check_file(struct linux_binprm *bprm) +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm) { char *p = strrchr(bprm->interp, '.'); struct list_head *l; /* Walk all the registered handlers. */ - list_for_each(l, &entries) { + list_for_each(l, &ns->entries) { Node *e = list_entry(l, Node, list); char *s; int j; @@ -133,17 +146,18 @@ static int load_misc_binary(struct linux_binprm *bprm) struct file *interp_file = NULL; int retval; int fd_binary = -1; + struct binfmt_namespace *ns = binfmt_ns(current_user_ns()); retval = -ENOEXEC; - if (!enabled) + if (!ns->enabled) return retval; /* to keep locking time low, we copy the interpreter string */ - read_lock(&entries_lock); - fmt = check_file(bprm); + read_lock(&ns->entries_lock); + fmt = check_file(ns, bprm); if (fmt) dget(fmt->dentry); - read_unlock(&entries_lock); + read_unlock(&ns->entries_lock); if (!fmt) return retval; @@ -609,19 +623,19 @@ static void bm_evict_inode(struct inode *inode) kfree(e); } -static void kill_node(Node *e) +static void kill_node(struct binfmt_namespace *ns, Node *e) { struct dentry *dentry; - write_lock(&entries_lock); + write_lock(&ns->entries_lock); list_del_init(&e->list); - write_unlock(&entries_lock); + write_unlock(&ns->entries_lock); dentry = e->dentry; drop_nlink(d_inode(dentry)); d_drop(dentry); dput(dentry); - simple_release_fs(&bm_mnt, &entry_count); + simple_release_fs(&ns->bm_mnt, &ns->entry_count); } /* /<entry> */ @@ -651,6 +665,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer, struct dentry *root; Node *e = file_inode(file)->i_private; int res = parse_command(buffer, count); + struct binfmt_namespace *ns; + + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); switch (res) { case 1: @@ -667,7 +684,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer, inode_lock(d_inode(root)); if (!list_empty(&e->list)) - kill_node(e); + kill_node(ns, e); inode_unlock(d_inode(root)); break; @@ -693,6 +710,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, struct inode *inode; struct super_block *sb = file_inode(file)->i_sb; struct dentry *root = sb->s_root, *dentry; + struct binfmt_namespace *ns; int err = 0; e = create_entry(buffer, count); @@ -716,7 +734,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, if (!inode) goto out2; - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); + err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt, + &ns->entry_count); if (err) { iput(inode); inode = NULL; @@ -725,12 +745,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, if (e->flags & MISC_FMT_OPEN_FILE) { struct file *f; + const struct cred *old_cred; + old_cred = override_creds(file->f_cred); f = open_exec(e->interpreter); + revert_creds(old_cred); if (IS_ERR(f)) { err = PTR_ERR(f); pr_notice("register: failed to install interpreter file %s\n", e->interpreter); - simple_release_fs(&bm_mnt, &entry_count); + simple_release_fs(&ns->bm_mnt, + &ns->entry_count); iput(inode); inode = NULL; goto out2; @@ -743,9 +767,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, inode->i_fop = &bm_entry_operations; d_instantiate(dentry, inode); - write_lock(&entries_lock); - list_add(&e->list, &entries); - write_unlock(&entries_lock); + write_lock(&ns->entries_lock); + list_add(&e->list, &ns->entries); + write_unlock(&ns->entries_lock); err = 0; out2: @@ -770,7 +794,9 @@ static const struct file_operations bm_register_operations = { static ssize_t bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) { - char *s = enabled ? "enabled\n" : "disabled\n"; + struct binfmt_namespace *ns = + binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); + char *s = ns->enabled ? "enabled\n" : "disabled\n"; return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s)); } @@ -778,25 +804,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) static ssize_t bm_status_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { + struct binfmt_namespace *ns; int res = parse_command(buffer, count); struct dentry *root; + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); switch (res) { case 1: /* Disable all handlers. */ - enabled = 0; + ns->enabled = 0; break; case 2: /* Enable all handlers. */ - enabled = 1; + ns->enabled = 1; break; case 3: /* Delete all handlers. */ root = file_inode(file)->i_sb->s_root; inode_lock(d_inode(root)); - while (!list_empty(&entries)) - kill_node(list_first_entry(&entries, Node, list)); + while (!list_empty(&ns->entries)) + kill_node(ns, list_first_entry(&ns->entries, + Node, list)); inode_unlock(d_inode(root)); break; @@ -823,12 +852,34 @@ static const struct super_operations s_ops = { static int bm_fill_super(struct super_block *sb, void *data, int silent) { int err; + struct user_namespace *ns = sb->s_user_ns; static const struct tree_descr bm_files[] = { [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO}, [3] = {"register", &bm_register_operations, S_IWUSR}, /* last one */ {""} }; + /* create a new binfmt namespace + * if we are not in the first user namespace + * but the binfmt namespace is the first one + */ + if (READ_ONCE(ns->binfmt_ns) == NULL) { + struct binfmt_namespace *new_ns; + + new_ns = kmalloc(sizeof(struct binfmt_namespace), + GFP_KERNEL); + if (new_ns == NULL) + return -ENOMEM; + INIT_LIST_HEAD(&new_ns->entries); + new_ns->enabled = 1; + rwlock_init(&new_ns->entries_lock); + new_ns->bm_mnt = NULL; + new_ns->entry_count = 0; + /* ensure new_ns is completely initialized before sharing it */ + smp_wmb(); + WRITE_ONCE(ns->binfmt_ns, new_ns); + } + err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files); if (!err) sb->s_op = &s_ops; @@ -838,7 +889,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent) static struct dentry *bm_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - return mount_single(fs_type, flags, data, bm_fill_super); + struct user_namespace *ns = current_user_ns(); + + return mount_ns(fs_type, flags, data, ns, ns, + bm_fill_super); } static struct linux_binfmt misc_format = { @@ -849,6 +903,7 @@ static struct linux_binfmt misc_format = { static struct file_system_type bm_fs_type = { .owner = THIS_MODULE, .name = "binfmt_misc", + .fs_flags = FS_USERNS_MOUNT, .mount = bm_mount, .kill_sb = kill_litter_super, }; diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index d6b74b91096b..319141da5315 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -52,6 +52,18 @@ enum ucount_type { UCOUNT_COUNTS, }; +#if IS_ENABLED(CONFIG_BINFMT_MISC) +struct binfmt_namespace { + struct list_head entries; + rwlock_t entries_lock; + int enabled; + struct vfsmount *bm_mnt; + int entry_count; +} __randomize_layout; + +extern struct binfmt_namespace init_binfmt_ns; +#endif + struct user_namespace { struct uid_gid_map uid_map; struct uid_gid_map gid_map; @@ -76,6 +88,9 @@ struct user_namespace { #endif struct ucounts *ucounts; int ucount_max[UCOUNT_COUNTS]; +#if IS_ENABLED(CONFIG_BINFMT_MISC) + struct binfmt_namespace *binfmt_ns; +#endif } __randomize_layout; struct ucounts { diff --git a/kernel/user.c b/kernel/user.c index 0df9b1640b2a..220ab2053d44 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -19,6 +19,17 @@ #include <linux/user_namespace.h> #include <linux/proc_ns.h> +#if IS_ENABLED(CONFIG_BINFMT_MISC) +struct binfmt_namespace init_binfmt_ns = { + .entries = LIST_HEAD_INIT(init_binfmt_ns.entries), + .enabled = 1, + .entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock), + .bm_mnt = NULL, + .entry_count = 0, +}; +EXPORT_SYMBOL_GPL(init_binfmt_ns); +#endif + /* * userns count is 1 for root user, 1 for init_uts_ns, * and 1 for... ? @@ -66,6 +77,9 @@ struct user_namespace init_user_ns = { .persistent_keyring_register_sem = __RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem), #endif +#if IS_ENABLED(CONFIG_BINFMT_MISC) + .binfmt_ns = &init_binfmt_ns, +#endif }; EXPORT_SYMBOL_GPL(init_user_ns); diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index e5222b5fb4fe..990cf5950a89 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -195,6 +195,9 @@ static void free_user_ns(struct work_struct *work) kfree(ns->projid_map.forward); kfree(ns->projid_map.reverse); } +#if IS_ENABLED(CONFIG_BINFMT_MISC) + kfree(ns->binfmt_ns); +#endif retire_userns_sysctls(ns); #ifdef CONFIG_PERSISTENT_KEYRINGS key_put(ns->persistent_keyring_register); -- 2.17.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-10 16:14 ` [PATCH v6 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier @ 2018-10-16 10:13 ` Rasmus Villemoes 2018-10-16 10:53 ` Laurent Vivier 2018-10-16 16:22 ` Andrei Vagin 1 sibling, 1 reply; 17+ messages in thread From: Rasmus Villemoes @ 2018-10-16 10:13 UTC (permalink / raw) To: Laurent Vivier, linux-kernel Cc: Jann Horn, James Bottomley, linux-api, linux-fsdevel, Andrei Vagin, Alexander Viro, Eric Biederman, containers, Dmitry Safonov On 2018-10-10 18:14, Laurent Vivier wrote: > + /* create a new binfmt namespace > + * if we are not in the first user namespace > + * but the binfmt namespace is the first one > + */ > + if (READ_ONCE(ns->binfmt_ns) == NULL) { > + struct binfmt_namespace *new_ns; > + > + new_ns = kmalloc(sizeof(struct binfmt_namespace), > + GFP_KERNEL); > + if (new_ns == NULL) > + return -ENOMEM; > + INIT_LIST_HEAD(&new_ns->entries); > + new_ns->enabled = 1; > + rwlock_init(&new_ns->entries_lock); > + new_ns->bm_mnt = NULL; > + new_ns->entry_count = 0; > + /* ensure new_ns is completely initialized before sharing it */ > + smp_wmb(); > + WRITE_ONCE(ns->binfmt_ns, new_ns); > + } If ns->binfmt_ns can really change under us (given you use READ_ONCE), what prevents two instances of this code running at the same time, in which case one of them would leak its new_ns instance? Also, there doesn't seem to be any smp_rmb() buddy to that wmb(), I don't think that's implied by READ_ONCE() in binfmt_ns(). Rasmus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-16 10:13 ` Rasmus Villemoes @ 2018-10-16 10:53 ` Laurent Vivier 0 siblings, 0 replies; 17+ messages in thread From: Laurent Vivier @ 2018-10-16 10:53 UTC (permalink / raw) To: Rasmus Villemoes, linux-kernel Cc: Jann Horn, James Bottomley, linux-api, linux-fsdevel, Andrei Vagin, Alexander Viro, Eric Biederman, containers, Dmitry Safonov Le 16/10/2018 à 12:13, Rasmus Villemoes a écrit : > On 2018-10-10 18:14, Laurent Vivier wrote: > >> + /* create a new binfmt namespace >> + * if we are not in the first user namespace >> + * but the binfmt namespace is the first one >> + */ >> + if (READ_ONCE(ns->binfmt_ns) == NULL) { >> + struct binfmt_namespace *new_ns; >> + >> + new_ns = kmalloc(sizeof(struct binfmt_namespace), >> + GFP_KERNEL); >> + if (new_ns == NULL) >> + return -ENOMEM; >> + INIT_LIST_HEAD(&new_ns->entries); >> + new_ns->enabled = 1; >> + rwlock_init(&new_ns->entries_lock); >> + new_ns->bm_mnt = NULL; >> + new_ns->entry_count = 0; >> + /* ensure new_ns is completely initialized before sharing it */ >> + smp_wmb(); >> + WRITE_ONCE(ns->binfmt_ns, new_ns); >> + } > > If ns->binfmt_ns can really change under us (given you use READ_ONCE), > what prevents two instances of this code running at the same time, in > which case one of them would leak its new_ns instance? Also, there See https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1782780.html > doesn't seem to be any smp_rmb() buddy to that wmb(), I don't think > that's implied by READ_ONCE() in binfmt_ns(). See https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1783049.html Thanks, Laurent ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-10 16:14 ` [PATCH v6 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier 2018-10-16 10:13 ` Rasmus Villemoes @ 2018-10-16 16:22 ` Andrei Vagin 2018-10-24 17:15 ` Laurent Vivier 1 sibling, 1 reply; 17+ messages in thread From: Andrei Vagin @ 2018-10-16 16:22 UTC (permalink / raw) To: Laurent Vivier Cc: linux-kernel, Jann Horn, James Bottomley, linux-api, linux-fsdevel, Alexander Viro, Eric Biederman, containers, Dmitry Safonov On Wed, Oct 10, 2018 at 06:14:30PM +0200, Laurent Vivier wrote: > This patch allows to have a different binfmt_misc configuration > for each new user namespace. By default, the binfmt_misc configuration > is the one of the previous level, but if the binfmt_misc filesystem is > mounted in the new namespace a new empty binfmt instance is created and > used in this namespace. > > For instance, using "unshare" we can start a chroot of another > architecture and configure the binfmt_misc interpreter without being root > to run the binaries in this chroot. > > Signed-off-by: Laurent Vivier <laurent@vivier.eu> Acked-by: Andrei Vagin <avagin@gmail.com> Thanks, Andrei > --- > fs/binfmt_misc.c | 111 ++++++++++++++++++++++++--------- > include/linux/user_namespace.h | 15 +++++ > kernel/user.c | 14 +++++ > kernel/user_namespace.c | 3 + > 4 files changed, 115 insertions(+), 28 deletions(-) > > diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c > index aa4a7a23ff99..df9dc3248b7b 100644 > --- a/fs/binfmt_misc.c > +++ b/fs/binfmt_misc.c > @@ -38,9 +38,6 @@ enum { > VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */ > }; > > -static LIST_HEAD(entries); > -static int enabled = 1; > - > enum {Enabled, Magic}; > #define MISC_FMT_PRESERVE_ARGV0 (1 << 31) > #define MISC_FMT_OPEN_BINARY (1 << 30) > @@ -60,10 +57,7 @@ typedef struct { > struct file *interp_file; > } Node; > > -static DEFINE_RWLOCK(entries_lock); > static struct file_system_type bm_fs_type; > -static struct vfsmount *bm_mnt; > -static int entry_count; > > /* > * Max length of the register string. Determined by: > @@ -80,18 +74,37 @@ static int entry_count; > */ > #define MAX_REGISTER_LENGTH 1920 > > +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns) > +{ > + struct binfmt_namespace *b_ns; > + > + while (ns) { > + b_ns = READ_ONCE(ns->binfmt_ns); > + if (b_ns) > + return b_ns; > + ns = ns->parent; > + } > + /* as the first user namespace is initialized with > + * &init_binfmt_ns we should never come here > + * but we try to stay safe by logging a warning > + * and returning a sane value > + */ > + WARN_ON_ONCE(1); > + return &init_binfmt_ns; > +} > + > /* > * Check if we support the binfmt > * if we do, return the node, else NULL > * locking is done in load_misc_binary > */ > -static Node *check_file(struct linux_binprm *bprm) > +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm *bprm) > { > char *p = strrchr(bprm->interp, '.'); > struct list_head *l; > > /* Walk all the registered handlers. */ > - list_for_each(l, &entries) { > + list_for_each(l, &ns->entries) { > Node *e = list_entry(l, Node, list); > char *s; > int j; > @@ -133,17 +146,18 @@ static int load_misc_binary(struct linux_binprm *bprm) > struct file *interp_file = NULL; > int retval; > int fd_binary = -1; > + struct binfmt_namespace *ns = binfmt_ns(current_user_ns()); > > retval = -ENOEXEC; > - if (!enabled) > + if (!ns->enabled) > return retval; > > /* to keep locking time low, we copy the interpreter string */ > - read_lock(&entries_lock); > - fmt = check_file(bprm); > + read_lock(&ns->entries_lock); > + fmt = check_file(ns, bprm); > if (fmt) > dget(fmt->dentry); > - read_unlock(&entries_lock); > + read_unlock(&ns->entries_lock); > if (!fmt) > return retval; > > @@ -609,19 +623,19 @@ static void bm_evict_inode(struct inode *inode) > kfree(e); > } > > -static void kill_node(Node *e) > +static void kill_node(struct binfmt_namespace *ns, Node *e) > { > struct dentry *dentry; > > - write_lock(&entries_lock); > + write_lock(&ns->entries_lock); > list_del_init(&e->list); > - write_unlock(&entries_lock); > + write_unlock(&ns->entries_lock); > > dentry = e->dentry; > drop_nlink(d_inode(dentry)); > d_drop(dentry); > dput(dentry); > - simple_release_fs(&bm_mnt, &entry_count); > + simple_release_fs(&ns->bm_mnt, &ns->entry_count); > } > > /* /<entry> */ > @@ -651,6 +665,9 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer, > struct dentry *root; > Node *e = file_inode(file)->i_private; > int res = parse_command(buffer, count); > + struct binfmt_namespace *ns; > + > + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); > > switch (res) { > case 1: > @@ -667,7 +684,7 @@ static ssize_t bm_entry_write(struct file *file, const char __user *buffer, > inode_lock(d_inode(root)); > > if (!list_empty(&e->list)) > - kill_node(e); > + kill_node(ns, e); > > inode_unlock(d_inode(root)); > break; > @@ -693,6 +710,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > struct inode *inode; > struct super_block *sb = file_inode(file)->i_sb; > struct dentry *root = sb->s_root, *dentry; > + struct binfmt_namespace *ns; > int err = 0; > > e = create_entry(buffer, count); > @@ -716,7 +734,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > if (!inode) > goto out2; > > - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); > + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); > + err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt, > + &ns->entry_count); > if (err) { > iput(inode); > inode = NULL; > @@ -725,12 +745,16 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > > if (e->flags & MISC_FMT_OPEN_FILE) { > struct file *f; > + const struct cred *old_cred; > > + old_cred = override_creds(file->f_cred); > f = open_exec(e->interpreter); > + revert_creds(old_cred); > if (IS_ERR(f)) { > err = PTR_ERR(f); > pr_notice("register: failed to install interpreter file %s\n", e->interpreter); > - simple_release_fs(&bm_mnt, &entry_count); > + simple_release_fs(&ns->bm_mnt, > + &ns->entry_count); > iput(inode); > inode = NULL; > goto out2; > @@ -743,9 +767,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > inode->i_fop = &bm_entry_operations; > > d_instantiate(dentry, inode); > - write_lock(&entries_lock); > - list_add(&e->list, &entries); > - write_unlock(&entries_lock); > + write_lock(&ns->entries_lock); > + list_add(&e->list, &ns->entries); > + write_unlock(&ns->entries_lock); > > err = 0; > out2: > @@ -770,7 +794,9 @@ static const struct file_operations bm_register_operations = { > static ssize_t > bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) > { > - char *s = enabled ? "enabled\n" : "disabled\n"; > + struct binfmt_namespace *ns = > + binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); > + char *s = ns->enabled ? "enabled\n" : "disabled\n"; > > return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s)); > } > @@ -778,25 +804,28 @@ bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) > static ssize_t bm_status_write(struct file *file, const char __user *buffer, > size_t count, loff_t *ppos) > { > + struct binfmt_namespace *ns; > int res = parse_command(buffer, count); > struct dentry *root; > > + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns); > switch (res) { > case 1: > /* Disable all handlers. */ > - enabled = 0; > + ns->enabled = 0; > break; > case 2: > /* Enable all handlers. */ > - enabled = 1; > + ns->enabled = 1; > break; > case 3: > /* Delete all handlers. */ > root = file_inode(file)->i_sb->s_root; > inode_lock(d_inode(root)); > > - while (!list_empty(&entries)) > - kill_node(list_first_entry(&entries, Node, list)); > + while (!list_empty(&ns->entries)) > + kill_node(ns, list_first_entry(&ns->entries, > + Node, list)); > > inode_unlock(d_inode(root)); > break; > @@ -823,12 +852,34 @@ static const struct super_operations s_ops = { > static int bm_fill_super(struct super_block *sb, void *data, int silent) > { > int err; > + struct user_namespace *ns = sb->s_user_ns; > static const struct tree_descr bm_files[] = { > [2] = {"status", &bm_status_operations, S_IWUSR|S_IRUGO}, > [3] = {"register", &bm_register_operations, S_IWUSR}, > /* last one */ {""} > }; > > + /* create a new binfmt namespace > + * if we are not in the first user namespace > + * but the binfmt namespace is the first one > + */ > + if (READ_ONCE(ns->binfmt_ns) == NULL) { > + struct binfmt_namespace *new_ns; > + > + new_ns = kmalloc(sizeof(struct binfmt_namespace), > + GFP_KERNEL); > + if (new_ns == NULL) > + return -ENOMEM; > + INIT_LIST_HEAD(&new_ns->entries); > + new_ns->enabled = 1; > + rwlock_init(&new_ns->entries_lock); > + new_ns->bm_mnt = NULL; > + new_ns->entry_count = 0; > + /* ensure new_ns is completely initialized before sharing it */ > + smp_wmb(); > + WRITE_ONCE(ns->binfmt_ns, new_ns); > + } > + > err = simple_fill_super(sb, BINFMTFS_MAGIC, bm_files); > if (!err) > sb->s_op = &s_ops; > @@ -838,7 +889,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent) > static struct dentry *bm_mount(struct file_system_type *fs_type, > int flags, const char *dev_name, void *data) > { > - return mount_single(fs_type, flags, data, bm_fill_super); > + struct user_namespace *ns = current_user_ns(); > + > + return mount_ns(fs_type, flags, data, ns, ns, > + bm_fill_super); > } > > static struct linux_binfmt misc_format = { > @@ -849,6 +903,7 @@ static struct linux_binfmt misc_format = { > static struct file_system_type bm_fs_type = { > .owner = THIS_MODULE, > .name = "binfmt_misc", > + .fs_flags = FS_USERNS_MOUNT, > .mount = bm_mount, > .kill_sb = kill_litter_super, > }; > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h > index d6b74b91096b..319141da5315 100644 > --- a/include/linux/user_namespace.h > +++ b/include/linux/user_namespace.h > @@ -52,6 +52,18 @@ enum ucount_type { > UCOUNT_COUNTS, > }; > > +#if IS_ENABLED(CONFIG_BINFMT_MISC) > +struct binfmt_namespace { > + struct list_head entries; > + rwlock_t entries_lock; > + int enabled; > + struct vfsmount *bm_mnt; > + int entry_count; > +} __randomize_layout; > + > +extern struct binfmt_namespace init_binfmt_ns; > +#endif > + > struct user_namespace { > struct uid_gid_map uid_map; > struct uid_gid_map gid_map; > @@ -76,6 +88,9 @@ struct user_namespace { > #endif > struct ucounts *ucounts; > int ucount_max[UCOUNT_COUNTS]; > +#if IS_ENABLED(CONFIG_BINFMT_MISC) > + struct binfmt_namespace *binfmt_ns; > +#endif > } __randomize_layout; > > struct ucounts { > diff --git a/kernel/user.c b/kernel/user.c > index 0df9b1640b2a..220ab2053d44 100644 > --- a/kernel/user.c > +++ b/kernel/user.c > @@ -19,6 +19,17 @@ > #include <linux/user_namespace.h> > #include <linux/proc_ns.h> > > +#if IS_ENABLED(CONFIG_BINFMT_MISC) > +struct binfmt_namespace init_binfmt_ns = { > + .entries = LIST_HEAD_INIT(init_binfmt_ns.entries), > + .enabled = 1, > + .entries_lock = __RW_LOCK_UNLOCKED(init_binfmt_ns.entries_lock), > + .bm_mnt = NULL, > + .entry_count = 0, > +}; > +EXPORT_SYMBOL_GPL(init_binfmt_ns); > +#endif > + > /* > * userns count is 1 for root user, 1 for init_uts_ns, > * and 1 for... ? > @@ -66,6 +77,9 @@ struct user_namespace init_user_ns = { > .persistent_keyring_register_sem = > __RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem), > #endif > +#if IS_ENABLED(CONFIG_BINFMT_MISC) > + .binfmt_ns = &init_binfmt_ns, > +#endif > }; > EXPORT_SYMBOL_GPL(init_user_ns); > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index e5222b5fb4fe..990cf5950a89 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -195,6 +195,9 @@ static void free_user_ns(struct work_struct *work) > kfree(ns->projid_map.forward); > kfree(ns->projid_map.reverse); > } > +#if IS_ENABLED(CONFIG_BINFMT_MISC) > + kfree(ns->binfmt_ns); > +#endif > retire_userns_sysctls(ns); > #ifdef CONFIG_PERSISTENT_KEYRINGS > key_put(ns->persistent_keyring_register); > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-16 16:22 ` Andrei Vagin @ 2018-10-24 17:15 ` Laurent Vivier 2018-10-30 8:51 ` Laurent Vivier 0 siblings, 1 reply; 17+ messages in thread From: Laurent Vivier @ 2018-10-24 17:15 UTC (permalink / raw) To: Andrei Vagin Cc: linux-kernel, Jann Horn, James Bottomley, linux-api, linux-fsdevel, Alexander Viro, Eric Biederman, containers, Dmitry Safonov On 16/10/2018 17:22, Andrei Vagin wrote: > On Wed, Oct 10, 2018 at 06:14:30PM +0200, Laurent Vivier wrote: >> This patch allows to have a different binfmt_misc configuration >> for each new user namespace. By default, the binfmt_misc configuration >> is the one of the previous level, but if the binfmt_misc filesystem is >> mounted in the new namespace a new empty binfmt instance is created and >> used in this namespace. >> >> For instance, using "unshare" we can start a chroot of another >> architecture and configure the binfmt_misc interpreter without being root >> to run the binaries in this chroot. >> >> Signed-off-by: Laurent Vivier <laurent@vivier.eu> > > Acked-by: Andrei Vagin <avagin@gmail.com> > > Thanks, > Andrei > I don't konw who is the maintainer for this part, but is there any chance to have this merged in 4.20? Thanks, Laurent ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-24 17:15 ` Laurent Vivier @ 2018-10-30 8:51 ` Laurent Vivier 2018-11-06 19:40 ` Jann Horn 0 siblings, 1 reply; 17+ messages in thread From: Laurent Vivier @ 2018-10-30 8:51 UTC (permalink / raw) To: Eric Biederman Cc: Andrei Vagin, linux-kernel, Jann Horn, James Bottomley, linux-api, linux-fsdevel, Alexander Viro, containers, Dmitry Safonov Le 24/10/2018 à 19:15, Laurent Vivier a écrit : > On 16/10/2018 17:22, Andrei Vagin wrote: >> On Wed, Oct 10, 2018 at 06:14:30PM +0200, Laurent Vivier wrote: >>> This patch allows to have a different binfmt_misc configuration >>> for each new user namespace. By default, the binfmt_misc configuration >>> is the one of the previous level, but if the binfmt_misc filesystem is >>> mounted in the new namespace a new empty binfmt instance is created and >>> used in this namespace. >>> >>> For instance, using "unshare" we can start a chroot of another >>> architecture and configure the binfmt_misc interpreter without being root >>> to run the binaries in this chroot. >>> >>> Signed-off-by: Laurent Vivier <laurent@vivier.eu> >> >> Acked-by: Andrei Vagin <avagin@gmail.com> >> >> Thanks, >> Andrei >> > > I don't konw who is the maintainer for this part, but is there any > chance to have this merged in 4.20? > I'd really want to have this merged. I have some real use cases for this: 1- to allow a non root user to run a container (with "unshare" for instance) with its own binfmt_misc configuration. For instance, like we provide a disk image and ask an ordinary user to run it with his favorite VM hypervisor, we can provide a tar.gz containing our own interpreter and just ask him to unshare+chroot to the exploded file tree, 2- to allow to run automatic tests of an interpreter on a machine without having to change the global configuration of the system. I have in mind to add some tests in Avocado to automatically test qemu-linux-user in containers, so the interpreter path can depend on the build path and possibly run them concurrently, 3- to select an interpreter by container. For instance, on the qemu-devel mailing list, we have a waiting patch to add the bFLT interpreter binfmt_misc configuration, but the bFLT doesn't provide the CPU type in magic/mask. So it would be interesting to be able to select also a bFLT interpreter by container, as we know the CPU architecture we have in each chroot/container, 4- another example to select an interpreter by container is qemu-mips and qemu-misn32 share the same magic/mask because only the kernel API changes, so we can't configure both on the system (but I agree it's a QEMU bug: they should be merged and the kernel API be selected at runtime). Thanks, Laurent ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 1/1] ns: add binfmt_misc to the user namespace 2018-10-30 8:51 ` Laurent Vivier @ 2018-11-06 19:40 ` Jann Horn 0 siblings, 0 replies; 17+ messages in thread From: Jann Horn @ 2018-11-06 19:40 UTC (permalink / raw) To: Laurent Vivier, Andrew Morton Cc: Eric W. Biederman, Andrei Vagin, kernel list, James Bottomley, Linux API, linux-fsdevel, Al Viro, containers, dima On Tue, Oct 30, 2018 at 9:51 AM Laurent Vivier <laurent@vivier.eu> wrote: > Le 24/10/2018 à 19:15, Laurent Vivier a écrit : > > On 16/10/2018 17:22, Andrei Vagin wrote: > >> On Wed, Oct 10, 2018 at 06:14:30PM +0200, Laurent Vivier wrote: > >>> This patch allows to have a different binfmt_misc configuration > >>> for each new user namespace. By default, the binfmt_misc configuration > >>> is the one of the previous level, but if the binfmt_misc filesystem is > >>> mounted in the new namespace a new empty binfmt instance is created and > >>> used in this namespace. > >>> > >>> For instance, using "unshare" we can start a chroot of another > >>> architecture and configure the binfmt_misc interpreter without being root > >>> to run the binaries in this chroot. > >>> > >>> Signed-off-by: Laurent Vivier <laurent@vivier.eu> > >> > >> Acked-by: Andrei Vagin <avagin@gmail.com> > >> > >> Thanks, > >> Andrei > >> > > > > I don't konw who is the maintainer for this part, I think Andrew Morton is the right maintainer here. > > but is there any > > chance to have this merged in 4.20? > > > > I'd really want to have this merged. > > I have some real use cases for this: > > 1- to allow a non root user to run a container (with "unshare" for > instance) with its own binfmt_misc configuration. For instance, like we > provide a disk image and ask an ordinary user to run it with his > favorite VM hypervisor, we can provide a tar.gz containing our own > interpreter and just ask him to unshare+chroot to the exploded file tree, > > 2- to allow to run automatic tests of an interpreter on a machine > without having to change the global configuration of the system. I have > in mind to add some tests in Avocado to automatically test > qemu-linux-user in containers, so the interpreter path can depend on the > build path and possibly run them concurrently, > > 3- to select an interpreter by container. For instance, on the > qemu-devel mailing list, we have a waiting patch to add the bFLT > interpreter binfmt_misc configuration, but the bFLT doesn't provide the > CPU type in magic/mask. So it would be interesting to be able to select > also a bFLT interpreter by container, as we know the CPU architecture we > have in each chroot/container, > > 4- another example to select an interpreter by container is qemu-mips > and qemu-misn32 share the same magic/mask because only the kernel API > changes, so we can't configure both on the system (but I agree it's a > QEMU bug: they should be merged and the kernel API be selected at runtime). ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-10-10 16:14 [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 2018-10-10 16:14 ` [PATCH v6 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier @ 2018-10-16 9:52 ` Laurent Vivier 2018-11-01 2:59 ` James Bottomley 1 sibling, 1 reply; 17+ messages in thread From: Laurent Vivier @ 2018-10-16 9:52 UTC (permalink / raw) To: linux-kernel Cc: Jann Horn, James Bottomley, linux-api, linux-fsdevel, Andrei Vagin, Alexander Viro, Eric Biederman, containers, Dmitry Safonov Hi, Any comment on this last version? Any chance to be merged? Thanks, Laurent Le 10/10/2018 à 18:14, Laurent Vivier a écrit : > v6: Return &init_binfmt_ns instead of NULL in binfmt_ns() > This should never happen, but to stay safe return a > value we can use. > change subject from "RFC" to "PATCH" > > v5: Use READ_ONCE()/WRITE_ONCE() > move mount pointer struct init to bm_fill_super() and add smp_wmb() > remove useless NULL value init > add WARN_ON_ONCE() > > v4: first user namespace is initialized with &init_binfmt_ns, > all new user namespaces are initialized with a NULL and use > the one of the first parent that is not NULL. The pointer > is initialized to a valid value the first time the binfmt_misc > fs is mounted in the current user namespace. > This allows to not change the way it was working before: > new ns inherits values from its parent, and if parent value is modified > (or parent creates its own binfmt entry by mounting the fs) child > inherits it (unless it has itself mounted the fs). > > v3: create a structure to store binfmt_misc data, > add a pointer to this structure in the user_namespace structure, > in init_user_ns structure this pointer points to an init_binfmt_ns > structure. And all new user namespaces point to this init structure. > A new binfmt namespace structure is allocated if the binfmt_misc > filesystem is mounted in a user namespace that is not the initial > one but its binfmt namespace pointer points to the initial one. > add override_creds()/revert_creds() around open_exec() in > bm_register_write() > > v2: no new namespace, binfmt_misc data are now part of > the mount namespace > I put this in mount namespace instead of user namespace > because the mount namespace is already needed and > I don't want to force to have the user namespace for that. > As this is a filesystem, it seems logic to have it here. > > This allows to define a new interpreter for each new container. > > But the main goal is to be able to chroot to a directory > using a binfmt_misc interpreter without being root. > > I have a modified version of unshare at: > > git@github.com:vivier/util-linux.git branch unshare-chroot > > with some new options to unshare binfmt_misc namespace and to chroot > to a directory. > > If you have a directory /chroot/powerpc/jessie containing debian for powerpc > binaries and a qemu-ppc interpreter, you can do for instance: > > $ uname -a > Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux > $ ./unshare --map-root-user --fork --pid \ > --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \ > --root=/chroot/powerpc/jessie /bin/bash -l > # uname -a > Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux > # id > uid=0(root) gid=0(root) groups=0(root),65534(nogroup) > # ls -l > total 5940 > drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:58 bin > drwxr-xr-x. 2 nobody nogroup 4096 Jun 17 20:26 boot > drwxr-xr-x. 4 nobody nogroup 4096 Aug 12 00:08 dev > drwxr-xr-x. 42 nobody nogroup 4096 Sep 28 07:25 etc > drwxr-xr-x. 3 nobody nogroup 4096 Sep 28 07:25 home > drwxr-xr-x. 9 nobody nogroup 4096 Aug 12 00:58 lib > drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 media > drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 mnt > drwxr-xr-x. 3 nobody nogroup 4096 Aug 12 13:09 opt > dr-xr-xr-x. 143 nobody nogroup 0 Sep 30 23:02 proc > -rwxr-xr-x. 1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc > drwx------. 3 nobody nogroup 4096 Aug 12 12:54 root > drwxr-xr-x. 3 nobody nogroup 4096 Aug 12 00:08 run > drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:58 sbin > drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 srv > drwxr-xr-x. 2 nobody nogroup 4096 Apr 6 2015 sys > drwxrwxrwt. 2 nobody nogroup 4096 Sep 28 10:31 tmp > drwxr-xr-x. 10 nobody nogroup 4096 Aug 12 00:08 usr > drwxr-xr-x. 11 nobody nogroup 4096 Aug 12 00:08 var > > If you want to use the qemu binary provided by your distro, you can use > > --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF" > > With the 'F' flag, qemu-ppc-static will be then loaded from the main root > filesystem before switching to the chroot. > > Laurent Vivier (1): > ns: add binfmt_misc to the user namespace > > fs/binfmt_misc.c | 111 ++++++++++++++++++++++++--------- > include/linux/user_namespace.h | 15 +++++ > kernel/user.c | 14 +++++ > kernel/user_namespace.c | 3 + > 4 files changed, 115 insertions(+), 28 deletions(-) > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-10-16 9:52 ` [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier @ 2018-11-01 2:59 ` James Bottomley 2018-11-01 3:51 ` Jann Horn 0 siblings, 1 reply; 17+ messages in thread From: James Bottomley @ 2018-11-01 2:59 UTC (permalink / raw) To: Laurent Vivier, linux-kernel Cc: Jann Horn, linux-api, containers, Dmitry Safonov, Alexander Viro, linux-fsdevel, Eric Biederman On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: > Hi, > > Any comment on this last version? > > Any chance to be merged? I've got a use case for this: I went to one of the Graphene talks in Edinburgh and it struck me that we seem to keep reinventing the type of sandboxing that qemu-user already does. However if you want to do an x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism because that has you running *every* binary on the system emulated. Doing it per user namespace fixes this problem and allows us to at least cut down on all the pointless duplication. James ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-01 2:59 ` James Bottomley @ 2018-11-01 3:51 ` Jann Horn 2018-11-01 12:28 ` Laurent Vivier 2018-11-01 14:10 ` James Bottomley 0 siblings, 2 replies; 17+ messages in thread From: Jann Horn @ 2018-11-01 3:51 UTC (permalink / raw) To: James Bottomley Cc: Laurent Vivier, kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel, Eric W. Biederman On Thu, Nov 1, 2018 at 3:59 AM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: > > Hi, > > > > Any comment on this last version? > > > > Any chance to be merged? > > I've got a use case for this: I went to one of the Graphene talks in > Edinburgh and it struck me that we seem to keep reinventing the type of > sandboxing that qemu-user already does. However if you want to do an > x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism > because that has you running *every* binary on the system emulated. > Doing it per user namespace fixes this problem and allows us to at > least cut down on all the pointless duplication. Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes your code slower and *LESS* secure. As far as I know, qemu-user is only intended for purposes like development and testing. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-01 3:51 ` Jann Horn @ 2018-11-01 12:28 ` Laurent Vivier 2018-11-01 14:16 ` Eric W. Biederman 2018-11-01 14:10 ` James Bottomley 1 sibling, 1 reply; 17+ messages in thread From: Laurent Vivier @ 2018-11-01 12:28 UTC (permalink / raw) To: Jann Horn, James Bottomley Cc: kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel, Eric W. Biederman On 01/11/2018 04:51, Jann Horn wrote: > On Thu, Nov 1, 2018 at 3:59 AM James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: >> >> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: >>> Hi, >>> >>> Any comment on this last version? >>> >>> Any chance to be merged? >> >> I've got a use case for this: I went to one of the Graphene talks in >> Edinburgh and it struck me that we seem to keep reinventing the type of >> sandboxing that qemu-user already does. However if you want to do an >> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism >> because that has you running *every* binary on the system emulated. >> Doing it per user namespace fixes this problem and allows us to at >> least cut down on all the pointless duplication. > > Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes > your code slower and *LESS* secure. As far as I know, qemu-user is > only intended for purposes like development and testing. > I think the idea here is not to run qemu, but to use an interpreter (something like gVisor) into a container to control the binaries execution inside the container without using this interpreter on the host itself (container and host shares the same binfmt_misc magic/mask). Thanks, Laurent ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-01 12:28 ` Laurent Vivier @ 2018-11-01 14:16 ` Eric W. Biederman 2018-11-29 13:05 ` Laurent Vivier 0 siblings, 1 reply; 17+ messages in thread From: Eric W. Biederman @ 2018-11-01 14:16 UTC (permalink / raw) To: Laurent Vivier Cc: Jann Horn, James Bottomley, kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel Laurent Vivier <laurent@vivier.eu> writes: > On 01/11/2018 04:51, Jann Horn wrote: >> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley >> <James.Bottomley@hansenpartnership.com> wrote: >>> >>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: >>>> Hi, >>>> >>>> Any comment on this last version? >>>> >>>> Any chance to be merged? >>> >>> I've got a use case for this: I went to one of the Graphene talks in >>> Edinburgh and it struck me that we seem to keep reinventing the type of >>> sandboxing that qemu-user already does. However if you want to do an >>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism >>> because that has you running *every* binary on the system emulated. >>> Doing it per user namespace fixes this problem and allows us to at >>> least cut down on all the pointless duplication. >> >> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes >> your code slower and *LESS* secure. As far as I know, qemu-user is >> only intended for purposes like development and testing. >> > > I think the idea here is not to run qemu, but to use an interpreter > (something like gVisor) into a container to control the binaries > execution inside the container without using this interpreter on the > host itself (container and host shares the same binfmt_misc > magic/mask). Please remind me of this patchset after the merge window is over, and if there are no issues I will take it via my user namespace branch. Last I looked I had a concern that some of the permission check issues were being papered over by using override cred instead of fixing the deaper code. Sometimes they are necessary but seeing work-arounds instead of fixes for problems tends to be a maintenance issue, possibly with security consequences. Best is if the everyone agrees on how all of the interfaces work so their are no surprises. Eric ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-01 14:16 ` Eric W. Biederman @ 2018-11-29 13:05 ` Laurent Vivier 2018-12-29 15:41 ` Laurent Vivier 0 siblings, 1 reply; 17+ messages in thread From: Laurent Vivier @ 2018-11-29 13:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Jann Horn, James Bottomley, kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel, Andrew Morton Le 01/11/2018 à 15:16, Eric W. Biederman a écrit : > Laurent Vivier <laurent@vivier.eu> writes: > >> On 01/11/2018 04:51, Jann Horn wrote: >>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley >>> <James.Bottomley@hansenpartnership.com> wrote: >>>> >>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: >>>>> Hi, >>>>> >>>>> Any comment on this last version? >>>>> >>>>> Any chance to be merged? >>>> >>>> I've got a use case for this: I went to one of the Graphene talks in >>>> Edinburgh and it struck me that we seem to keep reinventing the type of >>>> sandboxing that qemu-user already does. However if you want to do an >>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism >>>> because that has you running *every* binary on the system emulated. >>>> Doing it per user namespace fixes this problem and allows us to at >>>> least cut down on all the pointless duplication. >>> >>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes >>> your code slower and *LESS* secure. As far as I know, qemu-user is >>> only intended for purposes like development and testing. >>> >> >> I think the idea here is not to run qemu, but to use an interpreter >> (something like gVisor) into a container to control the binaries >> execution inside the container without using this interpreter on the >> host itself (container and host shares the same binfmt_misc >> magic/mask). > > Please remind me of this patchset after the merge window is over, and if > there are no issues I will take it via my user namespace branch. > > Last I looked I had a concern that some of the permission check issues > were being papered over by using override cred instead of fixing the > deaper code. Sometimes they are necessary but seeing work-arounds > instead of fixes for problems tends to be a maintenance issue, possibly > with security consequences. Best is if the everyone agrees on how all > of the interfaces work so their are no surprises. I don't know where we are in the merge window, but is there something I can do to have this merged? Thanks, Laurent ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-29 13:05 ` Laurent Vivier @ 2018-12-29 15:41 ` Laurent Vivier 0 siblings, 0 replies; 17+ messages in thread From: Laurent Vivier @ 2018-12-29 15:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Jann Horn, James Bottomley, kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel, Andrew Morton Ping Thanks, Laurent Le 29/11/2018 à 14:05, Laurent Vivier a écrit : > Le 01/11/2018 à 15:16, Eric W. Biederman a écrit : >> Laurent Vivier <laurent@vivier.eu> writes: >> >>> On 01/11/2018 04:51, Jann Horn wrote: >>>> On Thu, Nov 1, 2018 at 3:59 AM James Bottomley >>>> <James.Bottomley@hansenpartnership.com> wrote: >>>>> >>>>> On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: >>>>>> Hi, >>>>>> >>>>>> Any comment on this last version? >>>>>> >>>>>> Any chance to be merged? >>>>> >>>>> I've got a use case for this: I went to one of the Graphene talks in >>>>> Edinburgh and it struck me that we seem to keep reinventing the type of >>>>> sandboxing that qemu-user already does. However if you want to do an >>>>> x86 on x86 sandbox, you can't currently use the binfmt_misc mechanism >>>>> because that has you running *every* binary on the system emulated. >>>>> Doing it per user namespace fixes this problem and allows us to at >>>>> least cut down on all the pointless duplication. >>>> >>>> Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes >>>> your code slower and *LESS* secure. As far as I know, qemu-user is >>>> only intended for purposes like development and testing. >>>> >>> >>> I think the idea here is not to run qemu, but to use an interpreter >>> (something like gVisor) into a container to control the binaries >>> execution inside the container without using this interpreter on the >>> host itself (container and host shares the same binfmt_misc >>> magic/mask). >> >> Please remind me of this patchset after the merge window is over, and if >> there are no issues I will take it via my user namespace branch. >> >> Last I looked I had a concern that some of the permission check issues >> were being papered over by using override cred instead of fixing the >> deaper code. Sometimes they are necessary but seeing work-arounds >> instead of fixes for problems tends to be a maintenance issue, possibly >> with security consequences. Best is if the everyone agrees on how all >> of the interfaces work so their are no surprises. > > I don't know where we are in the merge window, but is there something I > can do to have this merged? > > Thanks, > Laurent > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-01 3:51 ` Jann Horn 2018-11-01 12:28 ` Laurent Vivier @ 2018-11-01 14:10 ` James Bottomley 2018-11-01 14:44 ` Jann Horn 1 sibling, 1 reply; 17+ messages in thread From: James Bottomley @ 2018-11-01 14:10 UTC (permalink / raw) To: Jann Horn Cc: Laurent Vivier, kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel, Eric W. Biederman On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote: > On Thu, Nov 1, 2018 at 3:59 AM James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: > > > Hi, > > > > > > Any comment on this last version? > > > > > > Any chance to be merged? > > > > I've got a use case for this: I went to one of the Graphene talks > > in Edinburgh and it struck me that we seem to keep reinventing the > > type of sandboxing that qemu-user already does. However if you > > want to do an x86 on x86 sandbox, you can't currently use the > > binfmt_misc mechanism because that has you running *every* binary > > on the system emulated. Doing it per user namespace fixes this > > problem and allows us to at least cut down on all the pointless > > duplication. > > Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes > your code slower and *LESS* secure. As far as I know, qemu-user is > only intended for purposes like development and testing. Sandboxing is about protecting the cloud service provider (and other tenants) from horizontal attack by reducing calls to the shared kernel. I think it's pretty indisputable that full emulation is an effective sandbox in that regard. We can argue for about bugginess vs completeness, but technologically qemu-user already has most of the system calls, which seems to be a significant problem with other sandboxes. I also can't dispute it's slower, but that's a tradeoff for people to make. James ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v6 0/1] ns: introduce binfmt_misc namespace 2018-11-01 14:10 ` James Bottomley @ 2018-11-01 14:44 ` Jann Horn 0 siblings, 0 replies; 17+ messages in thread From: Jann Horn @ 2018-11-01 14:44 UTC (permalink / raw) To: James Bottomley Cc: Laurent Vivier, kernel list, Linux API, containers, dima, Al Viro, linux-fsdevel, Eric W. Biederman On Thu, Nov 1, 2018 at 3:10 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > On Thu, 2018-11-01 at 04:51 +0100, Jann Horn wrote: > > On Thu, Nov 1, 2018 at 3:59 AM James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > On Tue, 2018-10-16 at 11:52 +0200, Laurent Vivier wrote: > > > > Hi, > > > > > > > > Any comment on this last version? > > > > > > > > Any chance to be merged? > > > > > > I've got a use case for this: I went to one of the Graphene talks > > > in Edinburgh and it struck me that we seem to keep reinventing the > > > type of sandboxing that qemu-user already does. However if you > > > want to do an x86 on x86 sandbox, you can't currently use the > > > binfmt_misc mechanism because that has you running *every* binary > > > on the system emulated. Doing it per user namespace fixes this > > > problem and allows us to at least cut down on all the pointless > > > duplication. > > > > Waaaaaait. What? qemu-user does not do "sandboxing". qemu-user makes > > your code slower and *LESS* secure. As far as I know, qemu-user is > > only intended for purposes like development and testing. > > Sandboxing is about protecting the cloud service provider (and other > tenants) from horizontal attack by reducing calls to the shared kernel. > I think it's pretty indisputable that full emulation is an effective > sandbox in that regard. > > We can argue for about bugginess vs completeness, but technologically > qemu-user already has most of the system calls, which seems to be a > significant problem with other sandboxes. I also can't dispute it's > slower, but that's a tradeoff for people to make. I'm pretty sure you don't understand how qemu-user works. When the emulated code makes a syscall, QEMU just forwards the syscall to the native kernel. QEMU doesn't even prevent you from accessing the address space used by the emulation logic. qemu-user is not for sandboxing. qemu-user is not for security. qemu-user is for running binaries from architecture A on architecture B, with as much direct access to the kernel's syscall surface as possible. An example: $ cat blah.c #include <fcntl.h> #include <unistd.h> #include <stdio.h> int main(void) { open("/foo/bar/blah", O_RDONLY); char c; printf("ptr is %p\n", &c); read(1337, &c, 1); *(volatile char *)0x13371338; } $ aarch64-linux-gnu-gcc -static -o blah blah.c && strace -f qemu-aarch64 ./blah [...] [pid 14181] openat(AT_FDCWD, "/foo/bar/blah", O_RDONLY) = -1 ENOENT (No such file or directory) [pid 14181] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 93), ...}) = 0 [pid 14181] write(1, "ptr is 0x40007fff2f\n", 20ptr is 0x40007fff2f ) = 20 [pid 14181] read(1337, 0x40007fff2f, 1) = -1 EBADF (Bad file descriptor) [pid 14181] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x13371338} --- [...] ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2018-12-29 15:42 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-10-10 16:14 [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 2018-10-10 16:14 ` [PATCH v6 1/1] ns: add binfmt_misc to the user namespace Laurent Vivier 2018-10-16 10:13 ` Rasmus Villemoes 2018-10-16 10:53 ` Laurent Vivier 2018-10-16 16:22 ` Andrei Vagin 2018-10-24 17:15 ` Laurent Vivier 2018-10-30 8:51 ` Laurent Vivier 2018-11-06 19:40 ` Jann Horn 2018-10-16 9:52 ` [PATCH v6 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 2018-11-01 2:59 ` James Bottomley 2018-11-01 3:51 ` Jann Horn 2018-11-01 12:28 ` Laurent Vivier 2018-11-01 14:16 ` Eric W. Biederman 2018-11-29 13:05 ` Laurent Vivier 2018-12-29 15:41 ` Laurent Vivier 2018-11-01 14:10 ` James Bottomley 2018-11-01 14:44 ` Jann Horn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).