* [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace @ 2018-10-02 10:20 Laurent Vivier 2018-10-02 10:20 ` [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace Laurent Vivier 2018-10-02 16:13 ` [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace James Bottomley 0 siblings, 2 replies; 7+ messages in thread From: Laurent Vivier @ 2018-10-02 10:20 UTC (permalink / raw) To: linux-kernel Cc: Dmitry Safonov, Andrei Vagin, Eric Biederman, Alexander Viro, James Bottomley, containers, linux-fsdevel, linux-api, Laurent Vivier v2: no new namespace, binfmt_misc data are now part of the mount namespace I put this in mount namespace instead of user namespace because the mount namespace is already needed and I don't want to force to have the user namespace for that. As this is a filesystem, it seems logic to have it here. This allows to define a new interpreter for each new container. But the main goal is to be able to chroot to a directory using a binfmt_misc interpreter without being root. I have a modified version of unshare at: git@github.com:vivier/util-linux.git branch unshare-chroot with some new options to unshare binfmt_misc namespace and to chroot to a directory. If you have a directory /chroot/powerpc/jessie containing debian for powerpc binaries and a qemu-ppc interpreter, you can do for instance: $ uname -a Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux $ ./unshare --map-root-user --fork --pid \ --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \ --root=/chroot/powerpc/jessie /bin/bash -l # uname -a Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux # id uid=0(root) gid=0(root) groups=0(root),65534(nogroup) # ls -l total 5940 drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:58 bin drwxr-xr-x. 2 nobody nogroup 4096 Jun 17 20:26 boot drwxr-xr-x. 4 nobody nogroup 4096 Aug 12 00:08 dev drwxr-xr-x. 42 nobody nogroup 4096 Sep 28 07:25 etc drwxr-xr-x. 3 nobody nogroup 4096 Sep 28 07:25 home drwxr-xr-x. 9 nobody nogroup 4096 Aug 12 00:58 lib drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 media drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 mnt drwxr-xr-x. 3 nobody nogroup 4096 Aug 12 13:09 opt dr-xr-xr-x. 143 nobody nogroup 0 Sep 30 23:02 proc -rwxr-xr-x. 1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc drwx------. 3 nobody nogroup 4096 Aug 12 12:54 root drwxr-xr-x. 3 nobody nogroup 4096 Aug 12 00:08 run drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:58 sbin drwxr-xr-x. 2 nobody nogroup 4096 Aug 12 00:08 srv drwxr-xr-x. 2 nobody nogroup 4096 Apr 6 2015 sys drwxrwxrwt. 2 nobody nogroup 4096 Sep 28 10:31 tmp drwxr-xr-x. 10 nobody nogroup 4096 Aug 12 00:08 usr drwxr-xr-x. 11 nobody nogroup 4096 Aug 12 00:08 var If you want to use the qemu binary provided by your distro, you can use --load-interp ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF" With the 'F' flag, qemu-ppc-static will be then loaded from the main root filesystem before switching to the chroot. Laurent Vivier (1): ns: add binfmt_misc to the mount namespace fs/binfmt_misc.c | 50 +++++++++++++++++++++++++----------------------- fs/mount.h | 8 ++++++++ fs/namespace.c | 6 ++++++ 3 files changed, 40 insertions(+), 24 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 7+ messages in thread
* [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace 2018-10-02 10:20 [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace Laurent Vivier @ 2018-10-02 10:20 ` Laurent Vivier 2018-10-03 6:07 ` Eric W. Biederman 2018-10-02 16:13 ` [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace James Bottomley 1 sibling, 1 reply; 7+ messages in thread From: Laurent Vivier @ 2018-10-02 10:20 UTC (permalink / raw) To: linux-kernel Cc: Dmitry Safonov, Andrei Vagin, Eric Biederman, Alexander Viro, James Bottomley, containers, linux-fsdevel, linux-api, Laurent Vivier This patch allows to have a different binftm_misc configuration in each container we mount binfmt_misc filesystem with mount namespace enabled. A container started without the CLONE_NEWNS will use the host binfmt_misc configuration, otherwise the container starts with an empty binfmt_misc interpreters list. For instance, using "unshare" we can start a chroot of an another architecture and configure the binfmt_misc interpreted without being root to run the binaries in this chroot. Signed-off-by: Laurent Vivier <laurent@vivier.eu> --- fs/binfmt_misc.c | 50 +++++++++++++++++++++++++----------------------- fs/mount.h | 8 ++++++++ fs/namespace.c | 6 ++++++ 3 files changed, 40 insertions(+), 24 deletions(-) diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c index aa4a7a23ff99..ecb14776c759 100644 --- a/fs/binfmt_misc.c +++ b/fs/binfmt_misc.c @@ -25,6 +25,7 @@ #include <linux/syscalls.h> #include <linux/fs.h> #include <linux/uaccess.h> +#include <mount.h> #include "internal.h" @@ -38,9 +39,6 @@ enum { VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */ }; -static LIST_HEAD(entries); -static int enabled = 1; - enum {Enabled, Magic}; #define MISC_FMT_PRESERVE_ARGV0 (1 << 31) #define MISC_FMT_OPEN_BINARY (1 << 30) @@ -60,10 +58,7 @@ typedef struct { struct file *interp_file; } Node; -static DEFINE_RWLOCK(entries_lock); static struct file_system_type bm_fs_type; -static struct vfsmount *bm_mnt; -static int entry_count; /* * Max length of the register string. Determined by: @@ -91,7 +86,7 @@ static Node *check_file(struct linux_binprm *bprm) struct list_head *l; /* Walk all the registered handlers. */ - list_for_each(l, &entries) { + list_for_each(l, &binfmt_ns(entries)) { Node *e = list_entry(l, Node, list); char *s; int j; @@ -135,15 +130,15 @@ static int load_misc_binary(struct linux_binprm *bprm) int fd_binary = -1; retval = -ENOEXEC; - if (!enabled) + if (!binfmt_ns(enabled)) return retval; /* to keep locking time low, we copy the interpreter string */ - read_lock(&entries_lock); + read_lock(&binfmt_ns(entries_lock)); fmt = check_file(bprm); if (fmt) dget(fmt->dentry); - read_unlock(&entries_lock); + read_unlock(&binfmt_ns(entries_lock)); if (!fmt) return retval; @@ -613,15 +608,15 @@ static void kill_node(Node *e) { struct dentry *dentry; - write_lock(&entries_lock); + write_lock(&binfmt_ns(entries_lock)); list_del_init(&e->list); - write_unlock(&entries_lock); + write_unlock(&binfmt_ns(entries_lock)); dentry = e->dentry; drop_nlink(d_inode(dentry)); d_drop(dentry); dput(dentry); - simple_release_fs(&bm_mnt, &entry_count); + simple_release_fs(&binfmt_ns(bm_mnt), &binfmt_ns(entry_count)); } /* /<entry> */ @@ -716,7 +711,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, if (!inode) goto out2; - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); + err = simple_pin_fs(&bm_fs_type, &binfmt_ns(bm_mnt), + &binfmt_ns(entry_count)); if (err) { iput(inode); inode = NULL; @@ -730,7 +726,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, if (IS_ERR(f)) { err = PTR_ERR(f); pr_notice("register: failed to install interpreter file %s\n", e->interpreter); - simple_release_fs(&bm_mnt, &entry_count); + simple_release_fs(&binfmt_ns(bm_mnt), + &binfmt_ns(entry_count)); iput(inode); inode = NULL; goto out2; @@ -743,9 +740,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, inode->i_fop = &bm_entry_operations; d_instantiate(dentry, inode); - write_lock(&entries_lock); - list_add(&e->list, &entries); - write_unlock(&entries_lock); + write_lock(&binfmt_ns(entries_lock)); + list_add(&e->list, &binfmt_ns(entries)); + write_unlock(&binfmt_ns(entries_lock)); err = 0; out2: @@ -770,7 +767,7 @@ static const struct file_operations bm_register_operations = { static ssize_t bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) { - char *s = enabled ? "enabled\n" : "disabled\n"; + char *s = binfmt_ns(enabled) ? "enabled\n" : "disabled\n"; return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s)); } @@ -784,19 +781,20 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer, switch (res) { case 1: /* Disable all handlers. */ - enabled = 0; + binfmt_ns(enabled) = 0; break; case 2: /* Enable all handlers. */ - enabled = 1; + binfmt_ns(enabled) = 1; break; case 3: /* Delete all handlers. */ root = file_inode(file)->i_sb->s_root; inode_lock(d_inode(root)); - while (!list_empty(&entries)) - kill_node(list_first_entry(&entries, Node, list)); + while (!list_empty(&binfmt_ns(entries))) + kill_node(list_first_entry(&binfmt_ns(entries), + Node, list)); inode_unlock(d_inode(root)); break; @@ -838,7 +836,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent) static struct dentry *bm_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - return mount_single(fs_type, flags, data, bm_fill_super); + struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns; + + return mount_ns(fs_type, flags, data, mnt_ns, mnt_ns->user_ns, + bm_fill_super); } static struct linux_binfmt misc_format = { @@ -849,6 +850,7 @@ static struct linux_binfmt misc_format = { static struct file_system_type bm_fs_type = { .owner = THIS_MODULE, .name = "binfmt_misc", + .fs_flags = FS_USERNS_MOUNT, .mount = bm_mount, .kill_sb = kill_litter_super, }; diff --git a/fs/mount.h b/fs/mount.h index f39bc9da4d73..f03b35141440 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -17,6 +17,12 @@ struct mnt_namespace { u64 event; unsigned int mounts; /* # of mounts in the namespace */ unsigned int pending_mounts; + /* binfmt misc */ + struct list_head entries; + rwlock_t entries_lock; + int enabled; + struct vfsmount *bm_mnt; + int entry_count; } __randomize_layout; struct mnt_pcp { @@ -72,6 +78,8 @@ struct mount { struct dentry *mnt_ex_mountpoint; } __randomize_layout; +#define binfmt_ns(a) (current->nsproxy->mnt_ns->a) + #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */ static inline struct mount *real_mount(struct vfsmount *mnt) diff --git a/fs/namespace.c b/fs/namespace.c index 99186556f8d3..f92b8371228d 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2850,6 +2850,12 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) new_ns->ucounts = ucounts; new_ns->mounts = 0; new_ns->pending_mounts = 0; + /* binfmt_misc */ + INIT_LIST_HEAD(&new_ns->entries); + new_ns->enabled = 1; + rwlock_init(&new_ns->entries_lock); + new_ns->bm_mnt = NULL; + new_ns->entry_count = 0; return new_ns; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace 2018-10-02 10:20 ` [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace Laurent Vivier @ 2018-10-03 6:07 ` Eric W. Biederman 2018-10-03 19:26 ` Jann Horn 0 siblings, 1 reply; 7+ messages in thread From: Eric W. Biederman @ 2018-10-03 6:07 UTC (permalink / raw) To: Laurent Vivier Cc: linux-kernel, Dmitry Safonov, Andrei Vagin, Alexander Viro, James Bottomley, containers, linux-fsdevel, linux-api Laurent Vivier <laurent@vivier.eu> writes: > This patch allows to have a different binftm_misc configuration > in each container we mount binfmt_misc filesystem with mount namespace > enabled. > > A container started without the CLONE_NEWNS will use the host binfmt_misc > configuration, otherwise the container starts with an empty binfmt_misc > interpreters list. > > For instance, using "unshare" we can start a chroot of an another > architecture and configure the binfmt_misc interpreted without being root > to run the binaries in this chroot. A couple of things. As has already been mentioned on your previous version anything that comes through the filesystem interface needs to lookup up the local binfmt context not through current but through file->f_dentry->d_sb. AKA the superblock of the mounted filesystem. As you have this coded any time a mount namespace is unshared you get a new binfmt context. That has a very reasonable chance of breaking existing userspace. A mount of binfmt_misc today from within a user namespace is not allowed which is why I have figured that will be a nice place to trigger creating a new binfmt context. It is fundamentally necessary to be able to get a pointer to the binfmt context from current. Either stored in an existing namespace or stored in nsproxy. Anything else will risk breaking backwards compatibility with existing user space for no good reason. What is fundamentally being changed is the behavior of exec. Changing the behavior of exec needs to be carefully contained or we risk confusing privileged applications. I believe your last email to James Bottomley detailed a very strong use case for this functionality. As the key gains over the existing kernel is unprivileged use. As it is the behavior of exec that is changing. You definitely need a user namespace involved. So I think the simplest would be to hang the binfmt context off of a user namespace. But I am open to other ideas. My primary concern is that we keep the cognitive and the maintenance burden as small as is reasonably possible so that the costs don't out weigh the benefit. Eric > Signed-off-by: Laurent Vivier <laurent@vivier.eu> > --- > fs/binfmt_misc.c | 50 +++++++++++++++++++++++++----------------------- > fs/mount.h | 8 ++++++++ > fs/namespace.c | 6 ++++++ > 3 files changed, 40 insertions(+), 24 deletions(-) > > diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c > index aa4a7a23ff99..ecb14776c759 100644 > --- a/fs/binfmt_misc.c > +++ b/fs/binfmt_misc.c > @@ -25,6 +25,7 @@ > #include <linux/syscalls.h> > #include <linux/fs.h> > #include <linux/uaccess.h> > +#include <mount.h> > > #include "internal.h" > > @@ -38,9 +39,6 @@ enum { > VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */ > }; > > -static LIST_HEAD(entries); > -static int enabled = 1; > - > enum {Enabled, Magic}; > #define MISC_FMT_PRESERVE_ARGV0 (1 << 31) > #define MISC_FMT_OPEN_BINARY (1 << 30) > @@ -60,10 +58,7 @@ typedef struct { > struct file *interp_file; > } Node; > > -static DEFINE_RWLOCK(entries_lock); > static struct file_system_type bm_fs_type; > -static struct vfsmount *bm_mnt; > -static int entry_count; > > /* > * Max length of the register string. Determined by: > @@ -91,7 +86,7 @@ static Node *check_file(struct linux_binprm *bprm) > struct list_head *l; > > /* Walk all the registered handlers. */ > - list_for_each(l, &entries) { > + list_for_each(l, &binfmt_ns(entries)) { > Node *e = list_entry(l, Node, list); > char *s; > int j; > @@ -135,15 +130,15 @@ static int load_misc_binary(struct linux_binprm *bprm) > int fd_binary = -1; > > retval = -ENOEXEC; > - if (!enabled) > + if (!binfmt_ns(enabled)) > return retval; > > /* to keep locking time low, we copy the interpreter string */ > - read_lock(&entries_lock); > + read_lock(&binfmt_ns(entries_lock)); > fmt = check_file(bprm); > if (fmt) > dget(fmt->dentry); > - read_unlock(&entries_lock); > + read_unlock(&binfmt_ns(entries_lock)); > if (!fmt) > return retval; > > @@ -613,15 +608,15 @@ static void kill_node(Node *e) > { > struct dentry *dentry; > > - write_lock(&entries_lock); > + write_lock(&binfmt_ns(entries_lock)); > list_del_init(&e->list); > - write_unlock(&entries_lock); > + write_unlock(&binfmt_ns(entries_lock)); > > dentry = e->dentry; > drop_nlink(d_inode(dentry)); > d_drop(dentry); > dput(dentry); > - simple_release_fs(&bm_mnt, &entry_count); > + simple_release_fs(&binfmt_ns(bm_mnt), &binfmt_ns(entry_count)); > } > > /* /<entry> */ > @@ -716,7 +711,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > if (!inode) > goto out2; > > - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); > + err = simple_pin_fs(&bm_fs_type, &binfmt_ns(bm_mnt), > + &binfmt_ns(entry_count)); > if (err) { > iput(inode); > inode = NULL; > @@ -730,7 +726,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > if (IS_ERR(f)) { > err = PTR_ERR(f); > pr_notice("register: failed to install interpreter file %s\n", e->interpreter); > - simple_release_fs(&bm_mnt, &entry_count); > + simple_release_fs(&binfmt_ns(bm_mnt), > + &binfmt_ns(entry_count)); > iput(inode); > inode = NULL; > goto out2; > @@ -743,9 +740,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > inode->i_fop = &bm_entry_operations; > > d_instantiate(dentry, inode); > - write_lock(&entries_lock); > - list_add(&e->list, &entries); > - write_unlock(&entries_lock); > + write_lock(&binfmt_ns(entries_lock)); > + list_add(&e->list, &binfmt_ns(entries)); > + write_unlock(&binfmt_ns(entries_lock)); > > err = 0; > out2: > @@ -770,7 +767,7 @@ static const struct file_operations bm_register_operations = { > static ssize_t > bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) > { > - char *s = enabled ? "enabled\n" : "disabled\n"; > + char *s = binfmt_ns(enabled) ? "enabled\n" : "disabled\n"; > > return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s)); > } > @@ -784,19 +781,20 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer, > switch (res) { > case 1: > /* Disable all handlers. */ > - enabled = 0; > + binfmt_ns(enabled) = 0; > break; > case 2: > /* Enable all handlers. */ > - enabled = 1; > + binfmt_ns(enabled) = 1; > break; > case 3: > /* Delete all handlers. */ > root = file_inode(file)->i_sb->s_root; > inode_lock(d_inode(root)); > > - while (!list_empty(&entries)) > - kill_node(list_first_entry(&entries, Node, list)); > + while (!list_empty(&binfmt_ns(entries))) > + kill_node(list_first_entry(&binfmt_ns(entries), > + Node, list)); > > inode_unlock(d_inode(root)); > break; > @@ -838,7 +836,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent) > static struct dentry *bm_mount(struct file_system_type *fs_type, > int flags, const char *dev_name, void *data) > { > - return mount_single(fs_type, flags, data, bm_fill_super); > + struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns; > + > + return mount_ns(fs_type, flags, data, mnt_ns, mnt_ns->user_ns, > + bm_fill_super); > } > > static struct linux_binfmt misc_format = { > @@ -849,6 +850,7 @@ static struct linux_binfmt misc_format = { > static struct file_system_type bm_fs_type = { > .owner = THIS_MODULE, > .name = "binfmt_misc", > + .fs_flags = FS_USERNS_MOUNT, > .mount = bm_mount, > .kill_sb = kill_litter_super, > }; > diff --git a/fs/mount.h b/fs/mount.h > index f39bc9da4d73..f03b35141440 100644 > --- a/fs/mount.h > +++ b/fs/mount.h > @@ -17,6 +17,12 @@ struct mnt_namespace { > u64 event; > unsigned int mounts; /* # of mounts in the namespace */ > unsigned int pending_mounts; > + /* binfmt misc */ > + struct list_head entries; > + rwlock_t entries_lock; > + int enabled; > + struct vfsmount *bm_mnt; > + int entry_count; > } __randomize_layout; > > struct mnt_pcp { > @@ -72,6 +78,8 @@ struct mount { > struct dentry *mnt_ex_mountpoint; > } __randomize_layout; > > +#define binfmt_ns(a) (current->nsproxy->mnt_ns->a) > + > #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */ > > static inline struct mount *real_mount(struct vfsmount *mnt) > diff --git a/fs/namespace.c b/fs/namespace.c > index 99186556f8d3..f92b8371228d 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2850,6 +2850,12 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) > new_ns->ucounts = ucounts; > new_ns->mounts = 0; > new_ns->pending_mounts = 0; > + /* binfmt_misc */ > + INIT_LIST_HEAD(&new_ns->entries); > + new_ns->enabled = 1; > + rwlock_init(&new_ns->entries_lock); > + new_ns->bm_mnt = NULL; > + new_ns->entry_count = 0; > return new_ns; > } ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace 2018-10-03 6:07 ` Eric W. Biederman @ 2018-10-03 19:26 ` Jann Horn 0 siblings, 0 replies; 7+ messages in thread From: Jann Horn @ 2018-10-03 19:26 UTC (permalink / raw) To: Eric W. Biederman Cc: Laurent Vivier, kernel list, dima, Andrei Vagin, Al Viro, James Bottomley, containers, linux-fsdevel, Linux API On Wed, Oct 3, 2018 at 8:07 AM Eric W. Biederman <ebiederm@xmission.com> wrote: > Laurent Vivier <laurent@vivier.eu> writes: > > This patch allows to have a different binftm_misc configuration > > in each container we mount binfmt_misc filesystem with mount namespace > > enabled. > > > > A container started without the CLONE_NEWNS will use the host binfmt_misc > > configuration, otherwise the container starts with an empty binfmt_misc > > interpreters list. > > > > For instance, using "unshare" we can start a chroot of an another > > architecture and configure the binfmt_misc interpreted without being root > > to run the binaries in this chroot. > > A couple of things. > As has already been mentioned on your previous version anything that > comes through the filesystem interface needs to lookup up the local > binfmt context not through current but through file->f_dentry->d_sb. > AKA the superblock of the mounted filesystem. Something else: bm_register_write() currently calls into open_exec(), which uses the credentials of current. That's not really allowed in this context - but so far, it's not a big deal because only init-namespace root can reach this code. Before you expose this stuff to unprivileged userspace, this needs to get fixed; perhaps by wrapping the open_exec() call in override_creds(file->f_cred) and revert_creds(). ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace 2018-10-02 10:20 [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 2018-10-02 10:20 ` [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace Laurent Vivier @ 2018-10-02 16:13 ` James Bottomley 2018-10-02 16:47 ` Laurent Vivier 1 sibling, 1 reply; 7+ messages in thread From: James Bottomley @ 2018-10-02 16:13 UTC (permalink / raw) To: Laurent Vivier, linux-kernel Cc: Andrei Vagin, Dmitry Safonov, linux-api, containers, Eric Biederman, linux-fsdevel, Alexander Viro On Tue, 2018-10-02 at 12:20 +0200, Laurent Vivier wrote: > v2: no new namespace, binfmt_misc data are now part of > the mount namespace > I put this in mount namespace instead of user namespace > because the mount namespace is already needed and > I don't want to force to have the user namespace for that. > As this is a filesystem, it seems logic to have it here. > > This allows to define a new interpreter for each new container. > > But the main goal is to be able to chroot to a directory > using a binfmt_misc interpreter without being root. Reading all this, I don't quite understand why this works for me and not for you (I think I get from your explanation that it doesn't work for you, but I might have missed something): jejb@jarvis:~> uname -m x86_64 jejb@jarvis:~> unshare -r -m root@jarvis:~# chroot /home/jejb/containers/aarch64 jarvis:/ # uname -m aarch64 Of course to get that to work I have an 'F' entry in /etc/binfmt.d/qemu-aarch64.conf Which means I'm running the host emulator in the container, which is what I want to do. I think another goal of the patches might be to use different emulators for different aarch64 containers? Do you have a use case for this, because right at the moment for arch emulation containers I think a single host wide entry per static emulator is the right approach. James ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace 2018-10-02 16:13 ` [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace James Bottomley @ 2018-10-02 16:47 ` Laurent Vivier 2018-10-03 10:13 ` James Bottomley 0 siblings, 1 reply; 7+ messages in thread From: Laurent Vivier @ 2018-10-02 16:47 UTC (permalink / raw) To: James Bottomley, linux-kernel Cc: Andrei Vagin, Dmitry Safonov, linux-api, containers, Eric Biederman, linux-fsdevel, Alexander Viro Le 02/10/2018 à 18:13, James Bottomley a écrit : > On Tue, 2018-10-02 at 12:20 +0200, Laurent Vivier wrote: >> v2: no new namespace, binfmt_misc data are now part of >> the mount namespace >> I put this in mount namespace instead of user namespace >> because the mount namespace is already needed and >> I don't want to force to have the user namespace for that. >> As this is a filesystem, it seems logic to have it here. >> >> This allows to define a new interpreter for each new container. >> >> But the main goal is to be able to chroot to a directory >> using a binfmt_misc interpreter without being root. > > Reading all this, I don't quite understand why this works for me and > not for you (I think I get from your explanation that it doesn't work > for you, but I might have missed something): > > jejb@jarvis:~> uname -m > x86_64 > jejb@jarvis:~> unshare -r -m > root@jarvis:~# chroot /home/jejb/containers/aarch64 > jarvis:/ # uname -m > aarch64 > > Of course to get that to work I have an 'F' entry in > /etc/binfmt.d/qemu-aarch64.conf > I'd like to configure the interpreter without being root. As a simple user can run a VM and a full system inside, I'd like to be able to start a container/chroot without having to configure something at the host level. For instance, I'd like to provide to "someone" (with no admin rights) a tar file with inside an OS environment for a given target and the interpreter, and allow him to run the binaries inside just by running a simple command (like qemu-system-XXX -hda my.img) It's also interesting for a test purpose: I can test concurrently different interpreters for the same target without modifying the target root filesystem (with the 'F' flag but on a per directory basis) or the host configuration. Another case is we can't configure qemu-mips/qemu-mipsel (old kernel API) and qemu-mipsn32/qemu-mipsne32el (new kernel API) interpreters on the same system because they share the same ELF signature (to be honest qemu should have only one binary for the old and the new interface and dynamically change it according to the ELF binary that is loaded, as it is done for ARM). But if no one thinks it's useful, I don't want to push this more than that... Thanks, Laurent ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace 2018-10-02 16:47 ` Laurent Vivier @ 2018-10-03 10:13 ` James Bottomley 0 siblings, 0 replies; 7+ messages in thread From: James Bottomley @ 2018-10-03 10:13 UTC (permalink / raw) To: Laurent Vivier, linux-kernel Cc: Andrei Vagin, Dmitry Safonov, linux-api, containers, Eric Biederman, linux-fsdevel, Alexander Viro On Tue, 2018-10-02 at 18:47 +0200, Laurent Vivier wrote: > Le 02/10/2018 à 18:13, James Bottomley a écrit : > > On Tue, 2018-10-02 at 12:20 +0200, Laurent Vivier wrote: > > > v2: no new namespace, binfmt_misc data are now part of > > > the mount namespace > > > I put this in mount namespace instead of user namespace > > > because the mount namespace is already needed and > > > I don't want to force to have the user namespace for that. > > > As this is a filesystem, it seems logic to have it here. > > > > > > This allows to define a new interpreter for each new container. > > > > > > But the main goal is to be able to chroot to a directory > > > using a binfmt_misc interpreter without being root. > > > > Reading all this, I don't quite understand why this works for me > > and > > not for you (I think I get from your explanation that it doesn't > > work > > for you, but I might have missed something): > > > > jejb@jarvis:~> uname -m > > x86_64 > > jejb@jarvis:~> unshare -r -m > > root@jarvis:~# chroot /home/jejb/containers/aarch64 > > jarvis:/ # uname -m > > aarch64 > > > > Of course to get that to work I have an 'F' entry in > > /etc/binfmt.d/qemu-aarch64.conf > > > > I'd like to configure the interpreter without being root. > > As a simple user can run a VM and a full system inside, I'd like to > be > able to start a container/chroot without having to configure > something > at the host level. > > For instance, I'd like to provide to "someone" (with no admin rights) > a tar file with inside an OS environment for a given target and the > interpreter, and allow him to run the binaries inside just by running > a simple command (like qemu-system-XXX -hda my.img) OK, since trying to persuade the distros to add the 'F' flag has been challenging, I certainly buy this use case. There is a security risk to allowing an unprivileged user to supply an arbitrary interpreter (suid and sgid binaries), but as long as whatever's agreed requires root in the user namespace, I'm happy we have the security issue confined. James > It's also interesting for a test purpose: I can test concurrently > different interpreters for the same target without modifying the > target root filesystem (with the 'F' flag but on a per directory > basis) or the host configuration. > > Another case is we can't configure qemu-mips/qemu-mipsel (old kernel > API) and qemu-mipsn32/qemu-mipsne32el (new kernel API) interpreters > on the same system because they share the same ELF signature (to be > honest qemu should have only one binary for the old and the new > interface and dynamically change it according to the ELF binary that > is loaded, as it is done for ARM). > > But if no one thinks it's useful, I don't want to push this more than > that... > > Thanks, > Laurent > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2018-10-04 2:17 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-10-02 10:20 [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace Laurent Vivier 2018-10-02 10:20 ` [RFC v2 v2 1/1] ns: add binfmt_misc to the mount namespace Laurent Vivier 2018-10-03 6:07 ` Eric W. Biederman 2018-10-03 19:26 ` Jann Horn 2018-10-02 16:13 ` [RFC v2 v2 0/1] ns: introduce binfmt_misc namespace James Bottomley 2018-10-02 16:47 ` Laurent Vivier 2018-10-03 10:13 ` James Bottomley
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).