* [RFC 0/2] ns: introduce binfmt_misc namespace
From: Laurent Vivier @ 2018-09-30 23:46 UTC
To: linux-kernel
Cc: linux-fsdevel, James Bottomley, Alexander Viro, linux-api,
	Eric Biederman, Dmitry Safonov, Andrei Vagin, containers,
	Laurent Vivier

This series introduces a new namespace for binfmt_misc.

It allows a new interpreter to be defined for each new container,
but the main goal is to be able to chroot to a directory
using a binfmt_misc interpreter without being root.

I have a modified version of unshare at:

  git@github.com:vivier/util-linux.git branch unshare-chroot

with some new options to unshare the binfmt_misc namespace and to chroot
to a directory.

If you have a directory /chroot/powerpc/jessie containing Debian
powerpc binaries and a qemu-ppc interpreter, you can run, for instance:

 $ uname -a
 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 x86_64 x86_64 x86_64 GNU/Linux
 $ ./unshare --map-root-user --fork --pid \
   --load-binfmt ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/qemu-ppc:OC" \
   --root=/chroot/powerpc/jessie /bin/bash -l

 Linux fedora28-wor-2 4.19.0-rc5+ #18 SMP Mon Oct 1 00:32:34 CEST 2018 ppc GNU/Linux

 uid=0(root) gid=0(root) groups=0(root),65534(nogroup)

 total 5940
 drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 bin
 drwxr-xr-x.   2 nobody nogroup    4096 Jun 17 20:26 boot
 drwxr-xr-x.   4 nobody nogroup    4096 Aug 12 00:08 dev
 drwxr-xr-x.  42 nobody nogroup    4096 Sep 28 07:25 etc
 drwxr-xr-x.   3 nobody nogroup    4096 Sep 28 07:25 home
 drwxr-xr-x.   9 nobody nogroup    4096 Aug 12 00:58 lib
 drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 media
 drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 mnt
 drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 13:09 opt
 dr-xr-xr-x. 143 nobody nogroup       0 Sep 30 23:02 proc
 -rwxr-xr-x.   1 nobody nogroup 6009712 Sep 28 07:22 qemu-ppc
 drwx------.   3 nobody nogroup    4096 Aug 12 12:54 root
 drwxr-xr-x.   3 nobody nogroup    4096 Aug 12 00:08 run
 drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:58 sbin
 drwxr-xr-x.   2 nobody nogroup    4096 Aug 12 00:08 srv
 drwxr-xr-x.   2 nobody nogroup    4096 Apr  6  2015 sys
 drwxrwxrwt.   2 nobody nogroup    4096 Sep 28 10:31 tmp
 drwxr-xr-x.  10 nobody nogroup    4096 Aug 12 00:08 usr
 drwxr-xr-x.  11 nobody nogroup    4096 Aug 12 00:08 var

If you want to use the qemu binary provided by your distro, you can use

 --load-binfmt ":qemu-ppc:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x14:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff:/bin/qemu-ppc-static:OCF"

With the 'F' flag, qemu-ppc-static will then be loaded from the main root
filesystem before switching to the chroot.

Laurent Vivier (2):
  ns: introduce binfmt_misc namespace
  binfmt_misc: move data to binfmt_namespace

 fs/binfmt_misc.c                 |  50 +++++-----
 fs/proc/namespaces.c             |   3 +
 include/linux/binfmt_namespace.h |  63 ++++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   2 +
 include/linux/user_namespace.h   |   1 +
 include/uapi/linux/sched.h       |   1 +
 init/Kconfig                     |   8 ++
 kernel/Makefile                  |   1 +
 kernel/binfmt_namespace.c        | 164 +++++++++++++++++++++++++++++++
 kernel/fork.c                    |   3 +-
 kernel/nsproxy.c                 |  18 +++-
 12 files changed, 289 insertions(+), 27 deletions(-)
 create mode 100644 include/linux/binfmt_namespace.h
 create mode 100644 kernel/binfmt_namespace.c

-- 
2.17.1
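The --load-binfmt argument above follows the usual binfmt_misc register format, ":name:type:offset:magic:mask:interpreter:flags". A minimal sketch of assembling such a string is shown here; build_binfmt_rule() is a hypothetical helper written for illustration and is not taken from the series or from the modified unshare. The magic and mask are passed as already-escaped text, exactly as on the command line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not part of the patch series):
 * join the binfmt_misc fields into
 *   :name:type:offset:magic:mask:interpreter:flags
 * Returns the length of the rule, or -1 if a field contains the ':'
 * separator or the result does not fit in buf.
 */
static int build_binfmt_rule(char *buf, size_t size,
			     const char *name, char type,
			     const char *offset, const char *magic,
			     const char *mask, const char *interp,
			     const char *flags)
{
	const char *fields[] = { name, offset, magic, mask, interp, flags };
	size_t i;
	int n;

	assert(buf != NULL && size > 0);

	/* ':' is the field separator, so no field may contain it. */
	for (i = 0; i < sizeof(fields) / sizeof(fields[0]); i++)
		if (strchr(fields[i], ':'))
			return -1;

	n = snprintf(buf, size, ":%s:%c:%s:%s:%s:%s:%s",
		     name, type, offset, magic, mask, interp, flags);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

An empty offset field means offset 0, which is why the example rule contains "M::" before the magic.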
* [RFC 1/2] ns: introduce binfmt_misc namespace
From: Laurent Vivier @ 2018-09-30 23:46 UTC
To: linux-kernel
Cc: linux-fsdevel, James Bottomley, Alexander Viro, linux-api,
	Eric Biederman, Dmitry Safonov, Andrei Vagin, containers,
	Laurent Vivier

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
---
 fs/proc/namespaces.c             |   3 +
 include/linux/binfmt_namespace.h |  51 +++++++++++
 include/linux/nsproxy.h          |   2 +
 include/linux/proc_ns.h          |   2 +
 include/linux/user_namespace.h   |   1 +
 include/uapi/linux/sched.h       |   1 +
 init/Kconfig                     |   8 ++
 kernel/Makefile                  |   1 +
 kernel/binfmt_namespace.c        | 153 +++++++++++++++++++++++++++++++
 kernel/fork.c                    |   3 +-
 kernel/nsproxy.c                 |  18 +++-
 11 files changed, 240 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/binfmt_namespace.h
 create mode 100644 kernel/binfmt_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..4d86549a788f 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -33,6 +33,9 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_CGROUPS
 	&cgroupns_operations,
 #endif
+#ifdef CONFIG_BINFMT_NS
+	&binfmtns_operations,
+#endif
 };
 
 static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/binfmt_namespace.h b/include/linux/binfmt_namespace.h
new file mode 100644
index 000000000000..8688869ee254
--- /dev/null
+++ b/include/linux/binfmt_namespace.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BINFMT_NAMESPACE_H
+#define _LINUX_BINFMT_NAMESPACE_H
+
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
+struct binfmt_namespace {
+	struct kref kref;
+	struct user_namespace *user_ns;
+	struct ucounts *ucounts;
+	struct ns_common ns;
+} __randomize_layout;
+extern struct binfmt_namespace init_binfmt_ns;
+
+#ifdef CONFIG_BINFMT_NS
+static inline void get_binfmt_ns(struct binfmt_namespace *ns)
+{
+	if (ns)
+		kref_get(&ns->kref);
+}
+
+extern struct binfmt_namespace *copy_binfmt_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct binfmt_namespace *old_ns);
+extern void free_binfmt_ns(struct kref *kref);
+
+static inline void put_binfmt_ns(struct binfmt_namespace *ns)
+{
+	if (ns)
+		kref_put(&ns->kref, free_binfmt_ns);
+}
+
+#else
+static inline void get_binfmt_ns(struct binfmt_namespace *ns)
+{
+}
+
+static inline void put_binfmt_ns(struct binfmt_namespace *ns)
+{
+}
+
+static inline struct binfmt_namespace *copy_binfmt_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct binfmt_namespace *old_ns)
+{
+	if (flags & CLONE_NEWBINFMT)
+		return ERR_PTR(-EINVAL);
+
+	return old_ns;
+}
+#endif
+#endif /* _LINUX_BINFMT_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..8d2294477095 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -10,6 +10,7 @@ struct uts_namespace;
 struct ipc_namespace;
 struct pid_namespace;
 struct cgroup_namespace;
+struct binfmt_namespace;
 struct fs_struct;
 
 /*
@@ -36,6 +37,7 @@ struct nsproxy {
 	struct pid_namespace *pid_ns_for_children;
 	struct net *net_ns;
 	struct cgroup_namespace *cgroup_ns;
+	struct binfmt_namespace *binfmt_ns;
 };
 extern struct nsproxy init_nsproxy;
 
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..6afa2dbc5204 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,6 +32,7 @@ extern const struct proc_ns_operations pidns_for_children_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
 extern const struct proc_ns_operations cgroupns_operations;
+extern const struct proc_ns_operations binfmtns_operations;
 
 /*
  * We always define these enumerators
@@ -43,6 +44,7 @@ enum {
 	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
 	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
 	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
+	PROC_BINFMT_INIT_INO	= 0xEFFFFFFAU,
 };
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d6b74b91096b..81365a22362c 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -45,6 +45,7 @@ enum ucount_type {
 	UCOUNT_NET_NAMESPACES,
 	UCOUNT_MNT_NAMESPACES,
 	UCOUNT_CGROUP_NAMESPACES,
+	UCOUNT_BINFMT_NAMESPACES,
 #ifdef CONFIG_INOTIFY_USER
 	UCOUNT_INOTIFY_INSTANCES,
 	UCOUNT_INOTIFY_WATCHES,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f80063e..51fe40681e8e 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_NEWBINFMT	0x00001000	/* New binfmt_misc namespace */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/init/Kconfig b/init/Kconfig
index 1e234e2f1cba..4874719a2799 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -965,6 +965,14 @@ config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
 
+config BINFMT_NS
+	bool "binfmt_misc Namespace"
+	depends on BINFMT_MISC
+	default y
+	help
+	  This allows to use several binfmt_misc configurations on
+	  the same system.
+
 endif # NAMESPACES
 
 config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index 7a63d567fdb5..313c80f5883f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_CGROUPS) += cgroup/
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
 obj-$(CONFIG_PID_NS) += pid_namespace.o
+obj-$(CONFIG_BINFMT_NS) += binfmt_namespace.o
 obj-$(CONFIG_IKCONFIG) += configs.o
 obj-$(CONFIG_SMP) += stop_machine.o
 obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
diff --git a/kernel/binfmt_namespace.c b/kernel/binfmt_namespace.c
new file mode 100644
index 000000000000..63a80bcd70df
--- /dev/null
+++ b/kernel/binfmt_namespace.c
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/slab.h>
+#include <linux/user_namespace.h>
+#include <linux/cred.h>
+#include <linux/binfmt_namespace.h>
+#include <linux/proc_ns.h>
+#include <linux/sched/task.h>
+
+static struct ucounts *inc_binfmt_namespaces(struct user_namespace *ns)
+{
+	return inc_ucount(ns, current_euid(), UCOUNT_BINFMT_NAMESPACES);
+}
+
+static void dec_binfmt_namespaces(struct ucounts *ucounts)
+{
+	dec_ucount(ucounts, UCOUNT_BINFMT_NAMESPACES);
+}
+
+static struct binfmt_namespace *create_binfmt_ns(void)
+{
+	struct binfmt_namespace *binfmt_ns;
+
+	binfmt_ns = kmalloc(sizeof(struct binfmt_namespace), GFP_KERNEL);
+	if (binfmt_ns)
+		kref_init(&binfmt_ns->kref);
+	return binfmt_ns;
+}
+
+static struct binfmt_namespace *clone_binfmt_ns(struct user_namespace *user_ns,
+					  struct binfmt_namespace *old_ns)
+{
+	struct binfmt_namespace *ns;
+	struct ucounts *ucounts;
+	int err;
+
+	err = -ENOSPC;
+	ucounts = inc_binfmt_namespaces(user_ns);
+	if (!ucounts)
+		goto fail;
+
+	err = -ENOMEM;
+	ns = create_binfmt_ns();
+	if (!ns)
+		goto fail_dec;
+
+	err = ns_alloc_inum(&ns->ns);
+	if (err)
+		goto fail_free;
+
+	ns->ucounts = ucounts;
+	ns->ns.ops = &binfmtns_operations;
+	ns->user_ns = get_user_ns(user_ns);
+	return ns;
+
+fail_free:
+	kfree(ns);
+fail_dec:
+	dec_binfmt_namespaces(ucounts);
+fail:
+	return ERR_PTR(err);
+}
+
+struct binfmt_namespace *copy_binfmt_ns(unsigned long flags,
+	struct user_namespace *user_ns, struct binfmt_namespace *old_ns)
+{
+	if (!(flags & CLONE_NEWBINFMT)) {
+		get_binfmt_ns(old_ns);
+		return old_ns;
+	}
+
+	return clone_binfmt_ns(user_ns, old_ns);
+}
+
+void free_binfmt_ns(struct kref *kref)
+{
+	struct binfmt_namespace *ns;
+
+	ns = container_of(kref, struct binfmt_namespace, kref);
+	dec_binfmt_namespaces(ns->ucounts);
+	put_user_ns(ns->user_ns);
+	ns_free_inum(&ns->ns);
+	kfree(ns);
+}
+
+static inline struct binfmt_namespace *to_binfmt_ns(struct ns_common *ns)
+{
+	return container_of(ns, struct binfmt_namespace, ns);
+}
+
+static struct ns_common *binfmtns_get(struct task_struct *task)
+{
+	struct binfmt_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->binfmt_ns;
+		get_binfmt_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns ? &ns->ns : NULL;
+}
+
+static void binfmtns_put(struct ns_common *ns)
+{
+	put_binfmt_ns(to_binfmt_ns(ns));
+}
+
+static int binfmtns_install(struct nsproxy *nsproxy, struct ns_common *new)
+{
+	struct binfmt_namespace *ns = to_binfmt_ns(new);
+
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
+	    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+		return -EPERM;
+
+	get_binfmt_ns(ns);
+	put_binfmt_ns(nsproxy->binfmt_ns);
+	nsproxy->binfmt_ns = ns;
+	return 0;
+}
+
+static struct user_namespace *binfmtns_owner(struct ns_common *ns)
+{
+	return to_binfmt_ns(ns)->user_ns;
+}
+
+const struct proc_ns_operations binfmtns_operations = {
+	.name = "binfmt_misc",
+	.type = CLONE_NEWBINFMT,
+	.get = binfmtns_get,
+	.put = binfmtns_put,
+	.install = binfmtns_install,
+	.owner = binfmtns_owner,
+};
+
+struct binfmt_namespace init_binfmt_ns = {
+	.kref = KREF_INIT(2),
+	.user_ns = &init_user_ns,
+	.ns.inum = PROC_BINFMT_INIT_INO,
+#ifdef CONFIG_BINFMT_NS
+	.ns.ops = &binfmtns_operations,
+#endif
+};
+
+static int __init binfmt_ns_init(void)
+{
+	return 0;
+}
+subsys_initcall(binfmt_ns_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b58479534f..d89cf8b89e43 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2365,7 +2365,8 @@ static int check_unshare_flags(unsigned long unshare_flags)
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
 				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
+				CLONE_NEWBINFMT))
 		return -EINVAL;
 	/*
 	 * Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d330059a..386028e6da39 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -22,6 +22,7 @@
 #include <linux/pid_namespace.h>
 #include <net/net_namespace.h>
 #include <linux/ipc_namespace.h>
+#include <linux/binfmt_namespace.h>
 #include <linux/proc_ns.h>
 #include <linux/file.h>
 #include <linux/syscalls.h>
@@ -44,6 +45,9 @@ struct nsproxy init_nsproxy = {
 #ifdef CONFIG_CGROUPS
 	.cgroup_ns		= &init_cgroup_ns,
 #endif
+#if IS_ENABLED(BINFMT_MISC)
+	.binfmt_ns		= &init_binfmt_ns,
+#endif
 };
 
 static inline struct nsproxy *create_nsproxy(void)
@@ -110,6 +114,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_net;
 	}
 
+	new_nsp->binfmt_ns = copy_binfmt_ns(flags, user_ns,
+					    tsk->nsproxy->binfmt_ns);
+	if (IS_ERR(new_nsp->binfmt_ns)) {
+		err = PTR_ERR(new_nsp->binfmt_ns);
+		goto out_net;
+	}
+
 	return new_nsp;
 
 out_net:
@@ -143,7 +154,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
 
 	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
 			      CLONE_NEWPID | CLONE_NEWNET |
-			      CLONE_NEWCGROUP)))) {
+			      CLONE_NEWCGROUP | CLONE_NEWBINFMT)))) {
 		get_nsproxy(old_ns);
 		return 0;
 	}
@@ -180,6 +191,8 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns_for_children)
 		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->binfmt_ns)
+		put_binfmt_ns(ns->binfmt_ns);
 	put_cgroup_ns(ns->cgroup_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
@@ -196,7 +209,8 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 	int err = 0;
 
 	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
+			       CLONE_NEWBINFMT)))
 		return 0;
 
 	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.17.1
* Re: [RFC 1/2] ns: introduce binfmt_misc namespace
From: Greg KH @ 2018-10-01 1:21 UTC
To: Laurent Vivier
Cc: linux-kernel, linux-fsdevel, James Bottomley, Alexander Viro,
	linux-api, Eric Biederman, Dmitry Safonov, Andrei Vagin,
	containers

On Mon, Oct 01, 2018 at 01:46:27AM +0200, Laurent Vivier wrote:
> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
> ---

I don't take patches without any changelog text, I don't know if other
maintainers are as nice.  But for a new feature, you really should write
something...

thanks,

greg k-h
* Re: [RFC 1/2] ns: introduce binfmt_misc namespace
From: Laurent Vivier @ 2018-10-01 7:00 UTC
To: Greg KH
Cc: linux-kernel, linux-fsdevel, James Bottomley, Alexander Viro,
	linux-api, Eric Biederman, Dmitry Safonov, Andrei Vagin,
	containers

On 01/10/2018 at 03:21, Greg KH wrote:
> On Mon, Oct 01, 2018 at 01:46:27AM +0200, Laurent Vivier wrote:
>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
>> ---
>
> I don't take patches without any changelog text, I don't know if other
> maintainers are as nice.  But for a new feature, you really should write
> something...

Yes, I know. But it's an RFC and all the explanations are in the cover
letter for now. I will fill in the changelog once I know whether the
feature is of interest.

Thank you for your comment.
Laurent
* [RFC 2/2] binfmt_misc: move data to binfmt_namespace
From: Laurent Vivier @ 2018-09-30 23:46 UTC
To: linux-kernel
Cc: linux-fsdevel, James Bottomley, Alexander Viro, linux-api,
	Eric Biederman, Dmitry Safonov, Andrei Vagin, containers,
	Laurent Vivier

Signed-off-by: Laurent Vivier <laurent@vivier.eu>
---
 fs/binfmt_misc.c                 | 50 +++++++++++++++++---------------
 include/linux/binfmt_namespace.h | 12 ++++++++
 kernel/binfmt_namespace.c        | 11 +++++++
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index aa4a7a23ff99..c6148b2bdd19 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -25,6 +25,7 @@
 #include <linux/syscalls.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
+#include <linux/binfmt_namespace.h>
 
 #include "internal.h"
 
@@ -38,9 +39,6 @@ enum {
 	VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
 };
 
-static LIST_HEAD(entries);
-static int enabled = 1;
-
 enum {Enabled, Magic};
 #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
 #define MISC_FMT_OPEN_BINARY (1 << 30)
@@ -60,10 +58,7 @@ typedef struct {
 	struct file *interp_file;
 } Node;
 
-static DEFINE_RWLOCK(entries_lock);
 static struct file_system_type bm_fs_type;
-static struct vfsmount *bm_mnt;
-static int entry_count;
 
 /*
  * Max length of the register string.  Determined by:
@@ -91,7 +86,7 @@ static Node *check_file(struct linux_binprm *bprm)
 	struct list_head *l;
 
 	/* Walk all the registered handlers. */
-	list_for_each(l, &entries) {
+	list_for_each(l, &binfmt_ns(entries)) {
 		Node *e = list_entry(l, Node, list);
 		char *s;
 		int j;
@@ -135,15 +130,15 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	int fd_binary = -1;
 
 	retval = -ENOEXEC;
-	if (!enabled)
+	if (!binfmt_ns(enabled))
 		return retval;
 
 	/* to keep locking time low, we copy the interpreter string */
-	read_lock(&entries_lock);
+	read_lock(&binfmt_ns(entries_lock));
 	fmt = check_file(bprm);
 	if (fmt)
 		dget(fmt->dentry);
-	read_unlock(&entries_lock);
+	read_unlock(&binfmt_ns(entries_lock));
 	if (!fmt)
 		return retval;
 
@@ -613,15 +608,15 @@ static void kill_node(Node *e)
 {
 	struct dentry *dentry;
 
-	write_lock(&entries_lock);
+	write_lock(&binfmt_ns(entries_lock));
 	list_del_init(&e->list);
-	write_unlock(&entries_lock);
+	write_unlock(&binfmt_ns(entries_lock));
 
 	dentry = e->dentry;
 	drop_nlink(d_inode(dentry));
 	d_drop(dentry);
 	dput(dentry);
-	simple_release_fs(&bm_mnt, &entry_count);
+	simple_release_fs(&binfmt_ns(bm_mnt), &binfmt_ns(entry_count));
 }
 
 /* /<entry> */
@@ -716,7 +711,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	if (!inode)
 		goto out2;
 
-	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
+	err = simple_pin_fs(&bm_fs_type, &binfmt_ns(bm_mnt),
+			    &binfmt_ns(entry_count));
 	if (err) {
 		iput(inode);
 		inode = NULL;
@@ -730,7 +726,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 		if (IS_ERR(f)) {
 			err = PTR_ERR(f);
 			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
-			simple_release_fs(&bm_mnt, &entry_count);
+			simple_release_fs(&binfmt_ns(bm_mnt),
+					  &binfmt_ns(entry_count));
 			iput(inode);
 			inode = NULL;
 			goto out2;
@@ -743,9 +740,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
 	inode->i_fop = &bm_entry_operations;
 
 	d_instantiate(dentry, inode);
-	write_lock(&entries_lock);
-	list_add(&e->list, &entries);
-	write_unlock(&entries_lock);
+	write_lock(&binfmt_ns(entries_lock));
+	list_add(&e->list, &binfmt_ns(entries));
+	write_unlock(&binfmt_ns(entries_lock));
 
 	err = 0;
 out2:
@@ -770,7 +767,7 @@ static const struct file_operations bm_register_operations = {
 static ssize_t
 bm_status_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
 {
-	char *s = enabled ? "enabled\n" : "disabled\n";
+	char *s = binfmt_ns(enabled) ? "enabled\n" : "disabled\n";
 
 	return simple_read_from_buffer(buf, nbytes, ppos, s, strlen(s));
 }
@@ -784,19 +781,20 @@ static ssize_t bm_status_write(struct file *file, const char __user *buffer,
 	switch (res) {
 	case 1:
 		/* Disable all handlers. */
-		enabled = 0;
+		binfmt_ns(enabled) = 0;
 		break;
 	case 2:
 		/* Enable all handlers. */
-		enabled = 1;
+		binfmt_ns(enabled) = 1;
 		break;
 	case 3:
 		/* Delete all handlers. */
 		root = file_inode(file)->i_sb->s_root;
 		inode_lock(d_inode(root));
 
-		while (!list_empty(&entries))
-			kill_node(list_first_entry(&entries, Node, list));
+		while (!list_empty(&binfmt_ns(entries)))
+			kill_node(list_first_entry(&binfmt_ns(entries),
+						   Node, list));
 
 		inode_unlock(d_inode(root));
 		break;
@@ -838,7 +836,10 @@ static int bm_fill_super(struct super_block *sb, void *data, int silent)
 static struct dentry *bm_mount(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data)
 {
-	return mount_single(fs_type, flags, data, bm_fill_super);
+	struct binfmt_namespace *binfmt_ns = current->nsproxy->binfmt_ns;
+
+	return mount_ns(fs_type, flags, data, binfmt_ns, binfmt_ns->user_ns,
+			bm_fill_super);
 }
 
 static struct linux_binfmt misc_format = {
@@ -849,6 +850,7 @@ static struct linux_binfmt misc_format = {
 static struct file_system_type bm_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "binfmt_misc",
+	.fs_flags	= FS_USERNS_MOUNT,
 	.mount		= bm_mount,
 	.kill_sb	= kill_litter_super,
 };
diff --git a/include/linux/binfmt_namespace.h b/include/linux/binfmt_namespace.h
index 8688869ee254..550357ab4f62 100644
--- a/include/linux/binfmt_namespace.h
+++ b/include/linux/binfmt_namespace.h
@@ -7,12 +7,24 @@ extern struct user_namespace init_user_ns;
 
 struct binfmt_namespace {
 	struct kref kref;
+
+	struct list_head entries;
+	rwlock_t entries_lock;
+	int enabled;
+	struct vfsmount *bm_mnt;
+	int entry_count;
+
+	/* user_ns which owns the binfmt_misc ns */
+
 	struct user_namespace *user_ns;
 	struct ucounts *ucounts;
+
 	struct ns_common ns;
 } __randomize_layout;
 extern struct binfmt_namespace init_binfmt_ns;
 
+#define binfmt_ns(a) (current->nsproxy->binfmt_ns->a)
+
 #ifdef CONFIG_BINFMT_NS
 static inline void get_binfmt_ns(struct binfmt_namespace *ns)
 {
diff --git a/kernel/binfmt_namespace.c b/kernel/binfmt_namespace.c
index 63a80bcd70df..22be49beee08 100644
--- a/kernel/binfmt_namespace.c
+++ b/kernel/binfmt_namespace.c
@@ -48,6 +48,12 @@ static struct binfmt_namespace *clone_binfmt_ns(struct user_namespace *user_ns,
 	if (err)
 		goto fail_free;
 
+	INIT_LIST_HEAD(&ns->entries);
+	ns->enabled = 1;
+	rwlock_init(&ns->entries_lock);
+	ns->bm_mnt = NULL;
+	ns->entry_count = 0;
+
 	ns->ucounts = ucounts;
 	ns->ns.ops = &binfmtns_operations;
 	ns->user_ns = get_user_ns(user_ns);
@@ -140,6 +146,9 @@ const struct proc_ns_operations binfmtns_operations = {
 struct binfmt_namespace init_binfmt_ns = {
 	.kref = KREF_INIT(2),
 	.user_ns = &init_user_ns,
+	.enabled = 1,
+	.entry_count = 0,
+	.bm_mnt = NULL,
 	.ns.inum = PROC_BINFMT_INIT_INO,
 #ifdef CONFIG_BINFMT_NS
 	.ns.ops = &binfmtns_operations,
@@ -148,6 +157,8 @@ struct binfmt_namespace init_binfmt_ns = {
 
 static int __init binfmt_ns_init(void)
 {
+	INIT_LIST_HEAD(&init_binfmt_ns.entries);
+	rwlock_init(&init_binfmt_ns.entries_lock);
 	return 0;
 }
 subsys_initcall(binfmt_ns_init);
-- 
2.17.1
* Re: [RFC 2/2] binfmt_misc: move data to binfmt_namespace
From: Jann Horn @ 2018-10-01 8:54 UTC
To: laurent
Cc: kernel list, linux-fsdevel, James Bottomley, Al Viro, Linux API,
	Eric W. Biederman, dima, Andrei Vagin, containers,
	Andy Lutomirski

On Mon, Oct 1, 2018 at 1:47 AM Laurent Vivier <laurent@vivier.eu> wrote:
> @@ -716,7 +711,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	if (!inode)
>  		goto out2;
>
> -	err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count);
> +	err = simple_pin_fs(&bm_fs_type, &binfmt_ns(bm_mnt),
> +			    &binfmt_ns(entry_count));
>  	if (err) {
>  		iput(inode);
>  		inode = NULL;
> @@ -730,7 +726,8 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  		if (IS_ERR(f)) {
>  			err = PTR_ERR(f);
>  			pr_notice("register: failed to install interpreter file %s\n", e->interpreter);
> -			simple_release_fs(&bm_mnt, &entry_count);
> +			simple_release_fs(&binfmt_ns(bm_mnt),
> +					  &binfmt_ns(entry_count));
>  			iput(inode);
>  			inode = NULL;
>  			goto out2;
> @@ -743,9 +740,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer,
>  	inode->i_fop = &bm_entry_operations;
>
>  	d_instantiate(dentry, inode);
> -	write_lock(&entries_lock);
> -	list_add(&e->list, &entries);
> -	write_unlock(&entries_lock);
> +	write_lock(&binfmt_ns(entries_lock));
> +	list_add(&e->list, &binfmt_ns(entries));
> +	write_unlock(&binfmt_ns(entries_lock));

This looks wrong. A write handler's behavior should not depend on the
namespace of the process that is using it. Ideally, the affected
namespace should depend on the file you're writing to. If that's not
possible, the affected namespace should at least be the namespace of
the process that opened the file.
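The concern above can be illustrated with a small userspace toy model. This is only a sketch of the idea: the toy_* types and functions are invented for illustration and are not kernel APIs. The file object carries the namespace it belongs to (whether derived from the filesystem instance or captured at open time), and the write handler uses that pointer instead of looking at the writing task.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of these are kernel types. */
struct toy_ns {
	int entries;			/* number of registered handlers */
};

struct toy_task {
	struct toy_ns *ns;		/* namespace the task lives in */
};

struct toy_file {
	struct toy_ns *ns;		/* namespace pinned when the file was opened */
};

/* Opening the "register" file fixes the namespace the file acts on. */
static struct toy_file toy_open_register(const struct toy_task *opener)
{
	struct toy_file f = { .ns = opener->ns };

	return f;
}

/*
 * The write handler acts on the namespace attached to the file. The
 * writer is passed in only to show that it is deliberately ignored.
 */
static void toy_register_write(struct toy_file *file,
			       const struct toy_task *writer)
{
	(void)writer;
	assert(file->ns != NULL);
	file->ns->entries++;
}
```

With this shape, passing an open file descriptor to a task in another namespace cannot redirect the write to the receiver's namespace.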
* Re: [RFC 0/2] ns: introduce binfmt_misc namespace
From: Andy Lutomirski @ 2018-10-01 4:45 UTC
To: laurent
Cc: LKML, Linux FS Devel, James Bottomley, Al Viro, Linux API,
	Eric W. Biederman, Dmitry Safonov, Andrey Vagin,
	Linux Containers

On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier <laurent@vivier.eu> wrote:
>
> This series introduces a new namespace for binfmt_misc.
>

This seems conceptually quite reasonable, but I'm wondering if the
number of namespace types is getting out of hand given the current
API.  Should we be considering whether we need a new set of namespace
creation APIs that scale better to larger numbers of namespace types?
* Re: [RFC 0/2] ns: introduce binfmt_misc namespace
From: Laurent Vivier @ 2018-10-01 7:13 UTC
To: Andy Lutomirski
Cc: LKML, Linux FS Devel, James Bottomley, Al Viro, Linux API,
	Eric W. Biederman, Dmitry Safonov, Andrey Vagin,
	Linux Containers

On 01/10/2018 at 06:45, Andy Lutomirski wrote:
> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier <laurent@vivier.eu> wrote:
>>
>> This series introduces a new namespace for binfmt_misc.
>>
>
> This seems conceptually quite reasonable, but I'm wondering if the
> number of namespace types is getting out of hand given the current
> API.  Should we be considering whether we need a new set of namespace
> creation APIs that scale better to larger numbers of namespace types?
>

Yes, we need something to increase the maximum number of namespace types,
because this is the last bit in the clone() flags and the time namespace
has already preempted it.

Thanks,
Laurent
* Re: [RFC 0/2] ns: introduce binfmt_misc namespace
From: Dmitry Safonov @ 2018-10-01 12:26 UTC
To: Laurent Vivier, Andy Lutomirski
Cc: LKML, Linux FS Devel, James Bottomley, Al Viro, Linux API,
	Eric W. Biederman, Andrey Vagin, Linux Containers

Hi Laurent, thanks for the Cc,

On Mon, 2018-10-01 at 09:13 +0200, Laurent Vivier wrote:
> Le 01/10/2018 à 06:45, Andy Lutomirski a écrit :
> > On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier <laurent@vivier.eu>
> > wrote:
> > > 
> > > This series introduces a new namespace for binfmt_misc.
> > > 
> > 
> > This seems conceptually quite reasonable, but I'm wondering if the
> > number of namespace types is getting out of hand given the current
> > API.  Should we be considering whether we need a new set of
> > namespace
> > creation APIs that scale better to larger numbers of namespace
> > types?
> > 
> 
> Yes, we need something to increase the maximum number of namespace
> types
> because this is the last bit in the clone() flags and the time
> namespace
> has already preempted it.

Yeah, there is this last CLONE_* flag.
I tried to use that 0x1000 flag for something like CLONE_EXTENDED with
all parameters on the stack, but I'm not sure that's reasonable, and
maybe someone will suggest a better solution. All those different
clone() ABIs (how many parameters to supply and in which order) do not
help much.

-- 
Thanks,
             Dmitry
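The "last bit" remark can be checked with a little arithmetic. The sketch below hard-codes the 32-bit clone() flag values as I understand them from include/uapi/linux/sched.h around v4.19 (that snapshot is an assumption; later kernels assign further bits), ORs them together with the CSIGNAL mask, and reports what is left over. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values assumed from include/uapi/linux/sched.h as of v4.19. */
static const uint32_t clone_flags_v4_19[] = {
	0x00000100, /* CLONE_VM */
	0x00000200, /* CLONE_FS */
	0x00000400, /* CLONE_FILES */
	0x00000800, /* CLONE_SIGHAND */
	0x00002000, /* CLONE_PTRACE */
	0x00004000, /* CLONE_VFORK */
	0x00008000, /* CLONE_PARENT */
	0x00010000, /* CLONE_THREAD */
	0x00020000, /* CLONE_NEWNS */
	0x00040000, /* CLONE_SYSVSEM */
	0x00080000, /* CLONE_SETTLS */
	0x00100000, /* CLONE_PARENT_SETTID */
	0x00200000, /* CLONE_CHILD_CLEARTID */
	0x00400000, /* CLONE_DETACHED */
	0x00800000, /* CLONE_UNTRACED */
	0x01000000, /* CLONE_CHILD_SETTID */
	0x02000000, /* CLONE_NEWCGROUP */
	0x04000000, /* CLONE_NEWUTS */
	0x08000000, /* CLONE_NEWIPC */
	0x10000000, /* CLONE_NEWUSER */
	0x20000000, /* CLONE_NEWPID */
	0x40000000, /* CLONE_NEWNET */
	0x80000000, /* CLONE_IO */
};

/* Return the bits of the 32-bit flags word that no flag above uses. */
static uint32_t free_clone_bits(void)
{
	uint32_t used = 0x000000ff;	/* CSIGNAL: exit signal mask */
	size_t i;

	for (i = 0; i < sizeof(clone_flags_v4_19) / sizeof(clone_flags_v4_19[0]); i++)
		used |= clone_flags_v4_19[i];
	return ~used;
}

/* Check whether one specific bit is still unassigned. */
static int clone_bit_is_free(uint32_t bit)
{
	assert(bit != 0 && (bit & (bit - 1)) == 0);	/* exactly one bit set */
	return (free_clone_bits() & bit) != 0;
}
```

Under that assumption the only unassigned bit is 0x00001000, which is the value the series uses for CLONE_NEWBINFMT.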
* Re: [RFC 0/2] ns: introduce binfmt_misc namespace
From: Eric W. Biederman @ 2018-10-01 7:21 UTC
To: Andy Lutomirski
Cc: laurent, LKML, Linux FS Devel, James Bottomley, Al Viro,
	Linux API, Dmitry Safonov, Andrey Vagin, Linux Containers

Andy Lutomirski <luto@kernel.org> writes:

> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier <laurent@vivier.eu> wrote:
>>
>> This series introduces a new namespace for binfmt_misc.
>>
>
> This seems conceptually quite reasonable, but I'm wondering if the
> number of namespace types is getting out of hand given the current
> API.  Should we be considering whether we need a new set of namespace
> creation APIs that scale better to larger numbers of namespace types?

I would rather encourage a way to make this part of an existing
namespace or find a way to make a mount of binfmt_misc control this.

Hmm.  This looks like something that can very straightforwardly be
made part of the user namespace.  If you ever mount binfmt_misc in the
user namespace you get the new behavior.  Otherwise you get the existing
behavior.

A user namespace will definitely be required, as otherwise you run the
risk of confusing root (and suid root executables) by being able to
change the behavior of executables.

What is the motivation for this?  My impression is that very few people
tweak binfmt_misc.

I also don't think this rises to the level where it makes sense to
create a new namespace for this.

Eric
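One possible reading of the suggestion above, as a userspace toy model: each user namespace optionally owns a binfmt_misc table, which comes into existence the first time binfmt_misc is mounted there, and lookups otherwise fall back to the table already in effect. The toy_* names are invented for this sketch, and the choice to fall back to the nearest ancestor (rather than straight to the initial namespace) is my assumption, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; none of these are kernel types. */
struct toy_binfmt {
	int enabled;
	int entry_count;
};

struct toy_user_ns {
	struct toy_user_ns *parent;	/* NULL for the initial namespace */
	struct toy_binfmt *binfmt;	/* NULL until binfmt_misc is mounted here */
};

/*
 * Find the table that applies to ns: its own if binfmt_misc was mounted
 * in it, otherwise the nearest ancestor's (assumed fallback rule). The
 * initial namespace is expected to always have a table.
 */
static struct toy_binfmt *toy_binfmt_lookup(struct toy_user_ns *ns)
{
	while (ns && !ns->binfmt)
		ns = ns->parent;
	assert(ns != NULL);
	return ns->binfmt;
}

/* Mounting binfmt_misc in a namespace gives it a fresh, empty table. */
static void toy_mount_binfmt_misc(struct toy_user_ns *ns,
				  struct toy_binfmt *fresh)
{
	fresh->enabled = 1;
	fresh->entry_count = 0;
	ns->binfmt = fresh;
}
```

In this model a container that never mounts binfmt_misc keeps seeing the existing handlers, which matches the "otherwise you get the existing behavior" part of the proposal.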
* Re: [RFC 0/2] ns: introduce binfmt_misc namespace
  2018-10-01  7:21   ` Eric W. Biederman
@ 2018-10-01  8:45     ` Laurent Vivier
  2018-10-01  8:56       ` Eric W. Biederman
  0 siblings, 1 reply; 12+ messages in thread
From: Laurent Vivier @ 2018-10-01 8:45 UTC
To: Eric W. Biederman, Andy Lutomirski
Cc: LKML, Linux FS Devel, James Bottomley, Al Viro, Linux API, Dmitry Safonov, Andrey Vagin, Linux Containers

On 01/10/2018 at 09:21, Eric W. Biederman wrote:
> Andy Lutomirski <luto@kernel.org> writes:
>
>> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier <laurent@vivier.eu> wrote:
>>>
>>> This series introduces a new namespace for binfmt_misc.
>>>
>>
>> This seems conceptually quite reasonable, but I'm wondering if the
>> number of namespace types is getting out of hand given the current
>> API. Should we be considering whether we need a new set of namespace
>> creation APIs that scale better to larger numbers of namespace types?
>
> I would rather encourage a way to make this part of an existing
> namespace or find a way to make a mount of binfmt_misc control this.
>
> Hmm. This looks like something that can very straightforwardly be
> made part of the user namespace. If you ever mount binfmt_misc in the
> user namespace you get the new behavior. Otherwise you get the existing
> behavior.

Thank you. I'll do that.

> A user namespace will definitely be required, as otherwise you run the
> risk of confusing root (and suid root executables) by being able to
> change the behavior of executables.
>
> What is the motivation for this? My impression is that very few people
> tweak binfmt_misc.

I think more and more people are using an interpreter like qemu
linux-user mode to get a cross-compilation environment: they bootstrap a
distro filesystem (with something like debootstrap), and then use
binfmt_misc to run the compiler inside this environment (see for
instance [1] [2] [3] or [4] [5]). This is interesting because you get
more than a cross-compiler: you also have all the libraries of the
target system, and you can select exactly which target release you want
to build for, with the exact same compiler and library versions (and
you can reuse it if you want to do maintenance on your project 10 years
later...)

The problem with this is that you need to be root:
1- to chroot
2- to configure binfmt_misc

We can already use "unshare --map-root-user chroot" to address point 1,
and this series tries to address point 2.

I think it's also interesting to have a per-container configuration for
binfmt_misc when the server administrator configures it and doesn't want
to share each user's configuration with all the other users (in
something like docker or a cloud application).

> I also don't think this rises to the level where it makes sense to
> create a new namespace for this.

OK.

Thanks,
Laurent

[1] https://wiki.debian.org/Arm64Qemu
[2] https://wiki.debian.org/M68k/sbuildQEMU
[3] https://wiki.debian.org/RISC-V#Manual_qemu-user_installation
[4] https://kbeckmann.github.io/2017/05/26/QEMU-instead-of-cross-compiling/
[5] https://wiki.gentoo.org/wiki/Crossdev_qemu-static-user-chroot
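As background for "configure binfmt_misc", the sketch below parses a registration string of the form :name:type:offset:magic:mask:interpreter:flags and checks a file header against it. The rule is the qemu-ppc one from the cover letter. The function names and the dict layout are invented for this illustration, and the matching logic is a simplified model of what the kernel does for type M entries, not a copy of fs/binfmt_misc.c.

```python
import re

# The qemu-ppc rule from the cover letter, built in pieces so the byte counts
# are easy to check. Backslash escapes are kept as literal text, the way the
# string is written to the binfmt_misc register file.
MAGIC = r"\x7fELF\x01\x02\x01" + r"\x00" * 10 + r"\x02\x00\x14"
MASK = r"\xff" * 7 + r"\x00" + r"\xff" * 9 + r"\xfe\xff\xff"
PPC_RULE = f":qemu-ppc:M::{MAGIC}:{MASK}:/qemu-ppc:OC"


def _unescape(text: str) -> bytes:
    """Turn literal \\xHH sequences into the bytes they stand for."""
    raw = text.encode("latin-1")
    return re.sub(rb"\\x([0-9a-fA-F]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def parse_binfmt_rule(rule: str) -> dict:
    """Split a registration string; the first character is the field separator."""
    sep = rule[0]
    name, kind, offset, magic, mask, interpreter, flags = rule[1:].split(sep)
    magic_bytes = _unescape(magic)
    # An empty mask means every bit of the magic must match.
    mask_bytes = _unescape(mask) if mask else b"\xff" * len(magic_bytes)
    return {
        "name": name,
        "type": kind,
        "offset": int(offset) if offset else 0,
        "magic": magic_bytes,
        "mask": mask_bytes,
        "interpreter": interpreter,
        "flags": flags,
    }


def matches(header: bytes, rule: dict) -> bool:
    """True if `header` equals the magic at `offset` on every unmasked bit."""
    magic, mask, offset = rule["magic"], rule["mask"], rule["offset"]
    window = header[offset:offset + len(magic)]
    if len(window) < len(magic):
        return False
    return all(((h ^ g) & m) == 0 for h, g, m in zip(window, magic, mask))


if __name__ == "__main__":
    rule = parse_binfmt_rule(PPC_RULE)
    ppc_header = bytes.fromhex("7f454c46 01020100 0000000000000000 0002 0014")
    print(rule["interpreter"], matches(ppc_header, rule))
```

The mask clears byte 7 of the ELF identification and the low bit of the e_type field, so the same rule accepts both ET_EXEC and ET_DYN big-endian 32-bit PowerPC binaries.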
* Re: [RFC 0/2] ns: introduce binfmt_misc namespace
  2018-10-01  8:45     ` Laurent Vivier
@ 2018-10-01  8:56       ` Eric W. Biederman
  0 siblings, 0 replies; 12+ messages in thread
From: Eric W. Biederman @ 2018-10-01 8:56 UTC
To: Laurent Vivier
Cc: Andy Lutomirski, LKML, Linux FS Devel, James Bottomley, Al Viro, Linux API, Dmitry Safonov, Andrey Vagin, Linux Containers

Laurent Vivier <laurent@vivier.eu> writes:

> On 01/10/2018 at 09:21, Eric W. Biederman wrote:
>> Andy Lutomirski <luto@kernel.org> writes:
>>
>>> On Sun, Sep 30, 2018 at 4:47 PM Laurent Vivier <laurent@vivier.eu> wrote:
>>>>
>>>> This series introduces a new namespace for binfmt_misc.
>>>>
>>>
>>> This seems conceptually quite reasonable, but I'm wondering if the
>>> number of namespace types is getting out of hand given the current
>>> API. Should we be considering whether we need a new set of namespace
>>> creation APIs that scale better to larger numbers of namespace types?
>>
>> I would rather encourage a way to make this part of an existing
>> namespace or find a way to make a mount of binfmt_misc control this.
>>
>> Hmm. This looks like something that can very straightforwardly be
>> made part of the user namespace. If you ever mount binfmt_misc in the
>> user namespace you get the new behavior. Otherwise you get the existing
>> behavior.
>
> Thank you. I'll do that.
>
>> A user namespace will definitely be required, as otherwise you run the
>> risk of confusing root (and suid root executables) by being able to
>> change the behavior of executables.
>>
>> What is the motivation for this? My impression is that very few people
>> tweak binfmt_misc.
>
> I think more and more people are using an interpreter like qemu
> linux-user mode to get a cross-compilation environment: they bootstrap a
> distro filesystem (with something like debootstrap), and then use
> binfmt_misc to run the compiler inside this environment (see for
> instance [1] [2] [3] or [4] [5]). This is interesting because you get
> more than a cross-compiler: you also have all the libraries of the
> target system, and you can select exactly which target release you want
> to build for, with the exact same compiler and library versions (and
> you can reuse it if you want to do maintenance on your project 10 years
> later...)
>
> The problem with this is that you need to be root:
> 1- to chroot
> 2- to configure binfmt_misc
>
> We can already use "unshare --map-root-user chroot" to address point 1,
> and this series tries to address point 2.
>
> I think it's also interesting to have a per-container configuration for
> binfmt_misc when the server administrator configures it and doesn't want
> to share each user's configuration with all the other users (in
> something like docker or a cloud application).

OK. So it sounds like you already need a user namespace for this.
If this is your use case then my proposed method above seems to fit
rather well.

James Bottomley was doing something similar that connected to
personality(2). That might be worth a look to see if there is some
synergy there.

>> I also don't think this rises to the level where it makes sense to
>> create a new namespace for this.
>
> OK.
>
> Thanks,
> Laurent
>
> [1] https://wiki.debian.org/Arm64Qemu
> [2] https://wiki.debian.org/M68k/sbuildQEMU
> [3] https://wiki.debian.org/RISC-V#Manual_qemu-user_installation
> [4] https://kbeckmann.github.io/2017/05/26/QEMU-instead-of-cross-compiling/
> [5] https://wiki.gentoo.org/wiki/Crossdev_qemu-static-user-chroot
Thread overview: 12+ messages
2018-09-30 23:46 [RFC 0/2] ns: introduce binfmt_misc namespace Laurent Vivier
2018-09-30 23:46 ` [RFC 1/2] " Laurent Vivier
2018-10-01  1:21   ` Greg KH
2018-10-01  7:00     ` Laurent Vivier
2018-09-30 23:46 ` [RFC 2/2] binfmt_misc: move data to binfmt_namespace Laurent Vivier
2018-10-01  8:54   ` Jann Horn
2018-10-01  4:45 ` [RFC 0/2] ns: introduce binfmt_misc namespace Andy Lutomirski
2018-10-01  7:13   ` Laurent Vivier
2018-10-01 12:26     ` Dmitry Safonov
2018-10-01  7:21   ` Eric W. Biederman
2018-10-01  8:45     ` Laurent Vivier
2018-10-01  8:56       ` Eric W. Biederman