From: ebiederm@xmission.com (Eric W. Biederman)
To: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
Kernel Hardening <kernel-hardening@lists.openwall.com>,
Linux API <linux-api@vger.kernel.org>,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
Linux Security Module <linux-security-module@vger.kernel.org>,
Akinobu Mita <akinobu.mita@gmail.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Alexey Dobriyan <adobriyan@gmail.com>,
Alexey Gladkov <legion@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>,
Daniel Micay <danielmicay@gmail.com>,
Djalal Harouni <tixxdz@gmail.com>,
"Dmitry V . Levin" <ldv@altlinux.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Ingo Molnar <mingo@kernel.org>,
"J . Bruce Fields" <bfields@fieldses.org>,
Jeff Layton <jlayton@poochiereds.net>,
Jonathan Corbet <corbet@lwn.net>,
Kees Cook <keescook@chromium.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Oleg Nesterov <oleg@redhat.com>,
David Howells <dhowells@redhat.com>
Subject: Re: [PATCH RESEND v11 2/8] proc: allow to mount many instances of proc in one pid namespace
Date: Fri, 17 Apr 2020 13:55:05 -0500 [thread overview]
Message-ID: <87tv1iaqnq.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <20200409123752.1070597-3-gladkov.alexey@gmail.com> (Alexey Gladkov's message of "Thu, 9 Apr 2020 14:37:46 +0200")
Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> This patch allows to have multiple procfs instances inside the
> same pid namespace. The aim here is lightweight sandboxes, and to allow
> that we have to modernize procfs internals.
>
> 1) The main aim of this work is to have on embedded systems one
> supervisor for apps. Right now we have some lightweight sandbox support,
> however if we create pid namespacess we have to manages all the
> processes inside too, where our goal is to be able to run a bunch of
> apps each one inside its own mount namespace without being able to
> notice each other. We only want to use mount namespaces, and we want
> procfs to behave more like a real mount point.
>
> 2) Linux Security Modules have multiple ptrace paths inside some
> subsystems, however inside procfs, the implementation does not guarantee
> that the ptrace() check which triggers the security_ptrace_check() hook
> will always run. We have the 'hidepid' mount option that can be used to
> force the ptrace_may_access() check inside has_pid_permissions() to run.
> The problem is that 'hidepid' is per pid namespace and not attached to
> the mount point, any remount or modification of 'hidepid' will propagate
> to all other procfs mounts.
>
> This also does not allow to support Yama LSM easily in desktop and user
> sessions. Yama ptrace scope which restricts ptrace and some other
> syscalls to be allowed only on inferiors, can be updated to have a
> per-task context, where the context will be inherited during fork(),
> clone() and preserved across execve(). If we support multiple private
> procfs instances, then we may force the ptrace_may_access() on
> /proc/<pids>/ to always run inside that new procfs instances. This will
> allow to specifiy on user sessions if we should populate procfs with
> pids that the user can ptrace or not.
>
> By using Yama ptrace scope, some restricted users will only be able to see
> inferiors inside /proc, they won't even be able to see their other
> processes. Some software like Chromium, Firefox's crash handler, Wine
> and others are already using Yama to restrict which processes can be
> ptracable. With this change this will give the possibility to restrict
> /proc/<pids>/ but more importantly this will give desktop users a
> generic and usuable way to specifiy which users should see all processes
> and which users can not.
>
> Side notes:
> * This covers the lack of seccomp where it is not able to parse
> arguments, it is easy to install a seccomp filter on direct syscalls
> that operate on pids, however /proc/<pid>/ is a Linux ABI using
> filesystem syscalls. With this change LSMs should be able to analyze
> open/read/write/close...
>
> In the new patchset version I removed the 'newinstance' option
> as suggested by Eric W. Biederman.
Some very small requests.
1) Can you please not place fs_info in fs_context, and instead allocate
fs_info in fill_super? Unless I have misread introduced a resource
leak if proc is not mounted or if proc is simply reconfigured.
2) Can you please move hide_pid and pid_gid into fs_info in this patch?
As was shown by my recent bug fix
3) Can you please rebase on on v5.7-rc1 or v5.7-rc2 and repost these
patches please? I thought I could do it safely but between my bug
fixes, and Alexey Dobriyan's parallel changes to proc these patches
do not apply cleanly.
Plus there is a resource leak in this patch.
Eric
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> ---
> fs/proc/base.c | 13 +++++++----
> fs/proc/inode.c | 4 ++--
> fs/proc/root.c | 42 ++++++++++++++++++++++-------------
> fs/proc/self.c | 6 ++---
> fs/proc/thread_self.c | 6 ++---
> include/linux/pid_namespace.h | 4 ----
> include/linux/proc_fs.h | 12 ++++++++++
> 7 files changed, 55 insertions(+), 32 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 74f948a6b621..3b9155a69ade 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3301,6 +3301,7 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
> {
> struct task_struct *task;
> unsigned tgid;
> + struct proc_fs_info *fs_info;
> struct pid_namespace *ns;
> struct dentry *result = ERR_PTR(-ENOENT);
>
> @@ -3308,7 +3309,8 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
> if (tgid == ~0U)
> goto out;
>
> - ns = dentry->d_sb->s_fs_info;
> + fs_info = proc_sb_info(dentry->d_sb);
> + ns = fs_info->pid_ns;
> rcu_read_lock();
> task = find_task_by_pid_ns(tgid, ns);
> if (task)
> @@ -3372,6 +3374,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
> int proc_pid_readdir(struct file *file, struct dir_context *ctx)
> {
> struct tgid_iter iter;
> + struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
> struct pid_namespace *ns = proc_pid_ns(file_inode(file));
> loff_t pos = ctx->pos;
>
> @@ -3379,13 +3382,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
> return 0;
>
> if (pos == TGID_OFFSET - 2) {
> - struct inode *inode = d_inode(ns->proc_self);
> + struct inode *inode = d_inode(fs_info->proc_self);
> if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
> return 0;
> ctx->pos = pos = pos + 1;
> }
> if (pos == TGID_OFFSET - 1) {
> - struct inode *inode = d_inode(ns->proc_thread_self);
> + struct inode *inode = d_inode(fs_info->proc_thread_self);
> if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
> return 0;
> ctx->pos = pos = pos + 1;
> @@ -3599,6 +3602,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
> struct task_struct *task;
> struct task_struct *leader = get_proc_task(dir);
> unsigned tid;
> + struct proc_fs_info *fs_info;
> struct pid_namespace *ns;
> struct dentry *result = ERR_PTR(-ENOENT);
>
> @@ -3609,7 +3613,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
> if (tid == ~0U)
> goto out;
>
> - ns = dentry->d_sb->s_fs_info;
> + fs_info = proc_sb_info(dentry->d_sb);
> + ns = fs_info->pid_ns;
> rcu_read_lock();
> task = find_task_by_pid_ns(tid, ns);
> if (task)
> diff --git a/fs/proc/inode.c b/fs/proc/inode.c
> index 1e730ea1dcd6..6e4c6728338b 100644
> --- a/fs/proc/inode.c
> +++ b/fs/proc/inode.c
> @@ -167,8 +167,8 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
>
> static int proc_show_options(struct seq_file *seq, struct dentry *root)
> {
> - struct super_block *sb = root->d_sb;
> - struct pid_namespace *pid = sb->s_fs_info;
> + struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
> + struct pid_namespace *pid = fs_info->pid_ns;
>
> if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
> seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 2633f10446c3..b28adbb0b937 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -30,7 +30,7 @@
> #include "internal.h"
>
> struct proc_fs_context {
> - struct pid_namespace *pid_ns;
> + struct proc_fs_info *fs_info;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please don't do this. As best as I can tell that introduces a memory
leak of proc is not mounted. Please allocate fs_info in
> unsigned int mask;
> int hidepid;
> int gid;
> @@ -92,7 +92,8 @@ static void proc_apply_options(struct super_block *s,
>
> static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> {
> - struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
> + struct proc_fs_context *ctx = fc->fs_private;
> + struct pid_namespace *pid_ns = get_pid_ns(ctx->fs_info->pid_ns);
> struct inode *root_inode;
> int ret;
>
> @@ -106,6 +107,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> s->s_magic = PROC_SUPER_MAGIC;
> s->s_op = &proc_sops;
> s->s_time_gran = 1;
> + s->s_fs_info = ctx->fs_info;
>
> /*
> * procfs isn't actually a stacking filesystem; however, there is
> @@ -113,7 +115,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> * top of it
> */
> s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> -
> +
> /* procfs dentries and inodes don't require IO to create */
> s->s_shrink.seeks = 0;
>
> @@ -140,7 +142,8 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> static int proc_reconfigure(struct fs_context *fc)
> {
> struct super_block *sb = fc->root->d_sb;
> - struct pid_namespace *pid = sb->s_fs_info;
> + struct proc_fs_info *fs_info = proc_sb_info(sb);
> + struct pid_namespace *pid = fs_info->pid_ns;
>
> sync_filesystem(sb);
>
> @@ -150,16 +153,14 @@ static int proc_reconfigure(struct fs_context *fc)
>
> static int proc_get_tree(struct fs_context *fc)
> {
> - struct proc_fs_context *ctx = fc->fs_private;
> -
> - return get_tree_keyed(fc, proc_fill_super, ctx->pid_ns);
> + return get_tree_nodev(fc, proc_fill_super);
> }
>
> static void proc_fs_context_free(struct fs_context *fc)
> {
> struct proc_fs_context *ctx = fc->fs_private;
>
> - put_pid_ns(ctx->pid_ns);
> + put_pid_ns(ctx->fs_info->pid_ns);
> kfree(ctx);
> }
>
> @@ -178,9 +179,15 @@ static int proc_init_fs_context(struct fs_context *fc)
> if (!ctx)
> return -ENOMEM;
>
> - ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
> + ctx->fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_KERNEL);
> + if (!ctx->fs_info) {
> + kfree(ctx);
> + return -ENOMEM;
> + }
> +
> + ctx->fs_info->pid_ns = get_pid_ns(task_active_pid_ns(current));
> put_user_ns(fc->user_ns);
> - fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> + fc->user_ns = get_user_ns(ctx->fs_info->pid_ns->user_ns);
> fc->fs_private = ctx;
> fc->ops = &proc_fs_context_ops;
> return 0;
> @@ -188,15 +195,18 @@ static int proc_init_fs_context(struct fs_context *fc)
>
> static void proc_kill_sb(struct super_block *sb)
> {
> - struct pid_namespace *ns;
> + struct proc_fs_info *fs_info = proc_sb_info(sb);
> + struct pid_namespace *ns = fs_info->pid_ns;
> +
> + if (fs_info->proc_self)
> + dput(fs_info->proc_self);
> +
> + if (fs_info->proc_thread_self)
> + dput(fs_info->proc_thread_self);
>
> - ns = (struct pid_namespace *)sb->s_fs_info;
> - if (ns->proc_self)
> - dput(ns->proc_self);
> - if (ns->proc_thread_self)
> - dput(ns->proc_thread_self);
> kill_anon_super(sb);
> put_pid_ns(ns);
> + kfree(fs_info);
> }
>
> static struct file_system_type proc_fs_type = {
> diff --git a/fs/proc/self.c b/fs/proc/self.c
> index 57c0a1047250..309301ac0136 100644
> --- a/fs/proc/self.c
> +++ b/fs/proc/self.c
> @@ -36,10 +36,10 @@ static unsigned self_inum __ro_after_init;
> int proc_setup_self(struct super_block *s)
> {
> struct inode *root_inode = d_inode(s->s_root);
> - struct pid_namespace *ns = proc_pid_ns(root_inode);
> + struct proc_fs_info *fs_info = proc_sb_info(s);
> struct dentry *self;
> int ret = -ENOMEM;
> -
> +
> inode_lock(root_inode);
> self = d_alloc_name(s->s_root, "self");
> if (self) {
> @@ -62,7 +62,7 @@ int proc_setup_self(struct super_block *s)
> if (ret)
> pr_err("proc_fill_super: can't allocate /proc/self\n");
> else
> - ns->proc_self = self;
> + fs_info->proc_self = self;
>
> return ret;
> }
> diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
> index f61ae53533f5..2493cbbdfa6f 100644
> --- a/fs/proc/thread_self.c
> +++ b/fs/proc/thread_self.c
> @@ -36,7 +36,7 @@ static unsigned thread_self_inum __ro_after_init;
> int proc_setup_thread_self(struct super_block *s)
> {
> struct inode *root_inode = d_inode(s->s_root);
> - struct pid_namespace *ns = proc_pid_ns(root_inode);
> + struct proc_fs_info *fs_info = proc_sb_info(s);
> struct dentry *thread_self;
> int ret = -ENOMEM;
>
> @@ -60,9 +60,9 @@ int proc_setup_thread_self(struct super_block *s)
> inode_unlock(root_inode);
>
> if (ret)
> - pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
> + pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
> else
> - ns->proc_thread_self = thread_self;
> + fs_info->proc_thread_self = thread_self;
>
> return ret;
> }
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 4956e362e55e..de4534d93cb6 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -32,10 +32,6 @@ struct pid_namespace {
> struct kmem_cache *pid_cachep;
> unsigned int level;
> struct pid_namespace *parent;
> -#ifdef CONFIG_PROC_FS
> - struct dentry *proc_self;
> - struct dentry *proc_thread_self;
> -#endif
> #ifdef CONFIG_BSD_PROCESS_ACCT
> struct fs_pin *bacct;
> #endif
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 40a7982b7285..5920a4ecd71b 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -27,6 +27,17 @@ struct proc_ops {
> unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
> };
>
> +struct proc_fs_info {
> + struct pid_namespace *pid_ns;
> + struct dentry *proc_self; /* For /proc/self */
> + struct dentry *proc_thread_self; /* For /proc/thread-self */
> +};
> +
> +static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
> +{
> + return sb->s_fs_info;
> +}
> +
> #ifdef CONFIG_PROC_FS
>
> typedef int (*proc_write_t)(struct file *, char *, size_t);
> @@ -161,6 +172,7 @@ int open_related_ns(struct ns_common *ns,
> /* get the associated pid namespace for a file in procfs */
> static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
> {
> + return proc_sb_info(inode->i_sb)->pid_ns;
> return inode->i_sb->s_fs_info;
> }
next prev parent reply other threads:[~2020-04-17 18:58 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-04-09 12:37 [PATCH RESEND v11 0/8] proc: modernize proc to support multiple private instances Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 1/8] proc: rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 2/8] proc: allow to mount many instances of proc in one pid namespace Alexey Gladkov
2020-04-17 18:55 ` Eric W. Biederman [this message]
2020-04-19 14:17 ` Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 3/8] proc: move hide_pid, pid_gid from pid_namespace to proc_fs_info Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 4/8] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 5/8] proc: add option to mount only a pids subset Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 6/8] docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 7/8] proc: use human-readable values for hidepid Alexey Gladkov
2020-04-17 19:05 ` Eric W. Biederman
2020-04-19 14:19 ` Alexey Gladkov
2020-04-09 12:37 ` [PATCH RESEND v11 8/8] proc: use named enums for better readability Alexey Gladkov
2020-04-09 12:59 ` [PATCH RESEND v11 0/8] proc: modernize proc to support multiple private instances Eric W. Biederman
2020-04-09 13:42 ` Alexey Gladkov
2020-04-09 14:13 ` Eric W. Biederman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87tv1iaqnq.fsf@x220.int.ebiederm.org \
--to=ebiederm@xmission.com \
--cc=adobriyan@gmail.com \
--cc=akinobu.mita@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bfields@fieldses.org \
--cc=corbet@lwn.net \
--cc=danielmicay@gmail.com \
--cc=dhowells@redhat.com \
--cc=gladkov.alexey@gmail.com \
--cc=gregkh@linuxfoundation.org \
--cc=jlayton@poochiereds.net \
--cc=keescook@chromium.org \
--cc=kernel-hardening@lists.openwall.com \
--cc=ldv@altlinux.org \
--cc=legion@kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-security-module@vger.kernel.org \
--cc=luto@kernel.org \
--cc=mingo@kernel.org \
--cc=oleg@redhat.com \
--cc=tixxdz@gmail.com \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).