Kernel-hardening archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v8 00/11] proc: modernize proc to support multiple private instances
@ 2020-02-10 15:05 Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 01/11] proc: Rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
                   ` (10 more replies)
  0 siblings, 11 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

Greetings!

Preface:
--------
This is patchset v8 to modernize procfs and make it able to support multiple
private instances per the same pid namespace.

This patchset can be applied on top of v5.4-rc7-49-g0e3f1ad80fc8


Procfs modernization:
---------------------
Historically procfs was always tied to pid namespaces, during pid
namespace creation we internally create a procfs mount for it. However,
this has the effect that all new procfs mounts are just a mirror of the
internal one, any change, any mount option update, any new future
introduction will propagate to all other procfs mounts that are in the
same pid namespace.

This may have solved several use cases in that time. However today we
face new requirements, and making procfs able to support new private
instances inside same pid namespace seems a major point. If we want to
to introduce new features and security mechanisms we have to make sure
first that we do not break existing usecases. Supporting private procfs
instances will allow to support new features and behaviour without
propagating it to all other procfs mounts.

Today procfs is more of a burden especially to some Embedded, IoT,
sandbox, container use cases. In user space we are over-mounting null
or inaccessible files on top to hide files and information. If we want
to hide pids we have to create PID namespaces otherwise mount options
propagate to all other proc mounts, changing a mount option value in one
mount will propagate to all other proc mounts. If we want to introduce
new features, then they will propagate to all other mounts too, resulting
either maybe new useful functionality or maybe breaking stuff. We have
also to note that userspace should not workaround procfs, the kernel
should just provide a sane simple interface.

In this regard several developers and maintainers pointed out that
there are problems with procfs and it has to be modernized:

"Here's another one: split up and modernize /proc." by Andy Lutomirski [1]

Discussion about kernel pointer leaks:

"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself." By Linus Torvalds [2]

Lot of other areas in the kernel and filesystems have been updated to be
able to support private instances, devpts is one major example [3].

Which will be used for:

1) Embedded systems and IoT: usually we have one supervisor for
apps, we have some lightweight sandbox support, however if we create
pid namespaces we have to manage all the processes inside too,
where our goal is to be able to run a bunch of apps each one inside
its own mount namespace, maybe use network namespaces for vlans
setups, but right now we only want mount namespaces, without all the
other complexity. We want procfs to behave more like a real file system,
and block access to inodes that belong to other users. The 'hidepid=' will
not work since it is a shared mount option.

2) Containers, sandboxes and Private instances of file systems - devpts case
Historically, lot of file systems inside Linux kernel view when instantiated
were just a mirror of an already created and mounted filesystem. This was the
case of devpts filesystem, it seems at that time the requirements were to
optimize things and reuse the same memory, etc. This design used to work but not
anymore with today's containers, IoT, hostile environments and all the privacy
challenges that Linux faces.

In that regards, devpts was updated so that each new mounts is a total
independent file system by the following patches:

"devpts: Make each mount of devpts an independent filesystem" by
Eric W. Biederman [3] [4]

3) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which user can not.

Side notes:

* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change all LSMs should be able to analyze
open/read/write/close... on /proc/<pid>/

4) This will allow to implement new features either in kernel or
userspace without having to worry about procfs.
In containers, sandboxes, etc we have workarounds to hide some /proc
inodes, this should be supported natively without doing extra complex
work, the kernel should be able to support sane options that work with
today and future Linux use cases.

5) Creation of new superblock with all procfs options for each procfs
mount will fix the ignoring of mount options. The problem is that the
second mount of procfs in the same pid namespace ignores the mount
options. The mount options are ignored without error until procfs is
remounted.

Before:

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0

# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+++ exited with 0 +++

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

# mount -o remount,hidepid=1 -t proc proc /tmp/proc

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=1 0 0
proc /tmp/proc proc rw,relatime,hidepid=1 0 0

After:

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0

# mount -o hidepid=1 -t proc proc /tmp/proc

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=1 0 0


Introduced changes:
-------------------
Each mount of procfs creates a separate procfs instance with its own
mount options.

This series adds few new mount options:

* New 'hidepid=4' mount option to show only ptraceable processes in the procfs.
This allows to support lightweight sandboxes in Embedded Linux, also
solves the case for LSM where now with this mount option, we make sure
that they have a ptrace path in procfs.

* 'subset=pidfs' that allows to hide non-pid inodes from procfs. It can be used
in containers and sandboxes, as these are already trying to hide and block
access to procfs inodes anyway.


ChangeLog:
----------
# v8:
* Started using RCU lock to clean dcache entries as Linus Torvalds suggested.

# v7:
* 'pidonly=1' renamed to 'subset=pidfs' as Alexey Dobriyan suggested.
* HIDEPID_* moved to uapi/ as they are user interface to mount().
  Suggested-by Alexey Dobriyan <adobriyan@gmail.com>

# v6:
* 'hidepid=' and 'gid=' mount options are moved from pid namespace to superblock.
* 'newinstance' mount option removed as Eric W. Biederman suggested.
   Mount of procfs always creates a new instance.
* 'limit_pids' renamed to 'hidepid=3'.
* I took into account the comment of Linus Torvalds [7].
* Documentation added.

# v5:
* Fixed a bug that caused a problem with the Fedora boot.
* The 'pidonly' option is visible among the mount options.

# v2:
* Renamed mount options to 'newinstance' and 'pids='
   Suggested-by: Andy Lutomirski <luto@kernel.org>
* Fixed order of commit, Suggested-by: Andy Lutomirski <luto@kernel.org>
* Many bug fixes.

# v1:
* Removed 'unshared' mount option and replaced it with 'limit_pids'
   which is attached to the current procfs mount.
   Suggested-by Andy Lutomirski <luto@kernel.org>
* Do not fill dcache with pid entries that we can not ptrace.
* Many bug fixes.


References:
-----------
[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] https://lwn.net/Articles/689539/
[4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
[5] https://lkml.org/lkml/2017/5/2/407
[6] https://lkml.org/lkml/2017/5/3/357
[7] https://lkml.org/lkml/2018/5/11/505


Alexey Gladkov (11):
  proc: Rename struct proc_fs_info to proc_fs_opts
  proc: add proc_fs_info struct to store proc information
  proc: move /proc/{self|thread-self} dentries to proc_fs_info
  proc: move hide_pid, pid_gid from pid_namespace to proc_fs_info
  proc: add helpers to set and get proc hidepid and gid mount options
  proc: support mounting procfs instances inside same pid namespace
  proc: flush task dcache entries from all procfs instances
  proc: instantiate only pids that we can ptrace on 'hidepid=4' mount
    option
  proc: add option to mount only a pids subset
  docs: proc: add documentation for "hidepid=4" and "subset=pidfs"
    options and new mount behavior
  proc: Move hidepid values to uapi as they are user interface to mount

 Documentation/filesystems/proc.txt |  53 +++++++++++
 fs/locks.c                         |   6 +-
 fs/proc/base.c                     |  66 ++++++++++----
 fs/proc/generic.c                  |   9 ++
 fs/proc/inode.c                    |  21 +++--
 fs/proc/internal.h                 |  30 ++++++
 fs/proc/root.c                     | 141 +++++++++++++++++++++++------
 fs/proc/self.c                     |   4 +-
 fs/proc/thread_self.c              |   6 +-
 fs/proc_namespace.c                |  14 +--
 include/linux/pid_namespace.h      |  14 +--
 include/linux/proc_fs.h            |  25 ++++-
 include/uapi/linux/proc_fs.h       |  13 +++
 13 files changed, 324 insertions(+), 78 deletions(-)
 create mode 100644 include/uapi/linux/proc_fs.h

-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 01/11] proc: Rename struct proc_fs_info to proc_fs_opts
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 02/11] proc: add proc_fs_info struct to store proc information Alexey Gladkov
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc_namespace.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..9a8b624bc3db 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -37,23 +37,23 @@ static __poll_t mounts_poll(struct file *file, poll_table *wait)
 	return res;
 }
 
-struct proc_fs_info {
+struct proc_fs_opts {
 	int flag;
 	const char *str;
 };
 
 static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 {
-	static const struct proc_fs_info fs_info[] = {
+	static const struct proc_fs_opts fs_opts[] = {
 		{ SB_SYNCHRONOUS, ",sync" },
 		{ SB_DIRSYNC, ",dirsync" },
 		{ SB_MANDLOCK, ",mand" },
 		{ SB_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
 		if (sb->s_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
@@ -63,7 +63,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 
 static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 {
-	static const struct proc_fs_info mnt_info[] = {
+	static const struct proc_fs_opts mnt_opts[] = {
 		{ MNT_NOSUID, ",nosuid" },
 		{ MNT_NODEV, ",nodev" },
 		{ MNT_NOEXEC, ",noexec" },
@@ -72,9 +72,9 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_RELATIME, ",relatime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 02/11] proc: add proc_fs_info struct to store proc information
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 01/11] proc: Rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info Alexey Gladkov
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This is a preparation patch that adds proc_fs_info to be able to store
different procfs options and informations. Right now some mount options
are stored inside the pid namespace which makes it hard to change or
modernize procfs without affecting pid namespaces. Plus we do want to
treat proc as more of a real mount point and filesystem. procfs is part
of Linux API where it offers some features using filesystem syscalls and
in order to support some features where we are able to have multiple
instances of procfs, each one with its mount options inside the same pid
namespace, we have to separate these procfs instances.

This is the same feature that was also added to other Linux interfaces
like devpts in order to support containers, sandboxes, and to have
multiple instances of devpts filesystem [1].

[1] https://elixir.bootlin.com/linux/v3.4/source/Documentation/filesystems/devpts.txt

Cc: Kees Cook <keescook@chromium.org>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/locks.c              |  6 +++--
 fs/proc/base.c          |  8 +++++--
 fs/proc/inode.c         |  4 ++--
 fs/proc/root.c          | 49 +++++++++++++++++++++++++++--------------
 include/linux/proc_fs.h | 11 ++++++++-
 5 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 6970f55daf54..21200e3005e4 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2795,7 +2795,8 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 {
 	struct inode *inode = NULL;
 	unsigned int fl_pid;
-	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(file_inode(f->file)->i_sb);
+	struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
 	fl_pid = locks_translate_pid(fl, proc_pidns);
 	/*
@@ -2873,7 +2874,8 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
-	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(file_inode(f->file)->i_sb);
+	struct pid_namespace *proc_pidns = fs_info->pid_ns;
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..672e71c52dbd 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3243,6 +3243,7 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 {
 	struct task_struct *task;
 	unsigned tgid;
+	struct proc_fs_info *fs_info;
 	struct pid_namespace *ns;
 	struct dentry *result = ERR_PTR(-ENOENT);
 
@@ -3250,7 +3251,8 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 	if (tgid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
 	if (task)
@@ -3538,6 +3540,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	struct task_struct *task;
 	struct task_struct *leader = get_proc_task(dir);
 	unsigned tid;
+	struct proc_fs_info *fs_info;
 	struct pid_namespace *ns;
 	struct dentry *result = ERR_PTR(-ENOENT);
 
@@ -3548,7 +3551,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	if (tid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tid, ns);
 	if (task)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index dbe43a50caf2..b631608dfbcc 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -104,8 +104,8 @@ void __init proc_init_kmemcache(void)
 
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
-	struct super_block *sb = root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
 
 	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0b7c8dffc9ae..d449f095f0f7 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -30,7 +30,7 @@
 #include "internal.h"
 
 struct proc_fs_context {
-	struct pid_namespace	*pid_ns;
+	struct proc_fs_info	*fs_info;
 	unsigned int		mask;
 	int			hidepid;
 	int			gid;
@@ -97,7 +97,8 @@ static void proc_apply_options(struct super_block *s,
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 {
-	struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct pid_namespace *pid_ns = get_pid_ns(ctx->fs_info->pid_ns);
 	struct inode *root_inode;
 	int ret;
 
@@ -145,7 +146,8 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
 	struct super_block *sb = fc->root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(sb);
+	struct pid_namespace *pid = fs_info->pid_ns;
 
 	sync_filesystem(sb);
 
@@ -157,14 +159,14 @@ static int proc_get_tree(struct fs_context *fc)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 
-	return get_tree_keyed(fc, proc_fill_super, ctx->pid_ns);
+	return get_tree_keyed(fc, proc_fill_super, ctx->fs_info);
 }
 
 static void proc_fs_context_free(struct fs_context *fc)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 
-	put_pid_ns(ctx->pid_ns);
+	put_pid_ns(ctx->fs_info->pid_ns);
 	kfree(ctx);
 }
 
@@ -178,14 +180,27 @@ static const struct fs_context_operations proc_fs_context_ops = {
 static int proc_init_fs_context(struct fs_context *fc)
 {
 	struct proc_fs_context *ctx;
+	struct pid_namespace *pid_ns;
 
 	ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
 
-	ctx->pid_ns = get_pid_ns(task_active_pid_ns(current));
+	pid_ns = get_pid_ns(task_active_pid_ns(current));
+
+	if (!pid_ns->proc_mnt) {
+		ctx->fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_KERNEL);
+		if (!ctx->fs_info) {
+			kfree(ctx);
+			return -ENOMEM;
+		}
+		ctx->fs_info->pid_ns = pid_ns;
+	} else {
+		ctx->fs_info = proc_sb_info(pid_ns->proc_mnt->mnt_sb);
+	}
+
 	put_user_ns(fc->user_ns);
-	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
+	fc->user_ns = get_user_ns(ctx->fs_info->pid_ns->user_ns);
 	fc->fs_private = ctx;
 	fc->ops = &proc_fs_context_ops;
 	return 0;
@@ -193,15 +208,15 @@ static int proc_init_fs_context(struct fs_context *fc)
 
 static void proc_kill_sb(struct super_block *sb)
 {
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
-	ns = (struct pid_namespace *)sb->s_fs_info;
-	if (ns->proc_self)
-		dput(ns->proc_self);
-	if (ns->proc_thread_self)
-		dput(ns->proc_thread_self);
+	if (fs_info->pid_ns->proc_self)
+		dput(fs_info->pid_ns->proc_self);
+	if (fs_info->pid_ns->proc_thread_self)
+		dput(fs_info->pid_ns->proc_thread_self);
 	kill_anon_super(sb);
-	put_pid_ns(ns);
+	put_pid_ns(fs_info->pid_ns);
+	kfree(fs_info);
 }
 
 static struct file_system_type proc_fs_type = {
@@ -314,10 +329,10 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
 	}
 
 	ctx = fc->fs_private;
-	if (ctx->pid_ns != ns) {
-		put_pid_ns(ctx->pid_ns);
+	if (ctx->fs_info->pid_ns != ns) {
+		put_pid_ns(ctx->fs_info->pid_ns);
 		get_pid_ns(ns);
-		ctx->pid_ns = ns;
+		ctx->fs_info->pid_ns = ns;
 	}
 
 	mnt = fc_mount(fc);
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index a705aa2d03f9..2d79489e55aa 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -12,6 +12,15 @@ struct proc_dir_entry;
 struct seq_file;
 struct seq_operations;
 
+struct proc_fs_info {
+	struct pid_namespace *pid_ns;
+};
+
+static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 #ifdef CONFIG_PROC_FS
 
 typedef int (*proc_write_t)(struct file *, char *, size_t);
@@ -146,7 +155,7 @@ int open_related_ns(struct ns_common *ns,
 /* get the associated pid namespace for a file in procfs */
 static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
 {
-	return inode->i_sb->s_fs_info;
+	return proc_sb_info(inode->i_sb)->pid_ns;
 }
 
 #endif /* _LINUX_PROC_FS_H */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 01/11] proc: Rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 02/11] proc: add proc_fs_info struct to store proc information Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 18:23   ` Andy Lutomirski
  2020-02-10 15:05 ` [PATCH v8 04/11] proc: move hide_pid, pid_gid from pid_namespace " Alexey Gladkov
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This is a preparation patch that moves /proc/{self|thread-self} dentries
to be stored inside procfs fs_info struct instead of making them per pid
namespace. Since we want to support multiple procfs instances we need to
make sure that these dentries are also per-superblock instead of
per-pidns, unmounting a private procfs won't clash with other procfs
mounts.

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/base.c                | 5 +++--
 fs/proc/root.c                | 8 ++++----
 fs/proc/self.c                | 4 ++--
 fs/proc/thread_self.c         | 6 +++---
 include/linux/pid_namespace.h | 4 +---
 include/linux/proc_fs.h       | 2 ++
 6 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 672e71c52dbd..1eb366ad8b06 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3316,6 +3316,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct tgid_iter iter;
+	struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
 	struct pid_namespace *ns = proc_pid_ns(file_inode(file));
 	loff_t pos = ctx->pos;
 
@@ -3323,13 +3324,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		return 0;
 
 	if (pos == TGID_OFFSET - 2) {
-		struct inode *inode = d_inode(ns->proc_self);
+		struct inode *inode = d_inode(fs_info->proc_self);
 		if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
 			return 0;
 		ctx->pos = pos = pos + 1;
 	}
 	if (pos == TGID_OFFSET - 1) {
-		struct inode *inode = d_inode(ns->proc_thread_self);
+		struct inode *inode = d_inode(fs_info->proc_thread_self);
 		if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
 			return 0;
 		ctx->pos = pos = pos + 1;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d449f095f0f7..637e26cc795e 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -210,10 +210,10 @@ static void proc_kill_sb(struct super_block *sb)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
-	if (fs_info->pid_ns->proc_self)
-		dput(fs_info->pid_ns->proc_self);
-	if (fs_info->pid_ns->proc_thread_self)
-		dput(fs_info->pid_ns->proc_thread_self);
+	if (fs_info->proc_self)
+		dput(fs_info->proc_self);
+	if (fs_info->proc_thread_self)
+		dput(fs_info->proc_thread_self);
 	kill_anon_super(sb);
 	put_pid_ns(fs_info->pid_ns);
 	kfree(fs_info);
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 57c0a1047250..846fc2b7c8a8 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -36,7 +36,7 @@ static unsigned self_inum __ro_after_init;
 int proc_setup_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = proc_pid_ns(root_inode);
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 	struct dentry *self;
 	int ret = -ENOMEM;
 	
@@ -62,7 +62,7 @@ int proc_setup_self(struct super_block *s)
 	if (ret)
 		pr_err("proc_fill_super: can't allocate /proc/self\n");
 	else
-		ns->proc_self = self;
+		fs_info->proc_self = self;
 
 	return ret;
 }
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index f61ae53533f5..2493cbbdfa6f 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -36,7 +36,7 @@ static unsigned thread_self_inum __ro_after_init;
 int proc_setup_thread_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = proc_pid_ns(root_inode);
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 	struct dentry *thread_self;
 	int ret = -ENOMEM;
 
@@ -60,9 +60,9 @@ int proc_setup_thread_self(struct super_block *s)
 	inode_unlock(root_inode);
 
 	if (ret)
-		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
+		pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
 	else
-		ns->proc_thread_self = thread_self;
+		fs_info->proc_thread_self = thread_self;
 
 	return ret;
 }
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..f91a8bf6e09e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -31,9 +31,7 @@ struct pid_namespace {
 	unsigned int level;
 	struct pid_namespace *parent;
 #ifdef CONFIG_PROC_FS
-	struct vfsmount *proc_mnt;
-	struct dentry *proc_self;
-	struct dentry *proc_thread_self;
+	struct vfsmount *proc_mnt; /* Internal proc mounted during each new pidns */
 #endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct fs_pin *bacct;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2d79489e55aa..59162988998e 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -14,6 +14,8 @@ struct seq_operations;
 
 struct proc_fs_info {
 	struct pid_namespace *pid_ns;
+	struct dentry *proc_self;        /* For /proc/self */
+	struct dentry *proc_thread_self; /* For /proc/thread-self */
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 04/11] proc: move hide_pid, pid_gid from pid_namespace to proc_fs_info
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (2 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info Alexey Gladkov
@ 2020-02-10 15:05 ` " Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options Alexey Gladkov
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This is a preparation patch that moves hide_pid and pid_gid parameters
to be stored inside procfs fs_info struct instead of making them per pid
namespace. Since we want to support multiple procfs instances we need to
make sure that all proc-specific parameters are also per-superblock.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/base.c                | 18 +++++++++---------
 fs/proc/inode.c               |  9 ++++-----
 fs/proc/root.c                | 10 ++++++++--
 include/linux/pid_namespace.h |  8 --------
 include/linux/proc_fs.h       |  9 +++++++++
 5 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1eb366ad8b06..caca1929fee1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -695,13 +695,13 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
  */
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
-	if (pid->hide_pid < hide_pid_min)
+	if (fs_info->hide_pid < hide_pid_min)
 		return true;
-	if (in_group_p(pid->pid_gid))
+	if (in_group_p(fs_info->pid_gid))
 		return true;
 	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -709,18 +709,18 @@ static bool has_pid_permissions(struct pid_namespace *pid,
 
 static int proc_pid_permission(struct inode *inode, int mask)
 {
-	struct pid_namespace *pid = proc_pid_ns(inode);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct task_struct *task;
 	bool has_perms;
 
 	task = get_proc_task(inode);
 	if (!task)
 		return -ESRCH;
-	has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+	has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
 	put_task_struct(task);
 
 	if (!has_perms) {
-		if (pid->hide_pid == HIDEPID_INVISIBLE) {
+		if (fs_info->hide_pid == HIDEPID_INVISIBLE) {
 			/*
 			 * Let's make getdents(), stat(), and open()
 			 * consistent with each other.  If a process
@@ -1784,7 +1784,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_inode(path->dentry);
-	struct pid_namespace *pid = proc_pid_ns(inode);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct task_struct *task;
 
 	generic_fillattr(inode, stat);
@@ -1794,7 +1794,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 	rcu_read_lock();
 	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 	if (task) {
-		if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
 			rcu_read_unlock();
 			/*
 			 * This doesn't prevent learning whether PID exists,
@@ -3344,7 +3344,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		unsigned int len;
 
 		cond_resched();
-		if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
 		len = snprintf(name, sizeof(name), "%u", iter.tgid);
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index b631608dfbcc..b90c233e5968 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -105,12 +105,11 @@ void __init proc_init_kmemcache(void)
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
-	struct pid_namespace *pid = fs_info->pid_ns;
 
-	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
-		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
-	if (pid->hide_pid != HIDEPID_OFF)
-		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+	if (!gid_eq(fs_info->pid_gid, GLOBAL_ROOT_GID))
+		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
+	if (fs_info->hide_pid != HIDEPID_OFF)
+		seq_printf(seq, ",hidepid=%u", fs_info->hide_pid);
 
 	return 0;
 }
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 637e26cc795e..1ca47d446aa4 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -89,10 +89,16 @@ static void proc_apply_options(struct super_block *s,
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 
+	if (pid_ns->proc_mnt) {
+		struct proc_fs_info *fs_info = proc_sb_info(pid_ns->proc_mnt->mnt_sb);
+		ctx->fs_info->pid_gid = fs_info->pid_gid;
+		ctx->fs_info->hide_pid = fs_info->hide_pid;
+	}
+
 	if (ctx->mask & (1 << Opt_gid))
-		pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+		ctx->fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
-		pid_ns->hide_pid = ctx->hidepid;
+		ctx->fs_info->hide_pid = ctx->hidepid;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index f91a8bf6e09e..66f47f1afe0d 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -15,12 +15,6 @@
 
 struct fs_pin;
 
-enum { /* definitions for pid_namespace's hide_pid field */
-	HIDEPID_OFF	  = 0,
-	HIDEPID_NO_ACCESS = 1,
-	HIDEPID_INVISIBLE = 2,
-};
-
 struct pid_namespace {
 	struct kref kref;
 	struct idr idr;
@@ -39,8 +33,6 @@ struct pid_namespace {
 	struct user_namespace *user_ns;
 	struct ucounts *ucounts;
 	struct work_struct proc_work;
-	kgid_t pid_gid;
-	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 } __randomize_layout;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 59162988998e..5f0b1b7e4271 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -12,10 +12,19 @@ struct proc_dir_entry;
 struct seq_file;
 struct seq_operations;
 
+/* definitions for hide_pid field */
+enum {
+	HIDEPID_OFF	  = 0,
+	HIDEPID_NO_ACCESS = 1,
+	HIDEPID_INVISIBLE = 2,
+};
+
 struct proc_fs_info {
 	struct pid_namespace *pid_ns;
 	struct dentry *proc_self;        /* For /proc/self */
 	struct dentry *proc_thread_self; /* For /proc/thread-self */
+	kgid_t pid_gid;
+	int hide_pid;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (3 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 04/11] proc: move hide_pid, pid_gid from pid_namespace " Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 18:30   ` Andy Lutomirski
  2020-02-10 15:05 ` [PATCH v8 06/11] proc: support mounting procfs instances inside same pid namespace Alexey Gladkov
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This is a cleaning patch to add helpers to set and get proc mount
options instead of directly using them. This make it easy to track
what's happening and easy to update in future.

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/base.c     |  6 +++---
 fs/proc/inode.c    | 11 +++++++----
 fs/proc/internal.h | 20 ++++++++++++++++++++
 fs/proc/root.c     |  8 ++++----
 4 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index caca1929fee1..4ccb280a3e79 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -699,9 +699,9 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
-	if (fs_info->hide_pid < hide_pid_min)
+	if (proc_fs_hide_pid(fs_info) < hide_pid_min)
 		return true;
-	if (in_group_p(fs_info->pid_gid))
+	if (in_group_p(proc_fs_pid_gid(fs_info)))
 		return true;
 	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -720,7 +720,7 @@ static int proc_pid_permission(struct inode *inode, int mask)
 	put_task_struct(task);
 
 	if (!has_perms) {
-		if (fs_info->hide_pid == HIDEPID_INVISIBLE) {
+		if (proc_fs_hide_pid(fs_info) == HIDEPID_INVISIBLE) {
 			/*
 			 * Let's make getdents(), stat(), and open()
 			 * consistent with each other.  If a process
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index b90c233e5968..70b722fb8811 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -105,11 +105,14 @@ void __init proc_init_kmemcache(void)
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
+	int hidepid = proc_fs_hide_pid(fs_info);
+	kgid_t gid = proc_fs_pid_gid(fs_info);
 
-	if (!gid_eq(fs_info->pid_gid, GLOBAL_ROOT_GID))
-		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
-	if (fs_info->hide_pid != HIDEPID_OFF)
-		seq_printf(seq, ",hidepid=%u", fs_info->hide_pid);
+	if (!gid_eq(gid, GLOBAL_ROOT_GID))
+		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, gid));
+
+	if (hidepid != HIDEPID_OFF)
+		seq_printf(seq, ",hidepid=%u", hidepid);
 
 	return 0;
 }
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..ff2f274b2e0d 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -121,6 +121,26 @@ static inline struct task_struct *get_proc_task(const struct inode *inode)
 	return get_pid_task(proc_pid(inode), PIDTYPE_PID);
 }
 
+static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_pid)
+{
+	fs_info->hide_pid = hide_pid;
+}
+
+static inline void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid)
+{
+	fs_info->pid_gid = gid;
+}
+
+static inline int proc_fs_hide_pid(struct proc_fs_info *fs_info)
+{
+	return fs_info->hide_pid;
+}
+
+static inline kgid_t proc_fs_pid_gid(struct proc_fs_info *fs_info)
+{
+	return fs_info->pid_gid;
+}
+
 void task_dump_owner(struct task_struct *task, umode_t mode,
 		     kuid_t *ruid, kgid_t *rgid);
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 1ca47d446aa4..efd76c004e86 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -91,14 +91,14 @@ static void proc_apply_options(struct super_block *s,
 
 	if (pid_ns->proc_mnt) {
 		struct proc_fs_info *fs_info = proc_sb_info(pid_ns->proc_mnt->mnt_sb);
-		ctx->fs_info->pid_gid = fs_info->pid_gid;
-		ctx->fs_info->hide_pid = fs_info->hide_pid;
+		proc_fs_set_pid_gid(ctx->fs_info, proc_fs_pid_gid(fs_info));
+		proc_fs_set_hide_pid(ctx->fs_info, proc_fs_hide_pid(fs_info));
 	}
 
 	if (ctx->mask & (1 << Opt_gid))
-		ctx->fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
+		proc_fs_set_pid_gid(ctx->fs_info, make_kgid(user_ns, ctx->gid));
 	if (ctx->mask & (1 << Opt_hidepid))
-		ctx->fs_info->hide_pid = ctx->hidepid;
+		proc_fs_set_hide_pid(ctx->fs_info, ctx->hidepid);
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 06/11] proc: support mounting procfs instances inside same pid namespace
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (4 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances Alexey Gladkov
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.

1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.

2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.

Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...

In the new patchset version I removed the 'newinstance' option
as Eric W. Biederman suggested.

Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/root.c | 41 ++++++++++++++++++-----------------------
 1 file changed, 18 insertions(+), 23 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index efd76c004e86..5d5cba4c899b 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -82,7 +82,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct super_block *s,
+static void proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
 			       struct pid_namespace *pid_ns,
 			       struct user_namespace *user_ns)
@@ -90,15 +90,17 @@ static void proc_apply_options(struct super_block *s,
 	struct proc_fs_context *ctx = fc->fs_private;
 
 	if (pid_ns->proc_mnt) {
-		struct proc_fs_info *fs_info = proc_sb_info(pid_ns->proc_mnt->mnt_sb);
-		proc_fs_set_pid_gid(ctx->fs_info, proc_fs_pid_gid(fs_info));
-		proc_fs_set_hide_pid(ctx->fs_info, proc_fs_hide_pid(fs_info));
+		struct proc_fs_info *pidns_fs_info = proc_sb_info(pid_ns->proc_mnt->mnt_sb);
+
+		proc_fs_set_pid_gid(fs_info, proc_fs_pid_gid(pidns_fs_info));
+		proc_fs_set_hide_pid(fs_info, proc_fs_hide_pid(pidns_fs_info));
 	}
 
 	if (ctx->mask & (1 << Opt_gid))
-		proc_fs_set_pid_gid(ctx->fs_info, make_kgid(user_ns, ctx->gid));
+		proc_fs_set_pid_gid(fs_info, make_kgid(user_ns, ctx->gid));
+
 	if (ctx->mask & (1 << Opt_hidepid))
-		proc_fs_set_hide_pid(ctx->fs_info, ctx->hidepid);
+		proc_fs_set_hide_pid(fs_info, ctx->hidepid);
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -108,7 +110,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	struct inode *root_inode;
 	int ret;
 
-	proc_apply_options(s, fc, pid_ns, current_user_ns());
+	proc_apply_options(ctx->fs_info, fc, pid_ns, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -118,6 +120,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_magic = PROC_SUPER_MAGIC;
 	s->s_op = &proc_sops;
 	s->s_time_gran = 1;
+	s->s_fs_info = ctx->fs_info;
 
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
@@ -157,15 +160,13 @@ static int proc_reconfigure(struct fs_context *fc)
 
 	sync_filesystem(sb);
 
-	proc_apply_options(sb, fc, pid, current_user_ns());
+	proc_apply_options(fs_info, fc, pid, current_user_ns());
 	return 0;
 }
 
 static int proc_get_tree(struct fs_context *fc)
 {
-	struct proc_fs_context *ctx = fc->fs_private;
-
-	return get_tree_keyed(fc, proc_fill_super, ctx->fs_info);
+	return get_tree_nodev(fc, proc_fill_super);
 }
 
 static void proc_fs_context_free(struct fs_context *fc)
@@ -186,25 +187,19 @@ static const struct fs_context_operations proc_fs_context_ops = {
 static int proc_init_fs_context(struct fs_context *fc)
 {
 	struct proc_fs_context *ctx;
-	struct pid_namespace *pid_ns;
 
 	ctx = kzalloc(sizeof(struct proc_fs_context), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
 
-	pid_ns = get_pid_ns(task_active_pid_ns(current));
-
-	if (!pid_ns->proc_mnt) {
-		ctx->fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_KERNEL);
-		if (!ctx->fs_info) {
-			kfree(ctx);
-			return -ENOMEM;
-		}
-		ctx->fs_info->pid_ns = pid_ns;
-	} else {
-		ctx->fs_info = proc_sb_info(pid_ns->proc_mnt->mnt_sb);
+	ctx->fs_info = kzalloc(sizeof(struct proc_fs_info), GFP_KERNEL);
+	if (!ctx->fs_info) {
+		kfree(ctx);
+		return -ENOMEM;
 	}
 
+	ctx->fs_info->pid_ns = get_pid_ns(task_active_pid_ns(current));
+
 	put_user_ns(fc->user_ns);
 	fc->user_ns = get_user_ns(ctx->fs_info->pid_ns->user_ns);
 	fc->fs_private = ctx;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (5 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 06/11] proc: support mounting procfs instances inside same pid namespace Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 17:46   ` Linus Torvalds
                     ` (2 more replies)
  2020-02-10 15:05 ` [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
                   ` (3 subsequent siblings)
  10 siblings, 3 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This allows to flush dcache entries of a task on multiple procfs mounts
per pid namespace.

The RCU lock is used because the number of reads at the task exit time
is much larger than the number of procfs mounts.

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/base.c                | 20 +++++++++++++++-----
 fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
 include/linux/pid_namespace.h |  2 ++
 include/linux/proc_fs.h       |  2 ++
 4 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4ccb280a3e79..24b7c620ded3 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
 	.permission	= proc_pid_permission,
 };
 
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
+static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
 {
 	struct dentry *dentry, *leader, *dir;
 	char buf[10 + 1];
@@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 	name.name = buf;
 	name.len = snprintf(buf, sizeof(buf), "%u", pid);
 	/* no ->d_hash() rejects on procfs */
-	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
+	dentry = d_hash_and_lookup(mnt_root, &name);
 	if (dentry) {
 		d_invalidate(dentry);
 		dput(dentry);
@@ -3153,7 +3153,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 
 	name.name = buf;
 	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
-	leader = d_hash_and_lookup(mnt->mnt_root, &name);
+	leader = d_hash_and_lookup(mnt_root, &name);
 	if (!leader)
 		goto out;
 
@@ -3208,14 +3208,24 @@ void proc_flush_task(struct task_struct *task)
 	int i;
 	struct pid *pid, *tgid;
 	struct upid *upid;
+	struct dentry *mnt_root;
+	struct proc_fs_info *fs_info;
 
 	pid = task_pid(task);
 	tgid = task_tgid(task);
 
 	for (i = 0; i <= pid->level; i++) {
 		upid = &pid->numbers[i];
-		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
-					tgid->numbers[i].nr);
+
+		rcu_read_lock();
+		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
+			mnt_root = fs_info->m_super->s_root;
+			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
+		}
+		rcu_read_unlock();
+
+		mnt_root = upid->ns->proc_mnt->mnt_root;
+		proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
 	}
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5d5cba4c899b..e2bb015da1a8 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -149,7 +149,22 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	if (ret) {
 		return ret;
 	}
-	return proc_setup_thread_self(s);
+
+	ret = proc_setup_thread_self(s);
+	if (ret) {
+		return ret;
+	}
+
+	/*
+	 * back reference to flush dcache entries at process exit time.
+	 */
+	ctx->fs_info->m_super = s;
+
+	spin_lock(&pid_ns->proc_mounts_lock);
+	list_add_tail_rcu(&ctx->fs_info->pidns_entry, &pid_ns->proc_mounts);
+	spin_unlock(&pid_ns->proc_mounts_lock);
+
+	return 0;
 }
 
 static int proc_reconfigure(struct fs_context *fc)
@@ -211,10 +226,17 @@ static void proc_kill_sb(struct super_block *sb)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
+	spin_lock(&fs_info->pid_ns->proc_mounts_lock);
+	list_del_rcu(&fs_info->pidns_entry);
+	spin_unlock(&fs_info->pid_ns->proc_mounts_lock);
+
+	synchronize_rcu();
+
 	if (fs_info->proc_self)
 		dput(fs_info->proc_self);
 	if (fs_info->proc_thread_self)
 		dput(fs_info->proc_thread_self);
+
 	kill_anon_super(sb);
 	put_pid_ns(fs_info->pid_ns);
 	kfree(fs_info);
@@ -336,6 +358,9 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
 		ctx->fs_info->pid_ns = ns;
 	}
 
+	spin_lock_init(&ns->proc_mounts_lock);
+	INIT_LIST_HEAD_RCU(&ns->proc_mounts);
+
 	mnt = fc_mount(fc);
 	put_fs_context(fc);
 	if (IS_ERR(mnt))
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 66f47f1afe0d..c36af1dfd862 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -26,6 +26,8 @@ struct pid_namespace {
 	struct pid_namespace *parent;
 #ifdef CONFIG_PROC_FS
 	struct vfsmount *proc_mnt; /* Internal proc mounted during each new pidns */
+	spinlock_t proc_mounts_lock;
+	struct list_head proc_mounts; /* list of separated procfs mounts */
 #endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct fs_pin *bacct;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 5f0b1b7e4271..f307940f8311 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -20,6 +20,8 @@ enum {
 };
 
 struct proc_fs_info {
+	struct list_head pidns_entry;    /* Node in procfs_mounts of a pidns */
+	struct super_block *m_super;
 	struct pid_namespace *pid_ns;
 	struct dentry *proc_self;        /* For /proc/self */
 	struct dentry *proc_thread_self; /* For /proc/thread-self */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (6 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 16:29   ` Jordan Glover
  2020-02-10 15:05 ` [PATCH v8 09/11] proc: add option to mount only a pids subset Alexey Gladkov
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

If "hidepid=4" mount option is set then do not instantiate pids that
we can not ptrace. "hidepid=4" means that procfs should only contain
pids that the caller can ptrace.

Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/base.c          | 15 +++++++++++++++
 fs/proc/root.c          | 14 +++++++++++---
 include/linux/proc_fs.h |  1 +
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 24b7c620ded3..49937d54e745 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -699,6 +699,14 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
+	/*
+	 * If 'hidpid' mount option is set force a ptrace check,
+	 * we indicate that we are using a filesystem syscall
+	 * by passing PTRACE_MODE_READ_FSCREDS
+	 */
+	if (proc_fs_hide_pid(fs_info) == HIDEPID_NOT_PTRACABLE)
+		return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+
 	if (proc_fs_hide_pid(fs_info) < hide_pid_min)
 		return true;
 	if (in_group_p(proc_fs_pid_gid(fs_info)))
@@ -3271,7 +3279,14 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 	if (!task)
 		goto out;
 
+	/* Limit procfs to only ptracable tasks */
+	if (proc_fs_hide_pid(fs_info) == HIDEPID_NOT_PTRACABLE) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS))
+			goto out_put_task;
+	}
+
 	result = proc_pid_instantiate(dentry, task, NULL);
+out_put_task:
 	put_task_struct(task);
 out:
 	return result;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index e2bb015da1a8..5e27bb31f125 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -52,6 +52,15 @@ static const struct fs_parameter_description proc_fs_parameters = {
 	.specs		= proc_param_specs,
 };
 
+static inline int
+valid_hidepid(unsigned int value)
+{
+	return (value == HIDEPID_OFF ||
+		value == HIDEPID_NO_ACCESS ||
+		value == HIDEPID_INVISIBLE ||
+		value == HIDEPID_NOT_PTRACABLE);
+}
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -68,10 +77,9 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		break;
 
 	case Opt_hidepid:
+		if (!valid_hidepid(result.uint_32))
+			return invalf(fc, "proc: unknown value of hidepid.\n");
 		ctx->hidepid = result.uint_32;
-		if (ctx->hidepid < HIDEPID_OFF ||
-		    ctx->hidepid > HIDEPID_INVISIBLE)
-			return invalf(fc, "proc: hidepid value must be between 0 and 2.\n");
 		break;
 
 	default:
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index f307940f8311..6822548405a7 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -17,6 +17,7 @@ enum {
 	HIDEPID_OFF	  = 0,
 	HIDEPID_NO_ACCESS = 1,
 	HIDEPID_INVISIBLE = 2,
+	HIDEPID_NOT_PTRACABLE = 4, /* Limit pids to only ptracable pids */
 };
 
 struct proc_fs_info {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 09/11] proc: add option to mount only a pids subset
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (7 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior Alexey Gladkov
  2020-02-10 15:05 ` [PATCH v8 11/11] proc: Move hidepid values to uapi as they are user interface to mount Alexey Gladkov
  10 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

This allows to hide all files and directories in the procfs that are not
related to tasks.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 fs/proc/generic.c       |  9 +++++++++
 fs/proc/inode.c         |  7 +++++++
 fs/proc/internal.h      | 10 ++++++++++
 fs/proc/root.c          | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/proc_fs.h |  7 +++++++
 5 files changed, 69 insertions(+)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 64e9ee1b129e..6f6517d63053 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -267,6 +267,11 @@ struct dentry *proc_lookup_de(struct inode *dir, struct dentry *dentry,
 struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
 		unsigned int flags)
 {
+	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
+
+	if (proc_fs_pidonly(fs_info) == PROC_PIDONLY_ON)
+		return ERR_PTR(-ENOENT);
+
 	return proc_lookup_de(dir, dentry, PDE(dir));
 }
 
@@ -323,6 +328,10 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
+
+	if (proc_fs_pidonly(fs_info) == PROC_PIDONLY_ON)
+		return 1;
 
 	return proc_readdir_de(file, ctx, PDE(inode));
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 70b722fb8811..f35eef117775 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -114,6 +114,9 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 	if (hidepid != HIDEPID_OFF)
 		seq_printf(seq, ",hidepid=%u", hidepid);
 
+	if (proc_fs_pidonly(fs_info) != PROC_PIDONLY_OFF)
+		seq_printf(seq, ",subset=pidfs");
+
 	return 0;
 }
 
@@ -333,12 +336,16 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
 
 static int proc_reg_open(struct inode *inode, struct file *file)
 {
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct proc_dir_entry *pde = PDE(inode);
 	int rv = 0;
 	typeof_member(struct file_operations, open) open;
 	typeof_member(struct file_operations, release) release;
 	struct pde_opener *pdeo;
 
+	if (proc_fs_pidonly(fs_info) == PROC_PIDONLY_ON)
+		return -ENOENT;
+
 	/*
 	 * Ensure that
 	 * 1) PDE's ->release hook will be called no matter what
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index ff2f274b2e0d..e2c729267317 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -126,6 +126,11 @@ static inline void proc_fs_set_hide_pid(struct proc_fs_info *fs_info, int hide_p
 	fs_info->hide_pid = hide_pid;
 }
 
+static inline void proc_fs_set_pidonly(struct proc_fs_info *fs_info, int value)
+{
+	fs_info->pidonly = value;
+}
+
 static inline void proc_fs_set_pid_gid(struct proc_fs_info *fs_info, kgid_t gid)
 {
 	fs_info->pid_gid = gid;
@@ -141,6 +146,11 @@ static inline kgid_t proc_fs_pid_gid(struct proc_fs_info *fs_info)
 	return fs_info->pid_gid;
 }
 
+static inline int proc_fs_pidonly(struct proc_fs_info *fs_info)
+{
+	return fs_info->pidonly;
+}
+
 void task_dump_owner(struct task_struct *task, umode_t mode,
 		     kuid_t *ruid, kgid_t *rgid);
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e27bb31f125..4dce77639c2b 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -34,16 +34,19 @@ struct proc_fs_context {
 	unsigned int		mask;
 	int			hidepid;
 	int			gid;
+	int			pidonly;
 };
 
 enum proc_param {
 	Opt_gid,
 	Opt_hidepid,
+	Opt_subset,
 };
 
 static const struct fs_parameter_spec proc_param_specs[] = {
 	fsparam_u32("gid",	Opt_gid),
 	fsparam_u32("hidepid",	Opt_hidepid),
+	fsparam_string("subset",	Opt_subset),
 	{}
 };
 
@@ -61,6 +64,30 @@ valid_hidepid(unsigned int value)
 		value == HIDEPID_NOT_PTRACABLE);
 }
 
+static inline int
+proc_parse_subset_param(struct fs_context *fc, char *value)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+
+	while (value) {
+		char *ptr = strchr(value, ',');
+
+		if (ptr != NULL)
+			*ptr++ = '\0';
+
+		if (*value != '\0') {
+			if (!strcmp(value, "pidfs")) {
+				ctx->pidonly = PROC_PIDONLY_ON;
+			} else {
+				return invalf(fc, "proc: unsupported subset option - %s\n", value);
+			}
+		}
+		value = ptr;
+	}
+
+	return 0;
+}
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -82,6 +109,11 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->hidepid = result.uint_32;
 		break;
 
+	case Opt_subset:
+		if (proc_parse_subset_param(fc, param->string) < 0)
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -102,6 +134,7 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 
 		proc_fs_set_pid_gid(fs_info, proc_fs_pid_gid(pidns_fs_info));
 		proc_fs_set_hide_pid(fs_info, proc_fs_hide_pid(pidns_fs_info));
+		proc_fs_set_pidonly(fs_info, proc_fs_pidonly(pidns_fs_info));
 	}
 
 	if (ctx->mask & (1 << Opt_gid))
@@ -109,6 +142,9 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 
 	if (ctx->mask & (1 << Opt_hidepid))
 		proc_fs_set_hide_pid(fs_info, ctx->hidepid);
+
+	if (ctx->mask & (1 << Opt_subset))
+		proc_fs_set_pidonly(fs_info, ctx->pidonly);
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 6822548405a7..3ad0a47c3556 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -20,6 +20,12 @@ enum {
 	HIDEPID_NOT_PTRACABLE = 4, /* Limit pids to only ptracable pids */
 };
 
+/* definitions for proc mount option pidonly */
+enum {
+	PROC_PIDONLY_OFF = 0,
+	PROC_PIDONLY_ON  = 1,
+};
+
 struct proc_fs_info {
 	struct list_head pidns_entry;    /* Node in procfs_mounts of a pidns */
 	struct super_block *m_super;
@@ -28,6 +34,7 @@ struct proc_fs_info {
 	struct dentry *proc_thread_self; /* For /proc/thread-self */
 	kgid_t pid_gid;
 	int hide_pid;
+	int pidonly;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (8 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 09/11] proc: add option to mount only a pids subset Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  2020-02-10 18:29   ` Andy Lutomirski
  2020-02-10 15:05 ` [PATCH v8 11/11] proc: Move hidepid values to uapi as they are user interface to mount Alexey Gladkov
  10 siblings, 1 reply; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 Documentation/filesystems/proc.txt | 53 ++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 99ca040e3f90..4741fd092f36 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -50,6 +50,8 @@ Table of Contents
   4	Configuring procfs
   4.1	Mount options
 
+  5	Filesystem behavior
+
 ------------------------------------------------------------------------------
 Preface
 ------------------------------------------------------------------------------
@@ -2021,6 +2023,7 @@ The following mount options are supported:
 
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
+	subset=		Show only the specified subset of procfs.
 
 hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
 (default).
@@ -2042,6 +2045,56 @@ information about running processes, whether some daemon runs with elevated
 privileges, whether other user runs some sensitive program, whether other users
 run any program at all, etc.
 
+hidepid=4 means that procfs should only contain /proc/<pid>/ directories
+that the caller can ptrace.
+
 gid= defines a group authorized to learn processes information otherwise
 prohibited by hidepid=.  If you use some daemon like identd which needs to learn
 information about processes information, just add identd to this group.
+
+subset=pidfs hides all top level files and directories in the procfs that
+are not related to tasks.
+
+------------------------------------------------------------------------------
+5 Filesystem behavior
+------------------------------------------------------------------------------
+
+Originally, before the advent of pid namepsace, procfs was a global file
+system. It means that there was only one procfs instance in the system.
+
+When pid namespace was added, a separate procfs instance was mounted in
+each pid namespace. So, procfs mount options are global among all
+mountpoints within the same namespace.
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+
+# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
+mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
++++ exited with 0 +++
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+proc /tmp/proc proc rw,relatime,hidepid=2 0 0
+
+and only after remounting procfs mount options will change at all
+mountpoints.
+
+# mount -o remount,hidepid=1 -t proc proc /tmp/proc
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=1 0 0
+proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+
+This behavior is different from the behavior of other filesystems.
+
+The new procfs behavior is more like other filesystems. Each procfs mount
+creates a new procfs instance. Mount options affect own procfs instance.
+It means that it became possible to have several procfs instances
+displaying tasks with different filtering options in one pid namespace.
+
+# mount -o hidepid=2 -t proc proc /proc
+# mount -o hidepid=1 -t proc proc /tmp/proc
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+proc /tmp/proc proc rw,relatime,hidepid=1 0 0
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v8 11/11] proc: Move hidepid values to uapi as they are user interface to mount
  2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (9 preceding siblings ...)
  2020-02-10 15:05 ` [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior Alexey Gladkov
@ 2020-02-10 15:05 ` Alexey Gladkov
  10 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-10 15:05 UTC (permalink / raw)
  To: LKML, Kernel Hardening, Linux API, Linux FS Devel, Linux Security Module
  Cc: Akinobu Mita, Alexander Viro, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
---
 include/linux/proc_fs.h      |  9 +--------
 include/uapi/linux/proc_fs.h | 13 +++++++++++++
 2 files changed, 14 insertions(+), 8 deletions(-)
 create mode 100644 include/uapi/linux/proc_fs.h

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 3ad0a47c3556..f2b4a411d371 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -7,19 +7,12 @@
 
 #include <linux/types.h>
 #include <linux/fs.h>
+#include <uapi/linux/proc_fs.h>
 
 struct proc_dir_entry;
 struct seq_file;
 struct seq_operations;
 
-/* definitions for hide_pid field */
-enum {
-	HIDEPID_OFF	  = 0,
-	HIDEPID_NO_ACCESS = 1,
-	HIDEPID_INVISIBLE = 2,
-	HIDEPID_NOT_PTRACABLE = 4, /* Limit pids to only ptracable pids */
-};
-
 /* definitions for proc mount option pidonly */
 enum {
 	PROC_PIDONLY_OFF = 0,
diff --git a/include/uapi/linux/proc_fs.h b/include/uapi/linux/proc_fs.h
new file mode 100644
index 000000000000..1e3374efffe2
--- /dev/null
+++ b/include/uapi/linux/proc_fs.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_PROC_FS_H
+#define _UAPI_PROC_FS_H
+
+/* definitions for hide_pid field */
+enum {
+	HIDEPID_OFF           = 0,
+	HIDEPID_NO_ACCESS     = 1,
+	HIDEPID_INVISIBLE     = 2,
+	HIDEPID_NOT_PTRACABLE = 4,
+};
+
+#endif
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  2020-02-10 15:05 ` [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
@ 2020-02-10 16:29   ` Jordan Glover
  2020-02-12 14:34     ` Alexey Gladkov
  0 siblings, 1 reply; 67+ messages in thread
From: Jordan Glover @ 2020-02-10 16:29 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Linus Torvalds, Oleg Nesterov,
	Solar Designer

On Monday, February 10, 2020 3:05 PM, Alexey Gladkov <gladkov.alexey@gmail.com> wrote:

> If "hidepid=4" mount option is set then do not instantiate pids that
> we can not ptrace. "hidepid=4" means that procfs should only contain
> pids that the caller can ptrace.
>
> Cc: Kees Cook keescook@chromium.org
> Cc: Andy Lutomirski luto@kernel.org
> Signed-off-by: Djalal Harouni tixxdz@gmail.com
> Signed-off-by: Alexey Gladkov gladkov.alexey@gmail.com
>
> fs/proc/base.c | 15 +++++++++++++++
> fs/proc/root.c | 14 +++++++++++---
> include/linux/proc_fs.h | 1 +
> 3 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 24b7c620ded3..49937d54e745 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -699,6 +699,14 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
> struct task_struct *task,
> int hide_pid_min)
> {
>
> -   /*
> -   -   If 'hidpid' mount option is set force a ptrace check,
> -   -   we indicate that we are using a filesystem syscall
> -   -   by passing PTRACE_MODE_READ_FSCREDS
> -   */
> -   if (proc_fs_hide_pid(fs_info) == HIDEPID_NOT_PTRACABLE)
> -         return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
>
>
> -   if (proc_fs_hide_pid(fs_info) < hide_pid_min)
>     return true;
>     if (in_group_p(proc_fs_pid_gid(fs_info)))
>     @@ -3271,7 +3279,14 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
>     if (!task)
>     goto out;
>
> -   /* Limit procfs to only ptracable tasks */
> -   if (proc_fs_hide_pid(fs_info) == HIDEPID_NOT_PTRACABLE) {
> -         if (!has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS))
>
>
> -         	goto out_put_task;
>
>
> -   }
> -   result = proc_pid_instantiate(dentry, task, NULL);
>     +out_put_task:
>     put_task_struct(task);
>     out:
>     return result;
>     diff --git a/fs/proc/root.c b/fs/proc/root.c
>     index e2bb015da1a8..5e27bb31f125 100644
>     --- a/fs/proc/root.c
>     +++ b/fs/proc/root.c
>     @@ -52,6 +52,15 @@ static const struct fs_parameter_description proc_fs_parameters = {
>     .specs = proc_param_specs,
>     };
>
>     +static inline int
>     +valid_hidepid(unsigned int value)
>     +{
>
> -   return (value == HIDEPID_OFF ||
> -         value == HIDEPID_NO_ACCESS ||
>
>
> -         value == HIDEPID_INVISIBLE ||
>
>
> -         value == HIDEPID_NOT_PTRACABLE);
>
>
>
> +}
> +
> static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> {
> struct proc_fs_context *ctx = fc->fs_private;
> @@ -68,10 +77,9 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> break;
>
> case Opt_hidepid:
>
> -         if (!valid_hidepid(result.uint_32))
>
>
> -         	return invalf(fc, "proc: unknown value of hidepid.\\n");
>           ctx->hidepid = result.uint_32;
>
>
>
> -         if (ctx->hidepid < HIDEPID_OFF ||
>
>
> -             ctx->hidepid > HIDEPID_INVISIBLE)
>
>
> -         	return invalf(fc, "proc: hidepid value must be between 0 and 2.\\n");
>           break;
>
>
>
> default:
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index f307940f8311..6822548405a7 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -17,6 +17,7 @@ enum {
> HIDEPID_OFF = 0,
> HIDEPID_NO_ACCESS = 1,
> HIDEPID_INVISIBLE = 2,
>
> -   HIDEPID_NOT_PTRACABLE = 4, /* Limit pids to only ptracable pids */

Is there a reason new option is "4" instead of "3"? The order 1..2..4 may be
confusing for people.

Jordan

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-10 15:05 ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances Alexey Gladkov
@ 2020-02-10 17:46   ` Linus Torvalds
  2020-02-10 19:23     ` Al Viro
  2020-02-11  1:36   ` ebiederm
  2020-02-11 22:45   ` Al Viro
  2 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-10 17:46 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> This allows to flush dcache entries of a task on multiple procfs mounts
> per pid namespace.
>
> The RCU lock is used because the number of reads at the task exit time
> is much larger than the number of procfs mounts.

Ok, this looks better to me than the previous version.

But that may be the "pee-in-the-snow" effect, and I _really_ want
others to take a good look at the whole series.

The right people seem to be cc'd, but this is pretty core, and /proc
has a tendency to cause interesting issues because of how it's
involved in a lot of areas indirectly.

Al, Oleg, Andy, Eric?

             Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info
  2020-02-10 15:05 ` [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info Alexey Gladkov
@ 2020-02-10 18:23   ` Andy Lutomirski
  2020-02-12 15:00     ` Alexey Gladkov
  0 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2020-02-10 18:23 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Linus Torvalds, Oleg Nesterov,
	Solar Designer

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> This is a preparation patch that moves /proc/{self|thread-self} dentries
> to be stored inside procfs fs_info struct instead of making them per pid
> namespace. Since we want to support multiple procfs instances we need to
> make sure that these dentries are also per-superblock instead of
> per-pidns,

The changelog makes perfect sense so far...

> unmounting a private procfs won't clash with other procfs
> mounts.

This doesn't parse as part of the previous sentence.  I'm also not
convinced that this really involves unmounting per se.  Maybe just
delete these words.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior
  2020-02-10 15:05 ` [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior Alexey Gladkov
@ 2020-02-10 18:29   ` Andy Lutomirski
  2020-02-12 16:03     ` Alexey Gladkov
  0 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2020-02-10 18:29 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Linus Torvalds, Oleg Nesterov,
	Solar Designer

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> ---
>  Documentation/filesystems/proc.txt | 53 ++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index 99ca040e3f90..4741fd092f36 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -50,6 +50,8 @@ Table of Contents
>    4    Configuring procfs
>    4.1  Mount options
>
> +  5    Filesystem behavior
> +
>  ------------------------------------------------------------------------------
>  Preface
>  ------------------------------------------------------------------------------
> @@ -2021,6 +2023,7 @@ The following mount options are supported:
>
>         hidepid=        Set /proc/<pid>/ access mode.
>         gid=            Set the group authorized to learn processes information.
> +       subset=         Show only the specified subset of procfs.
>
>  hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
>  (default).
> @@ -2042,6 +2045,56 @@ information about running processes, whether some daemon runs with elevated
>  privileges, whether other user runs some sensitive program, whether other users
>  run any program at all, etc.
>
> +hidepid=4 means that procfs should only contain /proc/<pid>/ directories
> +that the caller can ptrace.

I have a couple of minor nits here.

First, perhaps we could stop using magic numbers and use words.
hidepid=ptraceable is actually comprehensible, whereas hidepid=4
requires looking up what '4' means.

Second, there is PTRACE_MODE_ATTACH and PTRACE_MODE_READ.  Which is it?

> +
>  gid= defines a group authorized to learn processes information otherwise
>  prohibited by hidepid=.  If you use some daemon like identd which needs to learn
>  information about processes information, just add identd to this group.

How is this better than just creating an entirely separate mount a
different hidepid and a different gid owning it?  In any event,
usually gid= means that this gid is the group owner of inodes.  Let's
call it something different.  gid_override_hidepid might be credible.
But it's also really weird -- do different groups really see different
contents when they read a directory?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options
  2020-02-10 15:05 ` [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options Alexey Gladkov
@ 2020-02-10 18:30   ` Andy Lutomirski
  2020-02-12 14:57     ` Alexey Gladkov
  0 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2020-02-10 18:30 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Linus Torvalds, Oleg Nesterov,
	Solar Designer

On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
>
> This is a cleaning patch to add helpers to set and get proc mount
> options instead of directly using them. This make it easy to track
> what's happening and easy to update in future.

On a cursory inspection, this looks like it obfuscates the code, and I
don't see where it does something useful later in the series.  What is
this abstraction for?

--Andy

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-10 17:46   ` Linus Torvalds
@ 2020-02-10 19:23     ` Al Viro
  0 siblings, 0 replies; 67+ messages in thread
From: Al Viro @ 2020-02-10 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Gladkov, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 09:46:26AM -0800, Linus Torvalds wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > This allows to flush dcache entries of a task on multiple procfs mounts
> > per pid namespace.
> >
> > The RCU lock is used because the number of reads at the task exit time
> > is much larger than the number of procfs mounts.
> 
> Ok, this looks better to me than the previous version.
> 
> But that may be the "pee-in-the-snow" effect, and I _really_ want
> others to take a good look at the whole series.
> 
> The right people seem to be cc'd, but this is pretty core, and /proc
> has a tendency to cause interesting issues because of how it's
> involved in a lot of areas indirectly.
> 
> Al, Oleg, Andy, Eric?

Will check tonight (ears-deep in sorting out the old branches right now)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-10 15:05 ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances Alexey Gladkov
  2020-02-10 17:46   ` Linus Torvalds
@ 2020-02-11  1:36   ` ebiederm
  2020-02-11  4:01     ` ebiederm
  2020-02-12 14:49     ` Alexey Gladkov
  2020-02-11 22:45   ` Al Viro
  2 siblings, 2 replies; 67+ messages in thread
From: ebiederm @ 2020-02-11  1:36 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> This allows to flush dcache entries of a task on multiple procfs mounts
> per pid namespace.
>
> The RCU lock is used because the number of reads at the task exit time
> is much larger than the number of procfs mounts.

A couple of quick comments.

> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> ---
>  fs/proc/base.c                | 20 +++++++++++++++-----
>  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
>  include/linux/pid_namespace.h |  2 ++
>  include/linux/proc_fs.h       |  2 ++
>  4 files changed, 45 insertions(+), 6 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4ccb280a3e79..24b7c620ded3 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
>  	.permission	= proc_pid_permission,
>  };
>  
> -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
Perhaps just rename things like:
> +static void proc_flush_task_root(struct dentry *root, pid_t pid, pid_t tgid)
>  {

I don't think the mnt_ prefix conveys any information, and it certainly
makes everything longer and more cumbersome.

>  	struct dentry *dentry, *leader, *dir;
>  	char buf[10 + 1];
> @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>  	name.name = buf;
>  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
>  	/* no ->d_hash() rejects on procfs */
> -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> +	dentry = d_hash_and_lookup(mnt_root, &name);
>  	if (dentry) {
>  		d_invalidate(dentry);
>  		dput(dentry);
> @@ -3153,7 +3153,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>  
>  	name.name = buf;
>  	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
> -	leader = d_hash_and_lookup(mnt->mnt_root, &name);
> +	leader = d_hash_and_lookup(mnt_root, &name);
>  	if (!leader)
>  		goto out;
>  
> @@ -3208,14 +3208,24 @@ void proc_flush_task(struct task_struct *task)
>  	int i;
>  	struct pid *pid, *tgid;
>  	struct upid *upid;
> +	struct dentry *mnt_root;
> +	struct proc_fs_info *fs_info;
>  
>  	pid = task_pid(task);
>  	tgid = task_tgid(task);
>  
>  	for (i = 0; i <= pid->level; i++) {
>  		upid = &pid->numbers[i];
> -		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
> -					tgid->numbers[i].nr);
> +
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
> +			mnt_root = fs_info->m_super->s_root;
> +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
> +		}
> +		rcu_read_unlock();
> +
> +		mnt_root = upid->ns->proc_mnt->mnt_root;
> +		proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);

I don't think this following of proc_mnt is needed.  It certainly
shouldn't be.  The loop through all of the super blocks should be
enough.

Once this change goes through.  UML can be given it's own dedicated
proc_mnt for the initial pid namespace, and proc_mnt can be removed
entirely.

Unless something has changed recently UML is the only other user of
pid_ns->proc_mnt.  That proc_mnt really only exists to make the loop in
proc_flush_task easy to write.

It also probably makes sense to take the rcu_read_lock() over
that entire for loop.


>  	}
>  }
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 5d5cba4c899b..e2bb015da1a8 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -149,7 +149,22 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>  	if (ret) {
>  		return ret;
>  	}
> -	return proc_setup_thread_self(s);
> +
> +	ret = proc_setup_thread_self(s);
> +	if (ret) {
> +		return ret;
> +	}
> +
> +	/*
> +	 * back reference to flush dcache entries at process exit time.
> +	 */
> +	ctx->fs_info->m_super = s;
> +
> +	spin_lock(&pid_ns->proc_mounts_lock);
> +	list_add_tail_rcu(&ctx->fs_info->pidns_entry, &pid_ns->proc_mounts);
> +	spin_unlock(&pid_ns->proc_mounts_lock);
> +
> +	return 0;
>  }
>  
>  static int proc_reconfigure(struct fs_context *fc)
> @@ -211,10 +226,17 @@ static void proc_kill_sb(struct super_block *sb)
>  {
>  	struct proc_fs_info *fs_info = proc_sb_info(sb);
>  
> +	spin_lock(&fs_info->pid_ns->proc_mounts_lock);
> +	list_del_rcu(&fs_info->pidns_entry);
> +	spin_unlock(&fs_info->pid_ns->proc_mounts_lock);
> +
> +	synchronize_rcu();
> +

Rather than a heavyweight synchronize_rcu here,
it probably makes sense to instead to change

	kfree(fs_info)
to
	kfree_rcu(fs_info, some_rcu_head_in_fs_info)

Or possibly doing a full call_rcu.

The problem is that synchronize_rcu takes about 10ms when HZ=100.
Which makes synchronize_rcu incredibly expensive on any path where
anything can wait for it.

>  	if (fs_info->proc_self)
>  		dput(fs_info->proc_self);
>  	if (fs_info->proc_thread_self)
>  		dput(fs_info->proc_thread_self);
> +
>  	kill_anon_super(sb);
>  	put_pid_ns(fs_info->pid_ns);
>  	kfree(fs_info);
> @@ -336,6 +358,9 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
>  		ctx->fs_info->pid_ns = ns;
>  	}
>  
> +	spin_lock_init(&ns->proc_mounts_lock);
> +	INIT_LIST_HEAD_RCU(&ns->proc_mounts);
> +
>  	mnt = fc_mount(fc);
>  	put_fs_context(fc);
>  	if (IS_ERR(mnt))
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 66f47f1afe0d..c36af1dfd862 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -26,6 +26,8 @@ struct pid_namespace {
>  	struct pid_namespace *parent;
>  #ifdef CONFIG_PROC_FS
>  	struct vfsmount *proc_mnt; /* Internal proc mounted during each new pidns */
> +	spinlock_t proc_mounts_lock;
> +	struct list_head proc_mounts; /* list of separated procfs mounts */
>  #endif
>  #ifdef CONFIG_BSD_PROCESS_ACCT
>  	struct fs_pin *bacct;
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 5f0b1b7e4271..f307940f8311 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -20,6 +20,8 @@ enum {
>  };
>  
>  struct proc_fs_info {
> +	struct list_head pidns_entry;    /* Node in procfs_mounts of a pidns */
> +	struct super_block *m_super;
>  	struct pid_namespace *pid_ns;
>  	struct dentry *proc_self;        /* For /proc/self */
>  	struct dentry *proc_thread_self; /* For /proc/thread-self */


Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-11  1:36   ` ebiederm
@ 2020-02-11  4:01     ` ebiederm
  2020-02-12 14:49     ` Alexey Gladkov
  1 sibling, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-11  4:01 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

ebiederm@xmission.com (Eric W. Biederman) writes:

> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>
>> This allows to flush dcache entries of a task on multiple procfs mounts
>> per pid namespace.
>>
>> The RCU lock is used because the number of reads at the task exit time
>> is much larger than the number of procfs mounts.
>
> A couple of quick comments.
>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
>> ---
>>  fs/proc/base.c                | 20 +++++++++++++++-----
>>  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
>>  include/linux/pid_namespace.h |  2 ++
>>  include/linux/proc_fs.h       |  2 ++
>>  4 files changed, 45 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 4ccb280a3e79..24b7c620ded3 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
>>  	.permission	= proc_pid_permission,
>>  };
>>  
>> -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>> +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
> Perhaps just rename things like:
>> +static void proc_flush_task_root(struct dentry *root, pid_t pid, pid_t tgid)
>>  {
>
> I don't think the mnt_ prefix conveys any information, and it certainly
> makes everything longer and more cumbersome.
>
>>  	struct dentry *dentry, *leader, *dir;
>>  	char buf[10 + 1];
>> @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>>  	name.name = buf;
>>  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
>>  	/* no ->d_hash() rejects on procfs */
>> -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
>> +	dentry = d_hash_and_lookup(mnt_root, &name);
>>  	if (dentry) {
>>  		d_invalidate(dentry);
>>  		dput(dentry);
>> @@ -3153,7 +3153,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>>  
>>  	name.name = buf;
>>  	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
>> -	leader = d_hash_and_lookup(mnt->mnt_root, &name);
>> +	leader = d_hash_and_lookup(mnt_root, &name);
>>  	if (!leader)
>>  		goto out;
>>  
>> @@ -3208,14 +3208,24 @@ void proc_flush_task(struct task_struct *task)
>>  	int i;
>>  	struct pid *pid, *tgid;
>>  	struct upid *upid;
>> +	struct dentry *mnt_root;
>> +	struct proc_fs_info *fs_info;
>>  
>>  	pid = task_pid(task);
>>  	tgid = task_tgid(task);
>>  
>>  	for (i = 0; i <= pid->level; i++) {
>>  		upid = &pid->numbers[i];
>> -		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
>> -					tgid->numbers[i].nr);
>> +
>> +		rcu_read_lock();
>> +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
>> +			mnt_root = fs_info->m_super->s_root;
>> +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
>> +		}
>> +		rcu_read_unlock();
>> +
>> +		mnt_root = upid->ns->proc_mnt->mnt_root;
>> +		proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
>
> I don't think this following of proc_mnt is needed.  It certainly
> shouldn't be.  The loop through all of the super blocks should be
> enough.
>
> Once this change goes through.  UML can be given it's own dedicated
> proc_mnt for the initial pid namespace, and proc_mnt can be removed
> entirely.
>
> Unless something has changed recently UML is the only other user of
> pid_ns->proc_mnt.  That proc_mnt really only exists to make the loop in
> proc_flush_task easy to write.
>
> It also probably makes sense to take the rcu_read_lock() over
> that entire for loop.
>
>
>>  	}
>>  }
>>  
>> diff --git a/fs/proc/root.c b/fs/proc/root.c
>> index 5d5cba4c899b..e2bb015da1a8 100644
>> --- a/fs/proc/root.c
>> +++ b/fs/proc/root.c
>> @@ -149,7 +149,22 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>>  	if (ret) {
>>  		return ret;
>>  	}
>> -	return proc_setup_thread_self(s);
>> +
>> +	ret = proc_setup_thread_self(s);
>> +	if (ret) {
>> +		return ret;
>> +	}
>> +
>> +	/*
>> +	 * back reference to flush dcache entries at process exit time.
>> +	 */
>> +	ctx->fs_info->m_super = s;
>> +
>> +	spin_lock(&pid_ns->proc_mounts_lock);
>> +	list_add_tail_rcu(&ctx->fs_info->pidns_entry, &pid_ns->proc_mounts);
>> +	spin_unlock(&pid_ns->proc_mounts_lock);
>> +
>> +	return 0;
>>  }
>>  
>>  static int proc_reconfigure(struct fs_context *fc)
>> @@ -211,10 +226,17 @@ static void proc_kill_sb(struct super_block *sb)
>>  {
>>  	struct proc_fs_info *fs_info = proc_sb_info(sb);
>>  
>> +	spin_lock(&fs_info->pid_ns->proc_mounts_lock);
>> +	list_del_rcu(&fs_info->pidns_entry);
>> +	spin_unlock(&fs_info->pid_ns->proc_mounts_lock);
>> +
>> +	synchronize_rcu();
>> +
>
> Rather than a heavyweight synchronize_rcu here,
> it probably makes sense to instead to change
>
> 	kfree(fs_info)
> to
> 	kfree_rcu(fs_info, some_rcu_head_in_fs_info)
>
> Or possibly doing a full call_rcu.
>
> The problem is that synchronize_rcu takes about 10ms when HZ=100.
> Which makes synchronize_rcu incredibly expensive on any path where
> anything can wait for it.

I take it back.

The crucial thing to insert a rcu delay before is
shrink_dcache_for_umount.

With the call chain:
kill_anon_super
    generic_shutdown_super
        shrink_dcache_for_umount.

So we do need synchronize_rcu or to put everything below this point into
a call_rcu function.

The practical concern is mntput calls does task_work_add.
The task_work function is cleanup_mnt.
When called cleanup_mnt calls deactivate_locked_super
which in turn calls proc_kill_sb.

Which is a long way of saying that the code will delay the
exit of the process that calls mntput by 10ms (or
however syncrhonize_rcu runs).

We probably don't care.  But that is long enough in to be user
visible, and potentially cause problems.

But all it requires putting the code below in it's own
function and triggering it with call_rcu if that causes a regression
by bying so slow.

>>  	if (fs_info->proc_self)
>>  		dput(fs_info->proc_self);
>>  	if (fs_info->proc_thread_self)
>>  		dput(fs_info->proc_thread_self);
>> +
>>  	kill_anon_super(sb);
>>  	put_pid_ns(fs_info->pid_ns);
>>  	kfree(fs_info);
>> @@ -336,6 +358,9 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
>>  		ctx->fs_info->pid_ns = ns;
>>  	}
>>  
>> +	spin_lock_init(&ns->proc_mounts_lock);
>> +	INIT_LIST_HEAD_RCU(&ns->proc_mounts);
>> +
>>  	mnt = fc_mount(fc);
>>  	put_fs_context(fc);
>>  	if (IS_ERR(mnt))
>> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
>> index 66f47f1afe0d..c36af1dfd862 100644
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -26,6 +26,8 @@ struct pid_namespace {
>>  	struct pid_namespace *parent;
>>  #ifdef CONFIG_PROC_FS
>>  	struct vfsmount *proc_mnt; /* Internal proc mounted during each new pidns */
>> +	spinlock_t proc_mounts_lock;
>> +	struct list_head proc_mounts; /* list of separated procfs mounts */
>>  #endif
>>  #ifdef CONFIG_BSD_PROCESS_ACCT
>>  	struct fs_pin *bacct;
>> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
>> index 5f0b1b7e4271..f307940f8311 100644
>> --- a/include/linux/proc_fs.h
>> +++ b/include/linux/proc_fs.h
>> @@ -20,6 +20,8 @@ enum {
>>  };
>>  
>>  struct proc_fs_info {
>> +	struct list_head pidns_entry;    /* Node in procfs_mounts of a pidns */
>> +	struct super_block *m_super;
>>  	struct pid_namespace *pid_ns;
>>  	struct dentry *proc_self;        /* For /proc/self */
>>  	struct dentry *proc_thread_self; /* For /proc/thread-self */
>
>
> Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-10 15:05 ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances Alexey Gladkov
  2020-02-10 17:46   ` Linus Torvalds
  2020-02-11  1:36   ` ebiederm
@ 2020-02-11 22:45   ` Al Viro
  2020-02-12 14:26     ` Alexey Gladkov
  2 siblings, 1 reply; 67+ messages in thread
From: Al Viro @ 2020-02-11 22:45 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 04:05:15PM +0100, Alexey Gladkov wrote:
> This allows to flush dcache entries of a task on multiple procfs mounts
> per pid namespace.
> 
> The RCU lock is used because the number of reads at the task exit time
> is much larger than the number of procfs mounts.
> 
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> ---
>  fs/proc/base.c                | 20 +++++++++++++++-----
>  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
>  include/linux/pid_namespace.h |  2 ++
>  include/linux/proc_fs.h       |  2 ++
>  4 files changed, 45 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4ccb280a3e79..24b7c620ded3 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
>  	.permission	= proc_pid_permission,
>  };
>  
> -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
>  {
>  	struct dentry *dentry, *leader, *dir;
>  	char buf[10 + 1];
> @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>  	name.name = buf;
>  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
>  	/* no ->d_hash() rejects on procfs */
> -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> +	dentry = d_hash_and_lookup(mnt_root, &name);
>  	if (dentry) {
>  		d_invalidate(dentry);
... which can block
>  		dput(dentry);
... and so can this

> +		rcu_read_lock();
> +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
> +			mnt_root = fs_info->m_super->s_root;
> +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);

... making that more than slightly unsafe.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-11 22:45   ` Al Viro
@ 2020-02-12 14:26     ` Alexey Gladkov
  0 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 14:26 UTC (permalink / raw)
  To: Al Viro
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Tue, Feb 11, 2020 at 10:45:53PM +0000, Al Viro wrote:
> On Mon, Feb 10, 2020 at 04:05:15PM +0100, Alexey Gladkov wrote:
> > This allows to flush dcache entries of a task on multiple procfs mounts
> > per pid namespace.
> > 
> > The RCU lock is used because the number of reads at the task exit time
> > is much larger than the number of procfs mounts.
> > 
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> > ---
> >  fs/proc/base.c                | 20 +++++++++++++++-----
> >  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
> >  include/linux/pid_namespace.h |  2 ++
> >  include/linux/proc_fs.h       |  2 ++
> >  4 files changed, 45 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 4ccb280a3e79..24b7c620ded3 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
> >  	.permission	= proc_pid_permission,
> >  };
> >  
> > -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> > +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
> >  {
> >  	struct dentry *dentry, *leader, *dir;
> >  	char buf[10 + 1];
> > @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> >  	name.name = buf;
> >  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
> >  	/* no ->d_hash() rejects on procfs */
> > -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> > +	dentry = d_hash_and_lookup(mnt_root, &name);
> >  	if (dentry) {
> >  		d_invalidate(dentry);
> ... which can block
> >  		dput(dentry);
> ... and so can this
> 
> > +		rcu_read_lock();
> > +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
> > +			mnt_root = fs_info->m_super->s_root;
> > +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
> 
> ... making that more than slightly unsafe.

I see. So, I can't use rcu locks here as well as spinlocks.
Thanks!

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  2020-02-10 16:29   ` Jordan Glover
@ 2020-02-12 14:34     ` Alexey Gladkov
  0 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 14:34 UTC (permalink / raw)
  To: Jordan Glover
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Eric W . Biederman,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Linus Torvalds, Oleg Nesterov,
	Solar Designer

On Mon, Feb 10, 2020 at 04:29:31PM +0000, Jordan Glover wrote:
> On Monday, February 10, 2020 3:05 PM, Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> 
> > If "hidepid=4" mount option is set then do not instantiate pids that
> > we can not ptrace. "hidepid=4" means that procfs should only contain
> > pids that the caller can ptrace.
> >
> > Cc: Kees Cook keescook@chromium.org
> > Cc: Andy Lutomirski luto@kernel.org
> > Signed-off-by: Djalal Harouni tixxdz@gmail.com
> > Signed-off-by: Alexey Gladkov gladkov.alexey@gmail.com
> >
> > fs/proc/base.c | 15 +++++++++++++++
> > fs/proc/root.c | 14 +++++++++++---
> > include/linux/proc_fs.h | 1 +
> > 3 files changed, 27 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 24b7c620ded3..49937d54e745 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -699,6 +699,14 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
> > struct task_struct *task,
> > int hide_pid_min)
> > {
> >
> > -   /*
> > -   -   If 'hidpid' mount option is set force a ptrace check,
> > -   -   we indicate that we are using a filesystem syscall
> > -   -   by passing PTRACE_MODE_READ_FSCREDS
> > -   */
> > -   if (proc_fs_hide_pid(fs_info) == HIDEPID_NOT_PTRACABLE)
> > -         return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
> >
> >
> > -   if (proc_fs_hide_pid(fs_info) < hide_pid_min)
> >     return true;
> >     if (in_group_p(proc_fs_pid_gid(fs_info)))
> >     @@ -3271,7 +3279,14 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
> >     if (!task)
> >     goto out;
> >
> > -   /* Limit procfs to only ptracable tasks */
> > -   if (proc_fs_hide_pid(fs_info) == HIDEPID_NOT_PTRACABLE) {
> > -         if (!has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS))
> >
> >
> > -         	goto out_put_task;
> >
> >
> > -   }
> > -   result = proc_pid_instantiate(dentry, task, NULL);
> >     +out_put_task:
> >     put_task_struct(task);
> >     out:
> >     return result;
> >     diff --git a/fs/proc/root.c b/fs/proc/root.c
> >     index e2bb015da1a8..5e27bb31f125 100644
> >     --- a/fs/proc/root.c
> >     +++ b/fs/proc/root.c
> >     @@ -52,6 +52,15 @@ static const struct fs_parameter_description proc_fs_parameters = {
> >     .specs = proc_param_specs,
> >     };
> >
> >     +static inline int
> >     +valid_hidepid(unsigned int value)
> >     +{
> >
> > -   return (value == HIDEPID_OFF ||
> > -         value == HIDEPID_NO_ACCESS ||
> >
> >
> > -         value == HIDEPID_INVISIBLE ||
> >
> >
> > -         value == HIDEPID_NOT_PTRACABLE);
> >
> >
> >
> > +}
> > +
> > static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > {
> > struct proc_fs_context *ctx = fc->fs_private;
> > @@ -68,10 +77,9 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > break;
> >
> > case Opt_hidepid:
> >
> > -         if (!valid_hidepid(result.uint_32))
> >
> >
> > -         	return invalf(fc, "proc: unknown value of hidepid.\\n");
> >           ctx->hidepid = result.uint_32;
> >
> >
> >
> > -         if (ctx->hidepid < HIDEPID_OFF ||
> >
> >
> > -             ctx->hidepid > HIDEPID_INVISIBLE)
> >
> >
> > -         	return invalf(fc, "proc: hidepid value must be between 0 and 2.\\n");
> >           break;
> >
> >
> >
> > default:
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index f307940f8311..6822548405a7 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -17,6 +17,7 @@ enum {
> > HIDEPID_OFF = 0,
> > HIDEPID_NO_ACCESS = 1,
> > HIDEPID_INVISIBLE = 2,
> >
> > -   HIDEPID_NOT_PTRACABLE = 4, /* Limit pids to only ptracable pids */
> 
> Is there a reason new option is "4" instead of "3"? The order 1..2..4 may be
> confusing for people.

This is just mask. For now hidepid values are mutually exclusive, but
since it moved to uapi, I thought it would be good if there was an
opportunity to combine values.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-11  1:36   ` ebiederm
  2020-02-11  4:01     ` ebiederm
@ 2020-02-12 14:49     ` Alexey Gladkov
  2020-02-12 14:59       ` ebiederm
  1 sibling, 1 reply; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 14:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 07:36:08PM -0600, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> 
> > This allows to flush dcache entries of a task on multiple procfs mounts
> > per pid namespace.
> >
> > The RCU lock is used because the number of reads at the task exit time
> > is much larger than the number of procfs mounts.
> 
> A couple of quick comments.
> 
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> > ---
> >  fs/proc/base.c                | 20 +++++++++++++++-----
> >  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
> >  include/linux/pid_namespace.h |  2 ++
> >  include/linux/proc_fs.h       |  2 ++
> >  4 files changed, 45 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 4ccb280a3e79..24b7c620ded3 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
> >  	.permission	= proc_pid_permission,
> >  };
> >  
> > -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> > +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
> Perhaps just rename things like:
> > +static void proc_flush_task_root(struct dentry *root, pid_t pid, pid_t tgid)
> >  {
> 
> I don't think the mnt_ prefix conveys any information, and it certainly
> makes everything longer and more cumbersome.
> 
> >  	struct dentry *dentry, *leader, *dir;
> >  	char buf[10 + 1];
> > @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> >  	name.name = buf;
> >  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
> >  	/* no ->d_hash() rejects on procfs */
> > -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> > +	dentry = d_hash_and_lookup(mnt_root, &name);
> >  	if (dentry) {
> >  		d_invalidate(dentry);
> >  		dput(dentry);
> > @@ -3153,7 +3153,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> >  
> >  	name.name = buf;
> >  	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
> > -	leader = d_hash_and_lookup(mnt->mnt_root, &name);
> > +	leader = d_hash_and_lookup(mnt_root, &name);
> >  	if (!leader)
> >  		goto out;
> >  
> > @@ -3208,14 +3208,24 @@ void proc_flush_task(struct task_struct *task)
> >  	int i;
> >  	struct pid *pid, *tgid;
> >  	struct upid *upid;
> > +	struct dentry *mnt_root;
> > +	struct proc_fs_info *fs_info;
> >  
> >  	pid = task_pid(task);
> >  	tgid = task_tgid(task);
> >  
> >  	for (i = 0; i <= pid->level; i++) {
> >  		upid = &pid->numbers[i];
> > -		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
> > -					tgid->numbers[i].nr);
> > +
> > +		rcu_read_lock();
> > +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
> > +			mnt_root = fs_info->m_super->s_root;
> > +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
> > +		}
> > +		rcu_read_unlock();
> > +
> > +		mnt_root = upid->ns->proc_mnt->mnt_root;
> > +		proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
> 
> I don't think this following of proc_mnt is needed.  It certainly
> shouldn't be.  The loop through all of the super blocks should be
> enough.

Yes, thanks!

> Once this change goes through.  UML can be given it's own dedicated
> proc_mnt for the initial pid namespace, and proc_mnt can be removed
> entirely.

After you deleted the old sysctl syscall we could probably do it.

> Unless something has changed recently UML is the only other user of
> pid_ns->proc_mnt.  That proc_mnt really only exists to make the loop in
> proc_flush_task easy to write.

Now I think, is there any way to get rid of proc_mounts or even
proc_flush_task somehow.

> It also probably makes sense to take the rcu_read_lock() over
> that entire for loop.

Al Viro pointed out to me that I cannot use rcu locks here :(

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options
  2020-02-10 18:30   ` Andy Lutomirski
@ 2020-02-12 14:57     ` Alexey Gladkov
  0 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 14:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 10:30:45AM -0800, Andy Lutomirski wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > This is a cleaning patch to add helpers to set and get proc mount
> > options instead of directly using them. This make it easy to track
> > what's happening and easy to update in future.
> 
> On a cursory inspection, this looks like it obfuscates the code, and I
> don't see where it does something useful later in the series.  What is
> this abstraction for?

To be honest, this part is from Djalal Harouni. For me, these wrappers
exist for a slight increase in readability.

I will not cry if you tell me to remove them :)

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 14:49     ` Alexey Gladkov
@ 2020-02-12 14:59       ` ebiederm
  2020-02-12 17:08         ` Alexey Gladkov
  2020-02-12 18:45         ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: ebiederm @ 2020-02-12 14:59 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> On Mon, Feb 10, 2020 at 07:36:08PM -0600, Eric W. Biederman wrote:
>> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
>> 
>> > This allows to flush dcache entries of a task on multiple procfs mounts
>> > per pid namespace.
>> >
>> > The RCU lock is used because the number of reads at the task exit time
>> > is much larger than the number of procfs mounts.
>> 
>> A couple of quick comments.
>> 
>> > Cc: Kees Cook <keescook@chromium.org>
>> > Cc: Andy Lutomirski <luto@kernel.org>
>> > Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
>> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> > Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
>> > ---
>> >  fs/proc/base.c                | 20 +++++++++++++++-----
>> >  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
>> >  include/linux/pid_namespace.h |  2 ++
>> >  include/linux/proc_fs.h       |  2 ++
>> >  4 files changed, 45 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/fs/proc/base.c b/fs/proc/base.c
>> > index 4ccb280a3e79..24b7c620ded3 100644
>> > --- a/fs/proc/base.c
>> > +++ b/fs/proc/base.c
>> > @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
>> >  	.permission	= proc_pid_permission,
>> >  };
>> >  
>> > -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>> > +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
>> Perhaps just rename things like:
>> > +static void proc_flush_task_root(struct dentry *root, pid_t pid, pid_t tgid)
>> >  {
>> 
>> I don't think the mnt_ prefix conveys any information, and it certainly
>> makes everything longer and more cumbersome.
>> 
>> >  	struct dentry *dentry, *leader, *dir;
>> >  	char buf[10 + 1];
>> > @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>> >  	name.name = buf;
>> >  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
>> >  	/* no ->d_hash() rejects on procfs */
>> > -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
>> > +	dentry = d_hash_and_lookup(mnt_root, &name);
>> >  	if (dentry) {
>> >  		d_invalidate(dentry);
>> >  		dput(dentry);
>> > @@ -3153,7 +3153,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
>> >  
>> >  	name.name = buf;
>> >  	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
>> > -	leader = d_hash_and_lookup(mnt->mnt_root, &name);
>> > +	leader = d_hash_and_lookup(mnt_root, &name);
>> >  	if (!leader)
>> >  		goto out;
>> >  
>> > @@ -3208,14 +3208,24 @@ void proc_flush_task(struct task_struct *task)
>> >  	int i;
>> >  	struct pid *pid, *tgid;
>> >  	struct upid *upid;
>> > +	struct dentry *mnt_root;
>> > +	struct proc_fs_info *fs_info;
>> >  
>> >  	pid = task_pid(task);
>> >  	tgid = task_tgid(task);
>> >  
>> >  	for (i = 0; i <= pid->level; i++) {
>> >  		upid = &pid->numbers[i];
>> > -		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
>> > -					tgid->numbers[i].nr);
>> > +
>> > +		rcu_read_lock();
>> > +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
>> > +			mnt_root = fs_info->m_super->s_root;
>> > +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
>> > +		}
>> > +		rcu_read_unlock();
>> > +
>> > +		mnt_root = upid->ns->proc_mnt->mnt_root;
>> > +		proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
>> 
>> I don't think this following of proc_mnt is needed.  It certainly
>> shouldn't be.  The loop through all of the super blocks should be
>> enough.
>
> Yes, thanks!
>
>> Once this change goes through.  UML can be given it's own dedicated
>> proc_mnt for the initial pid namespace, and proc_mnt can be removed
>> entirely.
>
> After you deleted the old sysctl syscall we could probably do it.
>
>> Unless something has changed recently UML is the only other user of
>> pid_ns->proc_mnt.  That proc_mnt really only exists to make the loop in
>> proc_flush_task easy to write.
>
> Now I think, is there any way to get rid of proc_mounts or even
> proc_flush_task somehow.
>
>> It also probably makes sense to take the rcu_read_lock() over
>> that entire for loop.
>
> Al Viro pointed out to me that I cannot use rcu locks here :(

Fundamentally proc_flush_task is an optimization.  Just getting rid of
dentries earlier.  At least at one point it was an important
optimization because the old process dentries would just sit around
doing nothing for anyone.

I wonder if instead of invalidating specific dentries we could instead
fire wake up a shrinker and point it at one or more instances of proc.

The practical challenge I see is something might need to access the
dentries to see that they are invalid.

We definitely could try without this optimization and see what happens.

Eric


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info
  2020-02-10 18:23   ` Andy Lutomirski
@ 2020-02-12 15:00     ` Alexey Gladkov
  0 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 15:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 10:23:23AM -0800, Andy Lutomirski wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > This is a preparation patch that moves /proc/{self|thread-self} dentries
> > to be stored inside procfs fs_info struct instead of making them per pid
> > namespace. Since we want to support multiple procfs instances we need to
> > make sure that these dentries are also per-superblock instead of
> > per-pidns,
> 
> The changelog makes perfect sense so far...
> 
> > unmounting a private procfs won't clash with other procfs
> > mounts.
> 
> This doesn't parse as part of the previous sentence.  I'm also not
> convinced that this really involves unmounting per se.  Maybe just
> delete these words.

Sure. I will remove this part.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior
  2020-02-10 18:29   ` Andy Lutomirski
@ 2020-02-12 16:03     ` Alexey Gladkov
  0 siblings, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 16:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Eric W . Biederman, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Mon, Feb 10, 2020 at 10:29:23AM -0800, Andy Lutomirski wrote:
> On Mon, Feb 10, 2020 at 7:06 AM Alexey Gladkov <gladkov.alexey@gmail.com> wrote:
> >
> > Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> > ---
> >  Documentation/filesystems/proc.txt | 53 ++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> >
> > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> > index 99ca040e3f90..4741fd092f36 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -50,6 +50,8 @@ Table of Contents
> >    4    Configuring procfs
> >    4.1  Mount options
> >
> > +  5    Filesystem behavior
> > +
> >  ------------------------------------------------------------------------------
> >  Preface
> >  ------------------------------------------------------------------------------
> > @@ -2021,6 +2023,7 @@ The following mount options are supported:
> >
> >         hidepid=        Set /proc/<pid>/ access mode.
> >         gid=            Set the group authorized to learn processes information.
> > +       subset=         Show only the specified subset of procfs.
> >
> >  hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
> >  (default).
> > @@ -2042,6 +2045,56 @@ information about running processes, whether some daemon runs with elevated
> >  privileges, whether other user runs some sensitive program, whether other users
> >  run any program at all, etc.
> >
> > +hidepid=4 means that procfs should only contain /proc/<pid>/ directories
> > +that the caller can ptrace.
> 
> I have a couple of minor nits here.
> 
> First, perhaps we could stop using magic numbers and use words.
> hidepid=ptraceable is actually comprehensible, whereas hidepid=4
> requires looking up what '4' means.

Do you mean to add string aliases for the values?

hidepid=0 == hidepid=default
hidepid=1 == hidepid=restrict
hidepid=2 == hidepid=ownonly
hidepid=4 == hidepid=ptraceable

Something like that ?

> Second, there is PTRACE_MODE_ATTACH and PTRACE_MODE_READ.  Which is it?

This is PTRACE_MODE_READ.

> > +
> >  gid= defines a group authorized to learn processes information otherwise
> >  prohibited by hidepid=.  If you use some daemon like identd which needs to learn
> >  information about processes information, just add identd to this group.
> 
> How is this better than just creating an entirely separate mount a
> different hidepid and a different gid owning it?

I'm not sure I understand the question. Now you cannot have two proc with
different hidepid in the same pid_namespace. 

> In any event,
> usually gid= means that this gid is the group owner of inodes.  Let's
> call it something different.  gid_override_hidepid might be credible.
> But it's also really weird -- do different groups really see different
> contents when they read a directory?

If you use hidepid=2,gid=wheel options then the user is not in the wheel
group will see only their processes and the user in the wheel group will
see whole tree. The gid= is a kind of whitelist for hidepid=1|2.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 14:59       ` ebiederm
@ 2020-02-12 17:08         ` Alexey Gladkov
  2020-02-12 18:45         ` Linus Torvalds
  1 sibling, 0 replies; 67+ messages in thread
From: Alexey Gladkov @ 2020-02-12 17:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 08:59:29AM -0600, Eric W. Biederman wrote:
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> 
> > On Mon, Feb 10, 2020 at 07:36:08PM -0600, Eric W. Biederman wrote:
> >> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> >> 
> >> > This allows to flush dcache entries of a task on multiple procfs mounts
> >> > per pid namespace.
> >> >
> >> > The RCU lock is used because the number of reads at the task exit time
> >> > is much larger than the number of procfs mounts.
> >> 
> >> A couple of quick comments.
> >> 
> >> > Cc: Kees Cook <keescook@chromium.org>
> >> > Cc: Andy Lutomirski <luto@kernel.org>
> >> > Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
> >> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> >> > Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
> >> > ---
> >> >  fs/proc/base.c                | 20 +++++++++++++++-----
> >> >  fs/proc/root.c                | 27 ++++++++++++++++++++++++++-
> >> >  include/linux/pid_namespace.h |  2 ++
> >> >  include/linux/proc_fs.h       |  2 ++
> >> >  4 files changed, 45 insertions(+), 6 deletions(-)
> >> >
> >> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> > index 4ccb280a3e79..24b7c620ded3 100644
> >> > --- a/fs/proc/base.c
> >> > +++ b/fs/proc/base.c
> >> > @@ -3133,7 +3133,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
> >> >  	.permission	= proc_pid_permission,
> >> >  };
> >> >  
> >> > -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> >> > +static void proc_flush_task_mnt_root(struct dentry *mnt_root, pid_t pid, pid_t tgid)
> >> Perhaps just rename things like:
> >> > +static void proc_flush_task_root(struct dentry *root, pid_t pid, pid_t tgid)
> >> >  {
> >> 
> >> I don't think the mnt_ prefix conveys any information, and it certainly
> >> makes everything longer and more cumbersome.
> >> 
> >> >  	struct dentry *dentry, *leader, *dir;
> >> >  	char buf[10 + 1];
> >> > @@ -3142,7 +3142,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> >> >  	name.name = buf;
> >> >  	name.len = snprintf(buf, sizeof(buf), "%u", pid);
> >> >  	/* no ->d_hash() rejects on procfs */
> >> > -	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
> >> > +	dentry = d_hash_and_lookup(mnt_root, &name);
> >> >  	if (dentry) {
> >> >  		d_invalidate(dentry);
> >> >  		dput(dentry);
> >> > @@ -3153,7 +3153,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
> >> >  
> >> >  	name.name = buf;
> >> >  	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
> >> > -	leader = d_hash_and_lookup(mnt->mnt_root, &name);
> >> > +	leader = d_hash_and_lookup(mnt_root, &name);
> >> >  	if (!leader)
> >> >  		goto out;
> >> >  
> >> > @@ -3208,14 +3208,24 @@ void proc_flush_task(struct task_struct *task)
> >> >  	int i;
> >> >  	struct pid *pid, *tgid;
> >> >  	struct upid *upid;
> >> > +	struct dentry *mnt_root;
> >> > +	struct proc_fs_info *fs_info;
> >> >  
> >> >  	pid = task_pid(task);
> >> >  	tgid = task_tgid(task);
> >> >  
> >> >  	for (i = 0; i <= pid->level; i++) {
> >> >  		upid = &pid->numbers[i];
> >> > -		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
> >> > -					tgid->numbers[i].nr);
> >> > +
> >> > +		rcu_read_lock();
> >> > +		list_for_each_entry_rcu(fs_info, &upid->ns->proc_mounts, pidns_entry) {
> >> > +			mnt_root = fs_info->m_super->s_root;
> >> > +			proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
> >> > +		}
> >> > +		rcu_read_unlock();
> >> > +
> >> > +		mnt_root = upid->ns->proc_mnt->mnt_root;
> >> > +		proc_flush_task_mnt_root(mnt_root, upid->nr, tgid->numbers[i].nr);
> >> 
> >> I don't think this following of proc_mnt is needed.  It certainly
> >> shouldn't be.  The loop through all of the super blocks should be
> >> enough.
> >
> > Yes, thanks!
> >
> >> Once this change goes through.  UML can be given it's own dedicated
> >> proc_mnt for the initial pid namespace, and proc_mnt can be removed
> >> entirely.
> >
> > After you deleted the old sysctl syscall we could probably do it.
> >
> >> Unless something has changed recently UML is the only other user of
> >> pid_ns->proc_mnt.  That proc_mnt really only exists to make the loop in
> >> proc_flush_task easy to write.
> >
> > Now I think, is there any way to get rid of proc_mounts or even
> > proc_flush_task somehow.
> >
> >> It also probably makes sense to take the rcu_read_lock() over
> >> that entire for loop.
> >
> > Al Viro pointed out to me that I cannot use rcu locks here :(
> 
> Fundamentally proc_flush_task is an optimization.  Just getting rid of
> dentries earlier.  At least at one point it was an important
> optimization because the old process dentries would just sit around
> doing nothing for anyone.
> 
> I wonder if instead of invalidating specific dentries we could instead
> fire wake up a shrinker and point it at one or more instances of proc.
> 
> The practical challenge I see is something might need to access the
> dentries to see that they are invalid.
> 
> We definitely could try without this optimization and see what happens.

When Linus said that a semaphore for proc_mounts is a bad idea, I tried
to come up with some kind of asynchronous way to clear it per superblock.
I gave up with the asynchronous GC because userspace can quite easily get
ahead of it.

Without this optimization the kernel starts to consume a lot of memory
during intensive reading /proc. I tried to do:

while :; do
	for x in `seq 0 9`; do sleep 0.1; done;
	ls /proc/[0-9]*;
done >/dev/null;

and memory consumption went up without proc_flush_task. Since we have
mounted procfs in each container, this is dangerous.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 14:59       ` ebiederm
  2020-02-12 17:08         ` Alexey Gladkov
@ 2020-02-12 18:45         ` Linus Torvalds
  2020-02-12 19:16           ` ebiederm
  2020-02-12 19:47           ` Al Viro
  1 sibling, 2 replies; 67+ messages in thread
From: Linus Torvalds @ 2020-02-12 18:45 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 7:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Fundamentally proc_flush_task is an optimization.  Just getting rid of
> dentries earlier.  At least at one point it was an important
> optimization because the old process dentries would just sit around
> doing nothing for anyone.

I'm pretty sure it's still important. It's very easy to generate a
_ton_ of dentries with /proc.

> I wonder if instead of invalidating specific dentries we could instead
> fire wake up a shrinker and point it at one or more instances of proc.

It shouldn't be the dentries themselves that are a freeing problem.
They're being RCU-free'd anyway because of lookup. It's the
proc_mounts list that is the problem, isn't it?

So it's just fs_info that needs to be rcu-delayed because it contains
that list. Or is there something else?

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 18:45         ` Linus Torvalds
@ 2020-02-12 19:16           ` ebiederm
  2020-02-12 19:49             ` Linus Torvalds
  2020-02-12 19:47           ` Al Viro
  1 sibling, 1 reply; 67+ messages in thread
From: ebiederm @ 2020-02-12 19:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Feb 12, 2020 at 7:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Fundamentally proc_flush_task is an optimization.  Just getting rid of
>> dentries earlier.  At least at one point it was an important
>> optimization because the old process dentries would just sit around
>> doing nothing for anyone.
>
> I'm pretty sure it's still important. It's very easy to generate a
> _ton_ of dentries with /proc.
>
>> I wonder if instead of invalidating specific dentries we could instead
>> fire wake up a shrinker and point it at one or more instances of proc.
>
> It shouldn't be the dentries themselves that are a freeing problem.
> They're being RCU-free'd anyway because of lookup. It's the
> proc_mounts list that is the problem, isn't it?
>
> So it's just fs_info that needs to be rcu-delayed because it contains
> that list. Or is there something else?

The fundamental dcache thing we are playing with is:

	dentry = d_hash_and_lookup(proc_root, &name);
	if (dentry) {
		d_invalidate(dentry);
		dput(dentry);
	}

As Al pointed out upthread dput and d_invalidate can both sleep.

The dput can potentially go away if we use __d_lookup_rcu instead of
d_lookup.

The challenge is d_invalidate.

It has the fundamentally sleeping detach_mounts loop.  Even
shrink_dcache_parent has a cond_sched() in there to ensure it doesn't
live lock the system.

We could and arguabley should set DCACHE_CANT_MOUNT on the proc pid
dentries.  Which will prevent having to deal with mounts.

But I don't see an easy way of getting shrink_dcache_parent to run
without sleeping.  Ideas?


Eric








^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 18:45         ` Linus Torvalds
  2020-02-12 19:16           ` ebiederm
@ 2020-02-12 19:47           ` Al Viro
  1 sibling, 0 replies; 67+ messages in thread
From: Al Viro @ 2020-02-12 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 10:45:06AM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2020 at 7:01 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Fundamentally proc_flush_task is an optimization.  Just getting rid of
> > dentries earlier.  At least at one point it was an important
> > optimization because the old process dentries would just sit around
> > doing nothing for anyone.
> 
> I'm pretty sure it's still important. It's very easy to generate a
> _ton_ of dentries with /proc.
> 
> > I wonder if instead of invalidating specific dentries we could instead
> > fire wake up a shrinker and point it at one or more instances of proc.
> 
> It shouldn't be the dentries themselves that are a freeing problem.
> They're being RCU-free'd anyway because of lookup. It's the
> proc_mounts list that is the problem, isn't it?
> 
> So it's just fs_info that needs to be rcu-delayed because it contains
> that list. Or is there something else?

Large part of the headache is the possibility that some joker has
done something like mounting tmpfs on /proc/<pid>/map_files, or
binding /dev/null on top of /proc/<pid>/syscall, etc.

IOW, that d_invalidate() can very well have to grab namespace_sem.
And possibly do a full-blown fs shutdown of something NFS-mounted,
etc...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 19:16           ` ebiederm
@ 2020-02-12 19:49             ` Linus Torvalds
  2020-02-12 20:03               ` Al Viro
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-12 19:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 11:18 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> > So it's just fs_info that needs to be rcu-delayed because it contains
> > that list. Or is there something else?
>
> The fundamental dcache thing we are playing with is:
>
>         dentry = d_hash_and_lookup(proc_root, &name);
>         if (dentry) {
>                 d_invalidate(dentry);
>                 dput(dentry);
>         }

Ahh. And we can't do that part under the RCU read lock. So it's not
the freeing, it's the list traversal itself.

Fair enough.

Hmm.

I wonder if we could split up d_invalidate(). It already ends up being
two phases: first the unhashing under the d_lock, and then the
recursive shrinking of parents and children.

The recursive shrinking of the parent isn't actually interesting for
the proc shrinking case: we just looked up one child, after all. So we
only care about the d_walk of the children.

So if we only did the first part under the RCU lock, and just
collected the dentries (can we perhaps then re-use the hash list to
collect them to another list?) and then did the child d_walk
afterwards?

             Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 19:49             ` Linus Torvalds
@ 2020-02-12 20:03               ` Al Viro
  2020-02-12 20:35                 ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Al Viro @ 2020-02-12 20:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 11:49:58AM -0800, Linus Torvalds wrote:

> I wonder if we could split up d_invalidate(). It already ends up being
> two phases: first the unhashing under the d_lock, and then the
> recursive shrinking of parents and children.
> 
> The recursive shrinking of the parent isn't actually interesting for
> the proc shrinking case: we just looked up one child, after all. So we
> only care about the d_walk of the children.
> 
> So if we only did the first part under the RCU lock, and just
> collected the dentries (can we perhaps then re-use the hash list to
> collect them to another list?) and then did the child d_walk
> afterwards?

What's to prevent racing with fs shutdown while you are doing the second part?
We could, after all, just have them[*] on procfs-private list (anchored in
task_struct) from the very beginning; evict on ->d_prune(), walk the list
on exit...  How do you make sure the fs instance won't go away right under
you while you are doing the real work?  Suppose you are looking at one
of those dentries and you've found something blocking to do.  You can't
pin that dentry; you can pin ->s_active on its superblock (if it's already
zero, you can skip it - fs shutdown already in progress will take care of
the damn thing), but that will lead to quite a bit of cacheline pingpong...

[*] only /proc/<pid> and /proc/*/task/<pid> dentries, obviously.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 20:03               ` Al Viro
@ 2020-02-12 20:35                 ` Linus Torvalds
  2020-02-12 20:38                   ` Al Viro
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-12 20:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What's to prevent racing with fs shutdown while you are doing the second part?

I was thinking that only the proc_flush_task() code would do this.

And that holds a ref to the vfsmount through upid->ns.

So I wasn't suggesting doing this in general - just splitting up the
implementation of d_invalidate() so that proc_flush_task_mnt() could
delay the complex part to after having traversed the RCU-protected
list.

But hey - I missed this part of the problem originally, so maybe I'm
just missing something else this time. Wouldn't be the first time.

               Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 20:35                 ` Linus Torvalds
@ 2020-02-12 20:38                   ` Al Viro
  2020-02-12 20:41                     ` Al Viro
  2020-02-14  3:49                     ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances ebiederm
  0 siblings, 2 replies; 67+ messages in thread
From: Al Viro @ 2020-02-12 20:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 12:35:04PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What's to prevent racing with fs shutdown while you are doing the second part?
> 
> I was thinking that only the proc_flush_task() code would do this.
> 
> And that holds a ref to the vfsmount through upid->ns.
> 
> So I wasn't suggesting doing this in general - just splitting up the
> implementation of d_invalidate() so that proc_flush_task_mnt() could
> delay the complex part to after having traversed the RCU-protected
> list.
> 
> But hey - I missed this part of the problem originally, so maybe I'm
> just missing something else this time. Wouldn't be the first time.

Wait, I thought the whole point of that had been to allow multiple
procfs instances for the same userns?  Confused...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 20:38                   ` Al Viro
@ 2020-02-12 20:41                     ` Al Viro
  2020-02-12 21:02                       ` Linus Torvalds
  2020-02-14  3:49                     ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances ebiederm
  1 sibling, 1 reply; 67+ messages in thread
From: Al Viro @ 2020-02-12 20:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 08:38:33PM +0000, Al Viro wrote:
> On Wed, Feb 12, 2020 at 12:35:04PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > What's to prevent racing with fs shutdown while you are doing the second part?
> > 
> > I was thinking that only the proc_flush_task() code would do this.
> > 
> > And that holds a ref to the vfsmount through upid->ns.
> > 
> > So I wasn't suggesting doing this in general - just splitting up the
> > implementation of d_invalidate() so that proc_flush_task_mnt() could
> > delay the complex part to after having traversed the RCU-protected
> > list.
> > 
> > But hey - I missed this part of the problem originally, so maybe I'm
> > just missing something else this time. Wouldn't be the first time.
> 
> Wait, I thought the whole point of that had been to allow multiple
> procfs instances for the same userns?  Confused...

s/userns/pidns/, sorry

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 20:41                     ` Al Viro
@ 2020-02-12 21:02                       ` Linus Torvalds
  2020-02-12 21:46                         ` ebiederm
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-12 21:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 12:41 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Feb 12, 2020 at 08:38:33PM +0000, Al Viro wrote:
> >
> > Wait, I thought the whole point of that had been to allow multiple
> > procfs instances for the same userns?  Confused...
>
> s/userns/pidns/, sorry

Right, but we still hold the ref to it here...

[ Looks more ]

Oooh. No we don't. Exactly because we don't hold the lock, only the
rcu lifetime, the ref can go away from under us. I see what your
concern is.

Ouch, this is more painful than I expected - the code flow looked so
simple. I really wanted to avoid a new lock during process shutdown,
because that has always been somewhat painful.

            Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 21:02                       ` Linus Torvalds
@ 2020-02-12 21:46                         ` ebiederm
  2020-02-13  0:48                           ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: ebiederm @ 2020-02-12 21:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Feb 12, 2020 at 12:41 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> On Wed, Feb 12, 2020 at 08:38:33PM +0000, Al Viro wrote:
>> >
>> > Wait, I thought the whole point of that had been to allow multiple
>> > procfs instances for the same userns?  Confused...
>>
>> s/userns/pidns/, sorry
>
> Right, but we still hold the ref to it here...
>
> [ Looks more ]
>
> Oooh. No we don't. Exactly because we don't hold the lock, only the
> rcu lifetime, the ref can go away from under us. I see what your
> concern is.
>
> Ouch, this is more painful than I expected - the code flow looked so
> simple. I really wanted to avoid a new lock during process shutdown,
> because that has always been somewhat painful.

The good news is proc_flush_task isn't exactly called from process exit.
proc_flush_task is called during zombie clean up. AKA release_task.

So proc_flush_task isn't called with any locks held, and it is
called in a context where it can sleep.

Further after proc_flush_task does it's thing the code goes
and does "write_lock_irq(&task_list_lock);"

So the code is definitely serialized to one processor already.

What would be downside of having a mutex for a list of proc superblocks?
A mutex that is taken for both reading and writing the list.

Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 21:46                         ` ebiederm
@ 2020-02-13  0:48                           ` Linus Torvalds
  2020-02-13  4:37                             ` ebiederm
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-13  0:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 1:48 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The good news is proc_flush_task isn't exactly called from process exit.
> proc_flush_task is called during zombie clean up. AKA release_task.

Yeah, that at least avoids some of the nasty locking while dying debug problems.

But the one I was more worried about was actually the lock contention
issue with lots of processes. The lock is basically a single global
lock in many situations - yes, it's technically per-ns, but in a lot
of cases you really only have one namespace anyway.

And we've had problems with global locks in this area before, notably
the one you call out:

> Further after proc_flush_task does it's thing the code goes
> and does "write_lock_irq(&task_list_lock);"

Yeah, so it's not introducing a new issue, but it is potentially
making something we already know is bad even worse.

> What would be downside of having a mutex for a list of proc superblocks?
> A mutex that is taken for both reading and writing the list.

That's what the original patch actually was, and I was hoping we could
avoid that thing.

An rwsem would be possibly better, since most cases by far are likely
about reading.

And yes, I'm very aware of the task_list_lock, but it's literally why
I don't want to make a new one.

I'm _hoping_ we can some day come up with something better than task_list_lock.

            Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13  0:48                           ` Linus Torvalds
@ 2020-02-13  4:37                             ` ebiederm
  2020-02-13  5:55                               ` Al Viro
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
  0 siblings, 2 replies; 67+ messages in thread
From: ebiederm @ 2020-02-13  4:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Feb 12, 2020 at 1:48 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The good news is proc_flush_task isn't exactly called from process exit.
>> proc_flush_task is called during zombie clean up. AKA release_task.
>
> Yeah, that at least avoids some of the nasty locking while dying debug problems.
>
> But the one I was more worried about was actually the lock contention
> issue with lots of processes. The lock is basically a single global
> lock in many situations - yes, it's technically per-ns, but in a lot
> of cases you really only have one namespace anyway.
>
> And we've had problems with global locks in this area before, notably
> the one you call out:
>
>> Further after proc_flush_task does it's thing the code goes
>> and does "write_lock_irq(&task_list_lock);"
>
> Yeah, so it's not introducing a new issue, but it is potentially
> making something we already know is bad even worse.
>
>> What would be downside of having a mutex for a list of proc superblocks?
>> A mutex that is taken for both reading and writing the list.
>
> That's what the original patch actually was, and I was hoping we could
> avoid that thing.
>
> An rwsem would be possibly better, since most cases by far are likely
> about reading.
>
> And yes, I'm very aware of the task_list_lock, but it's literally why
> I don't want to make a new one.
>
> I'm _hoping_ we can some day come up with something better than
> task_list_lock.

Yes.  I understand that.

I occassionally play with ideas, and converted all of proc to rcu
to help with situation but I haven't come up with anything clearly
better.


All of this is why I was really hoping we could have a change in
strategy and see if we can make the shrinker be able to better prune
proc inodes.



I think I have an alternate idea that could work.  Add some extra code
into proc_task_readdir, that would look for dentries that no longer
point to tasks and d_invalidate them.  With the same logic probably
being called from a few more places as well like proc_pid_readdir,
proc_task_lookup, and proc_pid_lookup.

We could even optimize it and have a process died flag we set in the
superblock.

That would would batch up the freeing work until the next time someone
reads from proc in a way that would create more dentries.  So it would
prevent dentries from reaped zombies from growing without bound.

Hmm.  Given the existence of proc_fill_cache it would really be a good
idea if readdir and lookup performed some of the freeing work as well.
As on readdir we always populate the dcache for all of the directory
entries.

I am booked solid for the next little while but if no one beats me to it
I will try and code something like that up where at least readdir
looks for and invalidates stale dentries.

Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13  4:37                             ` ebiederm
@ 2020-02-13  5:55                               ` Al Viro
  2020-02-13 21:30                                 ` Linus Torvalds
  2020-02-14  3:48                                 ` ebiederm
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
  1 sibling, 2 replies; 67+ messages in thread
From: Al Viro @ 2020-02-13  5:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote:

> I think I have an alternate idea that could work.  Add some extra code
> into proc_task_readdir, that would look for dentries that no longer
> point to tasks and d_invalidate them.  With the same logic probably
> being called from a few more places as well like proc_pid_readdir,
> proc_task_lookup, and proc_pid_lookup.
> 
> We could even optimize it and have a process died flag we set in the
> superblock.
> 
> That would would batch up the freeing work until the next time someone
> reads from proc in a way that would create more dentries.  So it would
> prevent dentries from reaped zombies from growing without bound.
> 
> Hmm.  Given the existence of proc_fill_cache it would really be a good
> idea if readdir and lookup performed some of the freeing work as well.
> As on readdir we always populate the dcache for all of the directory
> entries.

First of all, that won't do a damn thing when nobody is accessing
given superblock.  What's more, readdir in root of that procfs instance
is not enough - you need it in task/ of group leader.

What I don't understand is the insistence on getting those dentries
via dcache lookups.  _IF_ we are willing to live with cacheline
contention (on ->d_lock of root dentry, if nothing else), why not
do the following:
	* put all dentries of such directories ([0-9]* and [0-9]*/task/*)
into a list anchored in task_struct; have non-counting reference to
task_struct stored in them (might simplify part of get_proc_task() users,
BTW - avoids pid-to-task_struct lookups if we have a dentry and not just
the inode; many callers do)
	* have ->d_release() remove from it (protecting per-task_struct lock
nested outside of all ->d_lock)
	* on exit:
	lock the (per-task_struct) list
	while list is non-empty
		pick the first dentry
		remove from the list
		sb = dentry->d_sb
		try to bump sb->s_active (if non-zero, that is).
		if failed
			continue // move on to the next one - nothing to do here
		grab ->d_lock
		res = handle_it(dentry, &temp_list)
		drop ->d_lock
		unlock the list
		if (!list_empty(&temp_list))
			shrink_dentry_list(&temp_list)
		if (res)
			d_invalidate(dentry)
			dput(dentry)
		deactivate_super(sb)
		lock the list
	unlock the list

handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c
	if ->d_count is negative // unlikely
		return 0;
	if ->d_count is positive,
		increment ->d_count
		return 1;
	// OK, it's still alive, but ->d_count is 0
	__d_drop	// equivalent of d_invalidate in this case
	if not on a shrink list // otherwise it's not our headache
		if on lru list
			d_lru_del
		d_shrink_add dentry to temp_list
	return 0;

And yeah, that'll dirty ->s_active for each procfs superblock that
has dentry for our process present in dcache.  On exit()...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13  5:55                               ` Al Viro
@ 2020-02-13 21:30                                 ` Linus Torvalds
  2020-02-13 22:23                                   ` Al Viro
  2020-02-14  3:48                                 ` ebiederm
  1 sibling, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-13 21:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Wed, Feb 12, 2020 at 9:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What I don't understand is the insistence on getting those dentries
> via dcache lookups.

I don't think that's an "insistence", it's more of a "historical
behavior" together with "several changes over the years to deal with
dentry-level cleanups and updates".

> _IF_ we are willing to live with cacheline
> contention (on ->d_lock of root dentry, if nothing else), why not
> do the following:
>         * put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> into a list anchored in task_struct; have non-counting reference to
> task_struct stored in them (might simplify part of get_proc_task() users,

Hmm.

Right now I don't think we actually create any dentries at all for the
short-lived process case.

Wouldn't your suggestion make fork/exit rather worse?

Or would you create the dentries dynamically still at lookup time, and
then attach them to the process at that point?

What list would you use for the dentry chaining? Would you play games
with the dentry hashing, and "hash" them off the process, and never
hit in the lookup cache?

Am I misunderstanding what you suggest?

            Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13 21:30                                 ` Linus Torvalds
@ 2020-02-13 22:23                                   ` Al Viro
  2020-02-13 22:47                                     ` Linus Torvalds
  0 siblings, 1 reply; 67+ messages in thread
From: Al Viro @ 2020-02-13 22:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Thu, Feb 13, 2020 at 01:30:11PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2020 at 9:55 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What I don't understand is the insistence on getting those dentries
> > via dcache lookups.
> 
> I don't think that's an "insistence", it's more of a "historical
> behavior" together with "several changes over the years to deal with
> dentry-level cleanups and updates".
> 
> > _IF_ we are willing to live with cacheline
> > contention (on ->d_lock of root dentry, if nothing else), why not
> > do the following:
> >         * put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> > into a list anchored in task_struct; have non-counting reference to
> > task_struct stored in them (might simplify part of get_proc_task() users,
> 
> Hmm.
> 
> Right now I don't think we actually create any dentries at all for the
> short-lived process case.
> 
> Wouldn't your suggestion make fork/exit rather worse?
> 
> Or would you create the dentries dynamically still at lookup time, and
> then attach them to the process at that point?
> 
> What list would you use for the dentry chaining? Would you play games
> with the dentry hashing, and "hash" them off the process, and never
> hit in the lookup cache?

I'd been thinking of ->d_fsdata pointing to a structure with list_head
and a (non-counting) task_struct pointer for those guys.  Allocated
on lookup, of course (as well as readdir ;-/) and put on the list
at the same time.

IOW, for short-lived process we simply have an empty (h)list anchored
in task_struct and that's it.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13 22:23                                   ` Al Viro
@ 2020-02-13 22:47                                     ` Linus Torvalds
  2020-02-14 14:15                                       ` ebiederm
  0 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-13 22:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Thu, Feb 13, 2020 at 2:23 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I'd been thinking of ->d_fsdata pointing to a structure with list_head
> and a (non-counting) task_struct pointer for those guys.  Allocated
> on lookup, of course (as well as readdir ;-/) and put on the list
> at the same time.

Hmm. That smells like potentially a lot of small allocations, and
making readdir() even nastier.

Do we really want to create the dentries at readdir time? We do now
(with proc_fill_cache()) but do we actually _need_ to?

I guess a lot of readdir users end up doing a stat on it immediately
afterwards. I think right now we do it to get the inode number, and
maybe that is a basic requirement (even if I don't think it's really
stable - an inode could be evicted and then the ino changes, no?)

Ho humm. This all doesn't make me happy. But I guess the proof is in
the pudding - and if you come up with a good patch, I won't complain.

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13  5:55                               ` Al Viro
  2020-02-13 21:30                                 ` Linus Torvalds
@ 2020-02-14  3:48                                 ` ebiederm
  1 sibling, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-14  3:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Feb 12, 2020 at 10:37:52PM -0600, Eric W. Biederman wrote:
>
>> I think I have an alternate idea that could work.  Add some extra code
>> into proc_task_readdir, that would look for dentries that no longer
>> point to tasks and d_invalidate them.  With the same logic probably
>> being called from a few more places as well like proc_pid_readdir,
>> proc_task_lookup, and proc_pid_lookup.
>> 
>> We could even optimize it and have a process died flag we set in the
>> superblock.
>> 
>> That would would batch up the freeing work until the next time someone
>> reads from proc in a way that would create more dentries.  So it would
>> prevent dentries from reaped zombies from growing without bound.
>> 
>> Hmm.  Given the existence of proc_fill_cache it would really be a good
>> idea if readdir and lookup performed some of the freeing work as well.
>> As on readdir we always populate the dcache for all of the directory
>> entries.
>
> First of all, that won't do a damn thing when nobody is accessing
> given superblock.  What's more, readdir in root of that procfs instance
> is not enough - you need it in task/ of group leader.

It should give a rough bound on the number of stale dentries a
superblock can have.  The same basic concept has been used very
successfully in many incremental garbage collectors.  In those malloc
(or the equivalent) does a finite amount of garbage collection work to
roughly balance out the amount of memory allocated.  I am proposing
something similar for proc instances.

Further if no one is accessing a superblock we don't have a problem
either.


> What I don't understand is the insistence on getting those dentries
> via dcache lookups.  _IF_ we are willing to live with cacheline
> contention (on ->d_lock of root dentry, if nothing else), why not
> do the following:

No insistence from this side.

I was not seeing atomic_inc_not_zero(sb->s_active) from rcu
context as option earlier.  But it is an option.

> 	* put all dentries of such directories ([0-9]* and [0-9]*/task/*)
> into a list anchored in task_struct; have non-counting reference to
> task_struct stored in them (might simplify part of get_proc_task() users,
> BTW - avoids pid-to-task_struct lookups if we have a dentry and not just
> the inode; many callers do)
> 	* have ->d_release() remove from it (protecting per-task_struct lock
> nested outside of all ->d_lock)
> 	* on exit:
> 	lock the (per-task_struct) list
> 	while list is non-empty
> 		pick the first dentry
> 		remove from the list
> 		sb = dentry->d_sb
> 		try to bump sb->s_active (if non-zero, that is).
> 		if failed
> 			continue // move on to the next one - nothing to do here
> 		grab ->d_lock
> 		res = handle_it(dentry, &temp_list)
> 		drop ->d_lock
> 		unlock the list
> 		if (!list_empty(&temp_list))
> 			shrink_dentry_list(&temp_list)
> 		if (res)
> 			d_invalidate(dentry)
> 			dput(dentry)
> 		deactivate_super(sb)
> 		lock the list
> 	unlock the list
>
> handle_it(dentry, temp_list) // ->d_lock held; that one should be in dcache.c
> 	if ->d_count is negative // unlikely
> 		return 0;
> 	if ->d_count is positive,
> 		increment ->d_count
> 		return 1;
> 	// OK, it's still alive, but ->d_count is 0
> 	__d_drop	// equivalent of d_invalidate in this case
> 	if not on a shrink list // otherwise it's not our headache
> 		if on lru list
> 			d_lru_del
> 		d_shrink_add dentry to temp_list
> 	return 0;
>
> And yeah, that'll dirty ->s_active for each procfs superblock that
> has dentry for our process present in dcache.  On exit()...


I would thread the whole thing through the proc_inode instead of coming
up with a new allocation per dentry so an extra memory allocation isn't
needed.  We already have i_dentry.  So going from the vfs_inode to
the dentry is trivial.



But truthfully I don't like proc_flush_task.

The problem is that proc_flush_task is a layering violation and magic
code that pretty much no one understands.  We have some very weird
cases where dput or d_invalidate wound up triggering ext3 code.  It has
been fixed for a long time now, but it wasy crazy weird unexpected
stuff.


Al your logic above just feels very clever, and like many pieces of the
kernel have to know how other pieces of the kernel work.  If we can find
something stupid and simple that also solves the problem I would be much
happier.   Than anyone could understand and fix it if something goes
wrong.

Eric







^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-12 20:38                   ` Al Viro
  2020-02-12 20:41                     ` Al Viro
@ 2020-02-14  3:49                     ` ebiederm
  1 sibling, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-14  3:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Feb 12, 2020 at 12:35:04PM -0800, Linus Torvalds wrote:
>> On Wed, Feb 12, 2020 at 12:03 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>> >
>> > What's to prevent racing with fs shutdown while you are doing the second part?
>> 
>> I was thinking that only the proc_flush_task() code would do this.
>> 
>> And that holds a ref to the vfsmount through upid->ns.
>> 
>> So I wasn't suggesting doing this in general - just splitting up the
>> implementation of d_invalidate() so that proc_flush_task_mnt() could
>> delay the complex part to after having traversed the RCU-protected
>> list.
>> 
>> But hey - I missed this part of the problem originally, so maybe I'm
>> just missing something else this time. Wouldn't be the first time.
>
> Wait, I thought the whole point of that had been to allow multiple
> procfs instances for the same userns?  Confused...

Multiple procfs instances for the same pidns.  Exactly.

Which would let people have their own set of procfs mount
options without having to worry about stomping on someone else.

The fundamental problem with multiple procfs instances per pidns
is there isn't an obvous place to put a vfs mount.


...


Which means we need some way to keep the file system from going away
while anyone in the kernel is running proc_flush_task.

One was I can see to solve this that would give us cheap readers, is to
have a percpu count of the number of processes in proc_flush_task.
That would work something like mnt_count.

Then forbid proc_kill_sb from removing any super block from the list
or otherwise making progress until the proc_flush_task_count goes
to zero.


f we wanted cheap readers and an expensive writer
kind of flag that proc_kill_sb can

Thinking out loud perhaps we have add a list_head on task_struct
and a list_head in proc_inode.  That would let us find the inodes
and by extention the dentries we care about quickly.

Then in evict_inode we could remove the proc_inode from the list.


Eric


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances
  2020-02-13 22:47                                     ` Linus Torvalds
@ 2020-02-14 14:15                                       ` ebiederm
  0 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-14 14:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

Linus Torvalds <torvalds@linux-foundation.org> writes:

> I guess a lot of readdir users end up doing a stat on it immediately
> afterwards. I think right now we do it to get the inode number, and
> maybe that is a basic requirement (even if I don't think it's really
> stable - an inode could be evicted and then the ino changes, no?)

All I know is proc_fill_cache seemed like a good idea at the time.
I may have been to clever.

While I think proc_fill_cache probably exacerbates the issue
it isn't the reason we have the flushing logic.  The proc
flushing logic was introduced in around 2.5.9 much earlier
than the other proc things.

commit 0030633355db2bba32d97655df73b04215018ab9
Author: Alexander Viro <viro@math.psu.edu>
Date:   Sun Apr 21 23:03:37 2002 -0700

    [PATCH] (3/5) sane procfs/dcache interaction
    
     - sane dentry retention.  Namely, we don't kill /proc/<pid> dentries at the
       first opportunity (as the current tree does).  Instead we do the following:
            * ->d_delete() kills it only if process is already dead.
            * all ->lookup() in proc/base.c end with checking if process is still
              alive and unhash if it isn't.
            * proc_pid_lookup() (lookup for /proc/<pid>) caches reference to dentry
              in task_struct.  It's _not_ counted in ->d_count.
            * ->d_iput() resets said reference to NULL.
            * release_task() (burying a zombie) checks if there is a cached
              reference and if there is - shrinks the subtree.
            * tasklist_lock is used for exclusion.
       That way we are guaranteed that after release_task() all dentries in
       /proc/<pid> will go away as soon as possible; OTOH, before release_task()
       we have normal retention policy - they go away under memory pressure with
       the same rules as for dentries on any other fs.

Tracking down when this logic was introduced I also see that this code
has broken again and again any time proc changes (like now).  So it is
definitely subtle and fragile.

Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 0/7] proc: Dentry flushing without proc_mnt
  2020-02-13  4:37                             ` ebiederm
  2020-02-13  5:55                               ` Al Viro
@ 2020-02-20 20:46                               ` ebiederm
  2020-02-20 20:47                                 ` [PATCH 1/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes ebiederm
                                                   ` (7 more replies)
  1 sibling, 8 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


Just because it is less of a fundamental change and less testing I went
and looked at updating proc_flush_task to use a list as Al suggested.

If we can stand an sget/deactivate_super pair for every dentry we want
to invalidate I think I have something.

Comments from anyone will be appreciated I gave this some light testing
and the code is based on something similar already present in proc so
I think there is a high chance this code is correct but I could easily
be wrong.

Linus, does this approach look like something you can stand?

Eric

Eric W. Biederman (7):
      proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
      proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
      proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache
      proc: Use d_invalidate in proc_prune_siblings_dcache
      proc: Clear the pieces of proc_inode that proc_evict_inode cares about
      proc: Use a list of inodes to flush from proc
      proc: Ensure we see the exit of each process tid exactly once

 fs/exec.c               |   5 +--
 fs/proc/base.c          | 111 ++++++++++++++++--------------------------------
 fs/proc/inode.c         |  60 +++++++++++++++++++++++---
 fs/proc/internal.h      |   4 +-
 fs/proc/proc_sysctl.c   |  45 +++-----------------
 include/linux/pid.h     |   2 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c           |   4 +-
 kernel/pid.c            |  16 +++++++
 9 files changed, 124 insertions(+), 127 deletions(-)



^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 1/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
@ 2020-02-20 20:47                                 ` ebiederm
  2020-02-20 20:48                                 ` [PATCH 2/7] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache ebiederm
                                                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


I about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purporse.

Just give the field a generic name so it is clear that
it is no longer sysctl specific.

Also for good measure initialize sibling_inodes when
proc_inode is initialized.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/inode.c       | 1 +
 fs/proc/internal.h    | 2 +-
 fs/proc/proc_sysctl.c | 8 ++++----
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 6da18316d209..bdae442d5262 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -68,6 +68,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
 	ei->pde = NULL;
 	ei->sysctl = NULL;
 	ei->sysctl_entry = NULL;
+	INIT_HLIST_NODE(&ei->sibling_inodes);
 	ei->ns_ops = NULL;
 	return &ei->vfs_inode;
 }
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 41587276798e..366cd3aa690b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -91,7 +91,7 @@ struct proc_inode {
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
 	struct ctl_table *sysctl_entry;
-	struct hlist_node sysctl_inodes;
+	struct hlist_node sibling_inodes;
 	const struct proc_ns_operations *ns_ops;
 	struct inode vfs_inode;
 } __randomize_layout;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c75bb4632ed1..42fbb7f3c587 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -279,9 +279,9 @@ static void proc_sys_prune_dcache(struct ctl_table_header *head)
 		node = hlist_first_rcu(&head->inodes);
 		if (!node)
 			break;
-		ei = hlist_entry(node, struct proc_inode, sysctl_inodes);
+		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
 		spin_lock(&sysctl_lock);
-		hlist_del_init_rcu(&ei->sysctl_inodes);
+		hlist_del_init_rcu(&ei->sibling_inodes);
 		spin_unlock(&sysctl_lock);
 
 		inode = &ei->vfs_inode;
@@ -483,7 +483,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	}
 	ei->sysctl = head;
 	ei->sysctl_entry = table;
-	hlist_add_head_rcu(&ei->sysctl_inodes, &head->inodes);
+	hlist_add_head_rcu(&ei->sibling_inodes, &head->inodes);
 	head->count++;
 	spin_unlock(&sysctl_lock);
 
@@ -514,7 +514,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
 {
 	spin_lock(&sysctl_lock);
-	hlist_del_init_rcu(&PROC_I(inode)->sysctl_inodes);
+	hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes);
 	if (!--head->count)
 		kfree_rcu(head, rcu);
 	spin_unlock(&sysctl_lock);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 2/7] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
  2020-02-20 20:47                                 ` [PATCH 1/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes ebiederm
@ 2020-02-20 20:48                                 ` ebiederm
  2020-02-20 20:49                                 ` [PATCH 3/7] proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache ebiederm
                                                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/inode.c       | 38 ++++++++++++++++++++++++++++++++++++++
 fs/proc/internal.h    |  1 +
 fs/proc/proc_sysctl.c | 35 +----------------------------------
 3 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index bdae442d5262..74ce4a8d05eb 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -103,6 +103,44 @@ void __init proc_init_kmemcache(void)
 	BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE);
 }
 
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+{
+	struct inode *inode;
+	struct proc_inode *ei;
+	struct hlist_node *node;
+	struct super_block *sb;
+
+	rcu_read_lock();
+	for (;;) {
+		node = hlist_first_rcu(inodes);
+		if (!node)
+			break;
+		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
+		spin_lock(lock);
+		hlist_del_init_rcu(&ei->sibling_inodes);
+		spin_unlock(lock);
+
+		inode = &ei->vfs_inode;
+		sb = inode->i_sb;
+		if (!atomic_inc_not_zero(&sb->s_active))
+			continue;
+		inode = igrab(inode);
+		rcu_read_unlock();
+		if (unlikely(!inode)) {
+			deactivate_super(sb);
+			rcu_read_lock();
+			continue;
+		}
+
+		d_prune_aliases(inode);
+		iput(inode);
+		deactivate_super(sb);
+
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+}
+
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct super_block *sb = root->d_sb;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 366cd3aa690b..ba9a991824a5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -210,6 +210,7 @@ extern const struct inode_operations proc_pid_link_inode_operations;
 extern const struct super_operations proc_sops;
 
 void proc_init_kmemcache(void);
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 42fbb7f3c587..5da9d7f7ae34 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -269,40 +269,7 @@ static void unuse_table(struct ctl_table_header *p)
 
 static void proc_sys_prune_dcache(struct ctl_table_header *head)
 {
-	struct inode *inode;
-	struct proc_inode *ei;
-	struct hlist_node *node;
-	struct super_block *sb;
-
-	rcu_read_lock();
-	for (;;) {
-		node = hlist_first_rcu(&head->inodes);
-		if (!node)
-			break;
-		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
-		spin_lock(&sysctl_lock);
-		hlist_del_init_rcu(&ei->sibling_inodes);
-		spin_unlock(&sysctl_lock);
-
-		inode = &ei->vfs_inode;
-		sb = inode->i_sb;
-		if (!atomic_inc_not_zero(&sb->s_active))
-			continue;
-		inode = igrab(inode);
-		rcu_read_unlock();
-		if (unlikely(!inode)) {
-			deactivate_super(sb);
-			rcu_read_lock();
-			continue;
-		}
-
-		d_prune_aliases(inode);
-		iput(inode);
-		deactivate_super(sb);
-
-		rcu_read_lock();
-	}
-	rcu_read_unlock();
+	proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 3/7] proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
  2020-02-20 20:47                                 ` [PATCH 1/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes ebiederm
  2020-02-20 20:48                                 ` [PATCH 2/7] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache ebiederm
@ 2020-02-20 20:49                                 ` ebiederm
  2020-02-20 22:33                                   ` Linus Torvalds
  2020-02-20 20:49                                 ` [PATCH 4/7] proc: Use d_invalidate " ebiederm
                                                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


Don't make it look like rcu_read_lock is held over the entire loop
instead just take the rcu_read_lock over the part of the loop that
matters.  This makes the intent of the code a little clearer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/inode.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 74ce4a8d05eb..38a7baa41aba 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -110,11 +110,13 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 	struct hlist_node *node;
 	struct super_block *sb;
 
-	rcu_read_lock();
 	for (;;) {
+		rcu_read_lock();
 		node = hlist_first_rcu(inodes);
-		if (!node)
+		if (!node) {
+			rcu_read_unlock();
 			break;
+		}
 		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
 		spin_lock(lock);
 		hlist_del_init_rcu(&ei->sibling_inodes);
@@ -122,23 +124,21 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 
 		inode = &ei->vfs_inode;
 		sb = inode->i_sb;
-		if (!atomic_inc_not_zero(&sb->s_active))
+		if (!atomic_inc_not_zero(&sb->s_active)) {
+			rcu_read_unlock();
 			continue;
+		}
 		inode = igrab(inode);
 		rcu_read_unlock();
 		if (unlikely(!inode)) {
 			deactivate_super(sb);
-			rcu_read_lock();
 			continue;
 		}
 
 		d_prune_aliases(inode);
 		iput(inode);
 		deactivate_super(sb);
-
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
 }
 
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 4/7] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
                                                   ` (2 preceding siblings ...)
  2020-02-20 20:49                                 ` [PATCH 3/7] proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache ebiederm
@ 2020-02-20 20:49                                 ` " ebiederm
  2020-02-20 22:43                                   ` Linus Torvalds
  2020-02-20 22:54                                   ` Al Viro
  2020-02-20 20:51                                 ` [PATCH 5/7] proc: Clear the pieces of proc_inode that proc_evict_inode cares about ebiederm
                                                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


The function d_prune_aliases has the problem that it will only prune
aliases thare are completely unused.  It will not remove aliases for
the dcache or even think of removing mounts from the dcache.  For that
behavior d_invalidate is needed.

To use d_invalidate replace d_prune_aliases with d_find_alias
followed by d_invalidate and dput.  This is safe and complete
because no inode in proc has any hardlinks or aliases.

To make this behavior change clear rename
proc_prune_siblings_dache proc_invalidate_siblings_dcache,
and rename proc_sys_prune_dcache proc_sys_invalidate_dcache.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/inode.c       | 9 +++++++--
 fs/proc/internal.h    | 2 +-
 fs/proc/proc_sysctl.c | 8 ++++----
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 38a7baa41aba..c4528c419876 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -103,12 +103,13 @@ void __init proc_init_kmemcache(void)
 	BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE);
 }
 
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 {
 	struct inode *inode;
 	struct proc_inode *ei;
 	struct hlist_node *node;
 	struct super_block *sb;
+	struct dentry *dentry;
 
 	for (;;) {
 		rcu_read_lock();
@@ -135,7 +136,11 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 			continue;
 		}
 
-		d_prune_aliases(inode);
+		dentry = d_find_alias(inode);
+		if (dentry) {
+			d_invalidate(dentry);
+			dput(dentry);
+		}
 		iput(inode);
 		deactivate_super(sb);
 	}
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index ba9a991824a5..fd470172675f 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -210,7 +210,7 @@ extern const struct inode_operations proc_pid_link_inode_operations;
 extern const struct super_operations proc_sops;
 
 void proc_init_kmemcache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 5da9d7f7ae34..b6f5d459b087 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -267,9 +267,9 @@ static void unuse_table(struct ctl_table_header *p)
 			complete(p->unregistering);
 }
 
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
 {
-	proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+	proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
@@ -291,10 +291,10 @@ static void start_unregistering(struct ctl_table_header *p)
 		spin_unlock(&sysctl_lock);
 	}
 	/*
-	 * Prune dentries for unregistered sysctls: namespaced sysctls
+	 * Invalidate dentries for unregistered sysctls: namespaced sysctls
 	 * can have duplicate names and contaminate dcache very badly.
 	 */
-	proc_sys_prune_dcache(p);
+	proc_sys_invalidate_dcache(p);
 	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 5/7] proc: Clear the pieces of proc_inode that proc_evict_inode cares about
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
                                                   ` (3 preceding siblings ...)
  2020-02-20 20:49                                 ` [PATCH 4/7] proc: Use d_invalidate " ebiederm
@ 2020-02-20 20:51                                 ` ebiederm
  2020-02-20 20:52                                 ` [PATCH 6/7] proc: Use a list of inodes to flush from proc ebiederm
                                                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
indoes when a process dies that caused me to tidy this up.

The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/inode.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index c4528c419876..3c9082cd257b 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -33,21 +33,27 @@ static void proc_evict_inode(struct inode *inode)
 {
 	struct proc_dir_entry *de;
 	struct ctl_table_header *head;
+	struct proc_inode *ei = PROC_I(inode);
 
 	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	/* Stop tracking associated processes */
-	put_pid(PROC_I(inode)->pid);
+	if (ei->pid) {
+		put_pid(ei->pid);
+		ei->pid = NULL;
+	}
 
 	/* Let go of any associated proc directory entry */
-	de = PDE(inode);
-	if (de)
+	de = ei->pde;
+	if (de) {
 		pde_put(de);
+		ei->pde = NULL;
+	}
 
-	head = PROC_I(inode)->sysctl;
+	head = ei->sysctl;
 	if (head) {
-		RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+		RCU_INIT_POINTER(ei->sysctl, NULL);
 		proc_sys_evict_inode(inode, head);
 	}
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 6/7] proc: Use a list of inodes to flush from proc
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
                                                   ` (4 preceding siblings ...)
  2020-02-20 20:51                                 ` [PATCH 5/7] proc: Clear the pieces of proc_inode that proc_evict_inode cares about ebiederm
@ 2020-02-20 20:52                                 ` ebiederm
  2020-02-20 20:52                                 ` [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once ebiederm
  2020-02-20 23:02                                 ` [PATCH 0/7] proc: Dentry flushing without proc_mnt Linus Torvalds
  7 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/base.c          | 111 +++++++++++++---------------------------
 fs/proc/inode.c         |   2 +-
 fs/proc/internal.h      |   1 +
 include/linux/pid.h     |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c           |   4 +-
 6 files changed, 44 insertions(+), 79 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c7c64272b0fa..e7efe9d6f3d6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1834,11 +1834,25 @@ void task_dump_owner(struct task_struct *task, umode_t mode,
 	*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+	struct pid *pid = ei->pid;
+
+	if (S_ISDIR(ei->vfs_inode.i_mode)) {
+		spin_lock(&pid->wait_pidfd.lock);
+		hlist_del_init_rcu(&ei->sibling_inodes);
+		spin_unlock(&pid->wait_pidfd.lock);
+	}
+
+	put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
 				  struct task_struct *task, umode_t mode)
 {
 	struct inode * inode;
 	struct proc_inode *ei;
+	struct pid *pid;
 
 	/* We need a new inode */
 
@@ -1856,10 +1870,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
 	/*
 	 * grab the reference to task.
 	 */
-	ei->pid = get_task_pid(task, PIDTYPE_PID);
-	if (!ei->pid)
+	pid = get_task_pid(task, PIDTYPE_PID);
+	if (!pid)
 		goto out_unlock;
 
+	/* Let the pid remember us for quick removal */
+	ei->pid = pid;
+	if (S_ISDIR(mode)) {
+		spin_lock(&pid->wait_pidfd.lock);
+		hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+		spin_unlock(&pid->wait_pidfd.lock);
+	}
+
 	task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
 	security_task_to_inode(task, inode);
 
@@ -3230,90 +3252,29 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
 	.permission	= proc_pid_permission,
 };
 
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
-{
-	struct dentry *dentry, *leader, *dir;
-	char buf[10 + 1];
-	struct qstr name;
-
-	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%u", pid);
-	/* no ->d_hash() rejects on procfs */
-	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
-	if (dentry) {
-		d_invalidate(dentry);
-		dput(dentry);
-	}
-
-	if (pid == tgid)
-		return;
-
-	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
-	leader = d_hash_and_lookup(mnt->mnt_root, &name);
-	if (!leader)
-		goto out;
-
-	name.name = "task";
-	name.len = strlen(name.name);
-	dir = d_hash_and_lookup(leader, &name);
-	if (!dir)
-		goto out_put_leader;
-
-	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%u", pid);
-	dentry = d_hash_and_lookup(dir, &name);
-	if (dentry) {
-		d_invalidate(dentry);
-		dput(dentry);
-	}
-
-	dput(dir);
-out_put_leader:
-	dput(leader);
-out:
-	return;
-}
-
 /**
- * proc_flush_task -  Remove dcache entries for @task from the /proc dcache.
- * @task: task that should be flushed.
+ * proc_flush_pid -  Remove dcache entries for @pid from the /proc dcache.
+ * @pid: pid that should be flushed.
  *
- * When flushing dentries from proc, one needs to flush them from global
- * proc (proc_mnt) and from all the namespaces' procs this task was seen
- * in. This call is supposed to do all of this job.
- *
- * Looks in the dcache for
- * /proc/@pid
- * /proc/@tgid/task/@pid
- * if either directory is present flushes it and all of it'ts children
- * from the dcache.
+ * This function walks a list of inodes (that belong to any proc
+ * filesystem) that are attached to the pid and flushes them from
+ * the dentry cache.
  *
  * It is safe and reasonable to cache /proc entries for a task until
  * that task exits.  After that they just clog up the dcache with
  * useless entries, possibly causing useful dcache entries to be
- * flushed instead.  This routine is proved to flush those useless
- * dcache entries at process exit time.
+ * flushed instead.  This routine is provided to flush those useless
+ * dcache entries when a process is reaped.
  *
  * NOTE: This routine is just an optimization so it does not guarantee
- *       that no dcache entries will exist at process exit time it
- *       just makes it very unlikely that any will persist.
+ *       that no dcache entries will exist after a process is reaped
+ *       it just makes it very unlikely that any will persist.
  */
 
-void proc_flush_task(struct task_struct *task)
+void proc_flush_pid(struct pid *pid)
 {
-	int i;
-	struct pid *pid, *tgid;
-	struct upid *upid;
-
-	pid = task_pid(task);
-	tgid = task_tgid(task);
-
-	for (i = 0; i <= pid->level; i++) {
-		upid = &pid->numbers[i];
-		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
-					tgid->numbers[i].nr);
-	}
+	proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock);
+	put_pid(pid);
 }
 
 static struct dentry *proc_pid_instantiate(struct dentry * dentry,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 3c9082cd257b..979e867be3e5 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -40,7 +40,7 @@ static void proc_evict_inode(struct inode *inode)
 
 	/* Stop tracking associated processes */
 	if (ei->pid) {
-		put_pid(ei->pid);
+		proc_pid_evict_inode(ei);
 		ei->pid = NULL;
 	}
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index fd470172675f..9e294f0290e5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -158,6 +158,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 extern const struct dentry_operations pid_dentry_operations;
 extern int pid_getattr(const struct path *, struct kstat *, u32, unsigned int);
 extern int proc_setattr(struct dentry *, struct iattr *);
+extern void proc_pid_evict_inode(struct proc_inode *);
 extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
 extern void pid_update_inode(struct task_struct *, struct inode *);
 extern int pid_delete_dentry(const struct dentry *);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 998ae7d24450..01a0d4e28506 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -62,6 +62,7 @@ struct pid
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
+	struct hlist_head inodes;
 	/* wait queue for pidfd notifications */
 	wait_queue_head_t wait_pidfd;
 	struct rcu_head rcu;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 3dfa92633af3..40a7982b7285 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -32,7 +32,7 @@ struct proc_ops {
 typedef int (*proc_write_t)(struct file *, char *, size_t);
 
 extern void proc_root_init(void);
-extern void proc_flush_task(struct task_struct *);
+extern void proc_flush_pid(struct pid *);
 
 extern struct proc_dir_entry *proc_symlink(const char *,
 		struct proc_dir_entry *, const char *);
@@ -105,7 +105,7 @@ static inline void proc_root_init(void)
 {
 }
 
-static inline void proc_flush_task(struct task_struct *task)
+static inline void proc_flush_pid(struct pid *pid)
 {
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 2833ffb0c211..502b4995b688 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -191,6 +191,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
 void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
+	struct pid *thread_pid;
 	int zap_leader;
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
@@ -199,11 +200,11 @@ void release_task(struct task_struct *p)
 	atomic_dec(&__task_cred(p)->user->processes);
 	rcu_read_unlock();
 
-	proc_flush_task(p);
 	cgroup_release(p);
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
+	thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);
 
 	/*
@@ -226,6 +227,7 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	proc_flush_pid(thread_pid);
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
                                                   ` (5 preceding siblings ...)
  2020-02-20 20:52                                 ` [PATCH 6/7] proc: Use a list of inodes to flush from proc ebiederm
@ 2020-02-20 20:52                                 ` ebiederm
  2020-02-21 16:50                                   ` Oleg Nesterov
  2020-02-20 23:02                                 ` [PATCH 0/7] proc: Dentry flushing without proc_mnt Linus Torvalds
  7 siblings, 1 reply; 67+ messages in thread
From: ebiederm @ 2020-02-20 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer


When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/exec.c           |  5 +----
 include/linux/pid.h |  1 +
 kernel/pid.c        | 16 ++++++++++++++++
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..3f0bc293442e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1148,11 +1148,8 @@ static int de_thread(struct task_struct *tsk)
 
 		/* Become a process group leader with the old leader's pid.
 		 * The old leader becomes a thread of the this thread group.
-		 * Note: The old leader also uses this pid until release_task
-		 *       is called.  Odd but simple and correct.
 		 */
-		tsk->pid = leader->pid;
-		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		exchange_tids(tsk, leader);
 		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 01a0d4e28506..0f40b5f1c32c 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -101,6 +101,7 @@ extern void attach_pid(struct task_struct *task, enum pid_type);
 extern void detach_pid(struct task_struct *task, enum pid_type);
 extern void change_pid(struct task_struct *task, enum pid_type,
 			struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 0f4ecb57214c..0085b15478fb 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -359,6 +359,22 @@ void change_pid(struct task_struct *task, enum pid_type type,
 	attach_pid(task, type);
 }
 
+void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
+{
+	/* pid_links[PIDTYPE_PID].next is always NULL */
+	struct pid *npid = READ_ONCE(ntask->thread_pid);
+	struct pid *opid = READ_ONCE(otask->thread_pid);
+
+	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(ntask->thread_pid, opid);
+	rcu_assign_pointer(otask->thread_pid, npid);
+	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(ntask->pid, pid_nr(opid));
+	WRITE_ONCE(otask->pid, pid_nr(npid));
+}
+
 /* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 3/7] proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache
  2020-02-20 20:49                                 ` [PATCH 3/7] proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache ebiederm
@ 2020-02-20 22:33                                   ` Linus Torvalds
  0 siblings, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2020-02-20 22:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 12:51 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Don't make it look like rcu_read_lock is held over the entire loop
> instead just take the rcu_read_lock over the part of the loop that
> matters.  This makes the intent of the code a little clearer.

No, this is horrid.

Maybe it makes the intent clearer, but it also causes that "continue"
case to unlock and relock immediately.

And maybe that case never triggers, and that's ok. But then it needs a
big comment about it.

              Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/7] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-02-20 20:49                                 ` [PATCH 4/7] proc: Use d_invalidate " ebiederm
@ 2020-02-20 22:43                                   ` Linus Torvalds
  2020-02-20 22:54                                   ` Al Viro
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2020-02-20 22:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 12:51 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> To use d_invalidate replace d_prune_aliases with d_find_alias
> followed by d_invalidate and dput.  This is safe and complete
> because no inode in proc has any hardlinks or aliases.

Are you sure you can't create them some way?  This makes em go "what
if we had multiple dentries associated with that inode?" Then the code
would just invalidate the first one.

I guess we don't have export_operations or anything like that, but
this makes me worry...

            Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/7] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-02-20 20:49                                 ` [PATCH 4/7] proc: Use d_invalidate " ebiederm
  2020-02-20 22:43                                   ` Linus Torvalds
@ 2020-02-20 22:54                                   ` Al Viro
  2020-02-20 23:00                                     ` Linus Torvalds
  2020-02-20 23:03                                     ` Al Viro
  1 sibling, 2 replies; 67+ messages in thread
From: Al Viro @ 2020-02-20 22:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 02:49:53PM -0600, Eric W. Biederman wrote:
> 
> The function d_prune_aliases has the problem that it will only prune
> aliases thare are completely unused.  It will not remove aliases for
> the dcache or even think of removing mounts from the dcache.  For that
> behavior d_invalidate is needed.
> 
> To use d_invalidate replace d_prune_aliases with d_find_alias
> followed by d_invalidate and dput.  This is safe and complete
> because no inode in proc has any hardlinks or aliases.

s/no inode.*/it's a fucking directory inode./

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/7] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-02-20 22:54                                   ` Al Viro
@ 2020-02-20 23:00                                     ` Linus Torvalds
  2020-02-20 23:03                                     ` Al Viro
  1 sibling, 0 replies; 67+ messages in thread
From: Linus Torvalds @ 2020-02-20 23:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 2:54 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> s/no inode.*/it's a directory inode./

That actually makes my worry go away too. We don't allow aliases for
directory inodes, iirc.

So then it doesn't depend on some /proc implementation issue any more,
then it's fundamental that there's only one dentry.

            Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/7] proc: Dentry flushing without proc_mnt
  2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
                                                   ` (6 preceding siblings ...)
  2020-02-20 20:52                                 ` [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once ebiederm
@ 2020-02-20 23:02                                 ` Linus Torvalds
  2020-02-20 23:07                                   ` Al Viro
  7 siblings, 1 reply; 67+ messages in thread
From: Linus Torvalds @ 2020-02-20 23:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexey Dobriyan,
	Andrew Morton, Andy Lutomirski, Daniel Micay, Djalal Harouni,
	Dmitry V . Levin, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 12:48 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Linus, does this approach look like something you can stand?

A couple of worries, although one of them seem to have already been
resolved by Al.

I think the real gatekeeper should be Al in general.  But other than
the small comments I had, I think this might work just fine.

Al?

           Linus

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/7] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-02-20 22:54                                   ` Al Viro
  2020-02-20 23:00                                     ` Linus Torvalds
@ 2020-02-20 23:03                                     ` Al Viro
  2020-02-20 23:39                                       ` ebiederm
  1 sibling, 1 reply; 67+ messages in thread
From: Al Viro @ 2020-02-20 23:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 10:54:20PM +0000, Al Viro wrote:
> On Thu, Feb 20, 2020 at 02:49:53PM -0600, Eric W. Biederman wrote:
> > 
> > The function d_prune_aliases has the problem that it will only prune
> > aliases thare are completely unused.  It will not remove aliases for
> > the dcache or even think of removing mounts from the dcache.  For that
> > behavior d_invalidate is needed.
> > 
> > To use d_invalidate replace d_prune_aliases with d_find_alias
> > followed by d_invalidate and dput.  This is safe and complete
> > because no inode in proc has any hardlinks or aliases.
> 
> s/no inode.*/it's a fucking directory inode./

Wait... You are using it for sysctls as well?  Ho-hum...  The thing is,
for sysctls you are likely to run into consequent entries with the
same superblock, making for a big pile of useless playing with
->s_active...  And yes, that applied to mainline as well

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/7] proc: Dentry flushing without proc_mnt
  2020-02-20 23:02                                 ` [PATCH 0/7] proc: Dentry flushing without proc_mnt Linus Torvalds
@ 2020-02-20 23:07                                   ` Al Viro
  2020-02-20 23:37                                     ` ebiederm
  0 siblings, 1 reply; 67+ messages in thread
From: Al Viro @ 2020-02-20 23:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

On Thu, Feb 20, 2020 at 03:02:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2020 at 12:48 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Linus, does this approach look like something you can stand?
> 
> A couple of worries, although one of them seem to have already been
> resolved by Al.
> 
> I think the real gatekeeper should be Al in general.  But other than
> the small comments I had, I think this might work just fine.
> 
> Al?

I'll need to finish RTFS there; I have initially misread that patch,
actually - Eric _is_ using that thing both for those directories
and for sysctl inodes.  And the prototype for that machinery (the
one he'd pulled from proc_sysctl.c) is playing with pinning superblocks
way too much; for per-pid directories that's not an issue, but
for sysctl table removal you are very likely to hit a bunch of
evictees on the same superblock...

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 0/7] proc: Dentry flushing without proc_mnt
  2020-02-20 23:07                                   ` Al Viro
@ 2020-02-20 23:37                                     ` ebiederm
  0 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 23:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Feb 20, 2020 at 03:02:22PM -0800, Linus Torvalds wrote:
>> On Thu, Feb 20, 2020 at 12:48 PM Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>> >
>> > Linus, does this approach look like something you can stand?
>> 
>> A couple of worries, although one of them seem to have already been
>> resolved by Al.
>> 
>> I think the real gatekeeper should be Al in general.  But other than
>> the small comments I had, I think this might work just fine.
>> 
>> Al?
>
> I'll need to finish RTFS there; I have initially misread that patch,
> actually - Eric _is_ using that thing both for those directories
> and for sysctl inodes.  And the prototype for that machinery (the
> one he'd pulled from proc_sysctl.c) is playing with pinning superblocks
> way too much; for per-pid directories that's not an issue, but
> for sysctl table removal you are very likely to hit a bunch of
> evictees on the same superblock...

I saw that was possible.  If the broad strokes look correct I don't have
a problem at all with optimizing for the case where many of the entries
are for inodes on the same superblock.  I just had enough other details
on my mind I was afraid if I got a little more clever I would have
introduced a typo somewhere.


I wish I could limit the sysctl parts to just directories, but
unfortunately the sysctl tables don't always give a guarantee that a
directory is what will be removed.  But sysctls do have one name per
inode invarant like fat.  There is no way to express a sysctl
table that doesn't have that invariant.

As for d_find_alias/d_invalidate.

Just for completeness I wanted to write a loop:

	while (dentry = d_find_alias(inode)) {
        	d_invalidate(dentry);
                dput(dentry);
        }

Unfortunately that breaks on directories, because for directories
d_find_alias turns into d_find_any_alias, and continues to return aliases
even when they are unhashed.

It might be nice to write a cousin of d_prune_aliases call
it d_invalidate_aliases that just does that loop the correct way
in dcache.c

Eric

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 4/7] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-02-20 23:03                                     ` Al Viro
@ 2020-02-20 23:39                                       ` ebiederm
  0 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-20 23:39 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Oleg Nesterov, Solar Designer

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Thu, Feb 20, 2020 at 10:54:20PM +0000, Al Viro wrote:
>> On Thu, Feb 20, 2020 at 02:49:53PM -0600, Eric W. Biederman wrote:
>> > 
>> > The function d_prune_aliases has the problem that it will only prune
>> > aliases thare are completely unused.  It will not remove aliases for
>> > the dcache or even think of removing mounts from the dcache.  For that
>> > behavior d_invalidate is needed.
>> > 
>> > To use d_invalidate replace d_prune_aliases with d_find_alias
>> > followed by d_invalidate and dput.  This is safe and complete
>> > because no inode in proc has any hardlinks or aliases.
>> 
>> s/no inode.*/it's a fucking directory inode./
>
> Wait... You are using it for sysctls as well?  Ho-hum...  The thing is,
> for sysctls you are likely to run into consequent entries with the
> same superblock, making for a big pile of useless playing with
> ->s_active...  And yes, that applied to mainline as well

Which is why I worked to merge the two cases since they were so close.
Fewer things to fix and more eyeballs on the code.

Eric



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once
  2020-02-20 20:52                                 ` [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once ebiederm
@ 2020-02-21 16:50                                   ` Oleg Nesterov
  2020-02-22 15:46                                     ` ebiederm
  0 siblings, 1 reply; 67+ messages in thread
From: Oleg Nesterov @ 2020-02-21 16:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Al Viro, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Solar Designer

On 02/20, Eric W. Biederman wrote:
>
> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> +	/* pid_links[PIDTYPE_PID].next is always NULL */
> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
> +	struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(ntask->thread_pid, opid);
> +	rcu_assign_pointer(otask->thread_pid, npid);

this breaks has_group_leader_pid()...

proc_pid_readdir() can miss a process doing mt-exec but this looks fixable,
just we need to update ntask->thread_pid before updating ->first.

The more problematic case is __exit_signal() which does
		
		if (unlikely(has_group_leader_pid(tsk)))
			posix_cpu_timers_exit_group(tsk);

Oleg.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once
  2020-02-21 16:50                                   ` Oleg Nesterov
@ 2020-02-22 15:46                                     ` ebiederm
  0 siblings, 0 replies; 67+ messages in thread
From: ebiederm @ 2020-02-22 15:46 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Al Viro, LKML, Kernel Hardening, Linux API,
	Linux FS Devel, Linux Security Module, Akinobu Mita,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Solar Designer

Oleg Nesterov <oleg@redhat.com> writes:

> On 02/20, Eric W. Biederman wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +	/* pid_links[PIDTYPE_PID].next is always NULL */
>> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +	struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(ntask->thread_pid, opid);
>> +	rcu_assign_pointer(otask->thread_pid, npid);
>
> this breaks has_group_leader_pid()...
>
> proc_pid_readdir() can miss a process doing mt-exec but this looks fixable,
> just we need to update ntask->thread_pid before updating ->first.
>
> The more problematic case is __exit_signal() which does
> 		
> 		if (unlikely(has_group_leader_pid(tsk)))
> 			posix_cpu_timers_exit_group(tsk);

Along with the comment:
		/*
		 * This can only happen if the caller is de_thread().
		 * FIXME: this is the temporary hack, we should teach
		 * posix-cpu-timers to handle this case correctly.
		 */
So I suspect this is fixable and the above fix might be part of that.

Hmm looking at your commit:

commit e0a70217107e6f9844628120412cb27bb4cea194
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Fri Nov 5 16:53:42 2010 +0100

    posix-cpu-timers: workaround to suppress the problems with mt exec
    
    posix-cpu-timers.c correctly assumes that the dying process does
    posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
    timers from signal->cpu_timers list.
    
    But, it also assumes that timer->it.cpu.task is always the group
    leader, and thus the dead ->task means the dead thread group.
    
    This is obviously not true after de_thread() changes the leader.
    After that almost every posix_cpu_timer_ method has problems.
    
    It is not simple to fix this bug correctly. First of all, I think
    that timer->it.cpu should use struct pid instead of task_struct.
    Also, the locking should be reworked completely. In particular,
    tasklist_lock should not be used at all. This all needs a lot of
    nontrivial and hard-to-test changes.
    
    Change __exit_signal() to do posix_cpu_timers_exit_group() when
    the old leader dies during exec. This is not the fix, just the
    temporary hack to hide the problem for 2.6.37 and stable. IOW,
    this is obviously wrong but this is what we currently have anyway:
    cpu timers do not work after mt exec.
    
    In theory this change adds another race. The exiting leader can
    detach the timers which were attached to the new leader. However,
    the window between de_thread() and release_task() is small, we
    can pretend that sys_timer_create() was called before de_thread().
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

It looks like the data structures need fixing.  Possibly to use struct
pid.  Possibly to move the group data to signal struct.

I think I played with some of that awhile ago.

I am going to move this change to another patchset.  So I don't wind up
playing shift the bug around.  I thought I would need this to get the
other code working but it turns out we remain bug compatible without
this.

Hopefully I can get something out in the next week or so that addresses
the issues you have pointed out.

Eric


^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, back to index

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-10 15:05 [PATCH v8 00/11] proc: modernize proc to support multiple private instances Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 01/11] proc: Rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 02/11] proc: add proc_fs_info struct to store proc information Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 03/11] proc: move /proc/{self|thread-self} dentries to proc_fs_info Alexey Gladkov
2020-02-10 18:23   ` Andy Lutomirski
2020-02-12 15:00     ` Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 04/11] proc: move hide_pid, pid_gid from pid_namespace " Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 05/11] proc: add helpers to set and get proc hidepid and gid mount options Alexey Gladkov
2020-02-10 18:30   ` Andy Lutomirski
2020-02-12 14:57     ` Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 06/11] proc: support mounting procfs instances inside same pid namespace Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances Alexey Gladkov
2020-02-10 17:46   ` Linus Torvalds
2020-02-10 19:23     ` Al Viro
2020-02-11  1:36   ` ebiederm
2020-02-11  4:01     ` ebiederm
2020-02-12 14:49     ` Alexey Gladkov
2020-02-12 14:59       ` ebiederm
2020-02-12 17:08         ` Alexey Gladkov
2020-02-12 18:45         ` Linus Torvalds
2020-02-12 19:16           ` ebiederm
2020-02-12 19:49             ` Linus Torvalds
2020-02-12 20:03               ` Al Viro
2020-02-12 20:35                 ` Linus Torvalds
2020-02-12 20:38                   ` Al Viro
2020-02-12 20:41                     ` Al Viro
2020-02-12 21:02                       ` Linus Torvalds
2020-02-12 21:46                         ` ebiederm
2020-02-13  0:48                           ` Linus Torvalds
2020-02-13  4:37                             ` ebiederm
2020-02-13  5:55                               ` Al Viro
2020-02-13 21:30                                 ` Linus Torvalds
2020-02-13 22:23                                   ` Al Viro
2020-02-13 22:47                                     ` Linus Torvalds
2020-02-14 14:15                                       ` ebiederm
2020-02-14  3:48                                 ` ebiederm
2020-02-20 20:46                               ` [PATCH 0/7] proc: Dentry flushing without proc_mnt ebiederm
2020-02-20 20:47                                 ` [PATCH 1/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes ebiederm
2020-02-20 20:48                                 ` [PATCH 2/7] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache ebiederm
2020-02-20 20:49                                 ` [PATCH 3/7] proc: Mov rcu_read_(lock|unlock) in proc_prune_siblings_dcache ebiederm
2020-02-20 22:33                                   ` Linus Torvalds
2020-02-20 20:49                                 ` [PATCH 4/7] proc: Use d_invalidate " ebiederm
2020-02-20 22:43                                   ` Linus Torvalds
2020-02-20 22:54                                   ` Al Viro
2020-02-20 23:00                                     ` Linus Torvalds
2020-02-20 23:03                                     ` Al Viro
2020-02-20 23:39                                       ` ebiederm
2020-02-20 20:51                                 ` [PATCH 5/7] proc: Clear the pieces of proc_inode that proc_evict_inode cares about ebiederm
2020-02-20 20:52                                 ` [PATCH 6/7] proc: Use a list of inodes to flush from proc ebiederm
2020-02-20 20:52                                 ` [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once ebiederm
2020-02-21 16:50                                   ` Oleg Nesterov
2020-02-22 15:46                                     ` ebiederm
2020-02-20 23:02                                 ` [PATCH 0/7] proc: Dentry flushing without proc_mnt Linus Torvalds
2020-02-20 23:07                                   ` Al Viro
2020-02-20 23:37                                     ` ebiederm
2020-02-14  3:49                     ` [PATCH v8 07/11] proc: flush task dcache entries from all procfs instances ebiederm
2020-02-12 19:47           ` Al Viro
2020-02-11 22:45   ` Al Viro
2020-02-12 14:26     ` Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 08/11] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
2020-02-10 16:29   ` Jordan Glover
2020-02-12 14:34     ` Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 09/11] proc: add option to mount only a pids subset Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 10/11] docs: proc: add documentation for "hidepid=4" and "subset=pidfs" options and new mount behavior Alexey Gladkov
2020-02-10 18:29   ` Andy Lutomirski
2020-02-12 16:03     ` Alexey Gladkov
2020-02-10 15:05 ` [PATCH v8 11/11] proc: Move hidepid values to uapi as they are user interface to mount Alexey Gladkov

Kernel-hardening archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kernel-hardening/0 kernel-hardening/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kernel-hardening kernel-hardening/ https://lore.kernel.org/kernel-hardening \
		kernel-hardening@lists.openwall.com
	public-inbox-index kernel-hardening

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/com.openwall.lists.kernel-hardening


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git