linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v12 0/7] proc: modernize proc to support multiple private instances
@ 2020-04-19 14:10 Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 1/7] proc: rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
                   ` (7 more replies)
  0 siblings, 8 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

Preface:
--------
This is patch set v12 to modernize procfs and make it able to support multiple
private instances per the same pid namespace.

This patch set can be applied on top of:

git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.7-rc1-299-gc21cca0fb71d


Procfs modernization:
---------------------
Historically procfs was always tied to pid namespaces, during pid
namespace creation we internally create a procfs mount for it. However,
this has the effect that all new procfs mounts are just a mirror of the
internal one, any change, any mount option update, any new future
introduction will propagate to all other procfs mounts that are in the
same pid namespace.

This may have solved several use cases in that time. However today we
face new requirements, and making procfs able to support new private
instances inside same pid namespace seems a major point. If we want to
to introduce new features and security mechanisms we have to make sure
first that we do not break existing usecases. Supporting private procfs
instances will allow to support new features and behaviour without
propagating it to all other procfs mounts.

Today procfs is more of a burden especially to some Embedded, IoT,
sandbox, container use cases. In user space we are over-mounting null
or inaccessible files on top to hide files and information. If we want
to hide pids we have to create PID namespaces otherwise mount options
propagate to all other proc mounts, changing a mount option value in one
mount will propagate to all other proc mounts. If we want to introduce
new features, then they will propagate to all other mounts too, resulting
either maybe new useful functionality or maybe breaking stuff. We have
also to note that userspace should not workaround procfs, the kernel
should just provide a sane simple interface.

In this regard several developers and maintainers pointed out that
there are problems with procfs and it has to be modernized:

"Here's another one: split up and modernize /proc." by Andy Lutomirski [1]

Discussion about kernel pointer leaks:

"And yes, as Kees and Daniel mentioned, it's definitely not just dmesg.
In fact, the primary things tend to be /proc and /sys, not dmesg
itself." By Linus Torvalds [2]

Lot of other areas in the kernel and filesystems have been updated to be
able to support private instances, devpts is one major example [3].

Which will be used for:

1) Embedded systems and IoT: usually we have one supervisor for
apps, we have some lightweight sandbox support, however if we create
pid namespaces we have to manage all the processes inside too,
where our goal is to be able to run a bunch of apps each one inside
its own mount namespace, maybe use network namespaces for vlans
setups, but right now we only want mount namespaces, without all the
other complexity. We want procfs to behave more like a real file system,
and block access to inodes that belong to other users. The 'hidepid=' will
not work since it is a shared mount option.

2) Containers, sandboxes and Private instances of file systems - devpts case
Historically, lot of file systems inside Linux kernel view when instantiated
were just a mirror of an already created and mounted filesystem. This was the
case of devpts filesystem, it seems at that time the requirements were to
optimize things and reuse the same memory, etc. This design used to work but not
anymore with today's containers, IoT, hostile environments and all the privacy
challenges that Linux faces.

In that regards, devpts was updated so that each new mounts is a total
independent file system by the following patches:

"devpts: Make each mount of devpts an independent filesystem" by
Eric W. Biederman [3] [4]

3) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which user can not.

Side notes:

* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change all LSMs should be able to analyze
open/read/write/close... on /proc/<pid>/

4) This will allow to implement new features either in kernel or
userspace without having to worry about procfs.
In containers, sandboxes, etc we have workarounds to hide some /proc
inodes, this should be supported natively without doing extra complex
work, the kernel should be able to support sane options that work with
today and future Linux use cases.

5) Creation of new superblock with all procfs options for each procfs
mount will fix the ignoring of mount options. The problem is that the
second mount of procfs in the same pid namespace ignores the mount
options. The mount options are ignored without error until procfs is
remounted.

Before:

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0

# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+++ exited with 0 +++

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

# mount -o remount,hidepid=1 -t proc proc /tmp/proc

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=1 0 0
proc /tmp/proc proc rw,relatime,hidepid=1 0 0

After:

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=ptraceable 0 0

# mount -o hidepid=invisible -t proc proc /tmp/proc

# grep ^proc /proc/mounts
proc /proc proc rw,relatime,hidepid=ptraceable 0 0
proc /tmp/proc proc rw,relatime,hidepid=invisible 0 0


Introduced changes:
-------------------
Each mount of procfs creates a separate procfs instance with its own
mount options.

This series adds few new mount options:

* New 'hidepid=ptraceable' or 'hidepid=4' mount option to show only ptraceable
processes in the procfs. This allows to support lightweight sandboxes in
Embedded Linux, also solves the case for LSM where now with this mount option,
we make sure that they have a ptrace path in procfs.

* 'subset=pid' that allows to hide non-pid inodes from procfs. It can be used
in containers and sandboxes, as these are already trying to hide and block
access to procfs inodes anyway.


ChangeLog:
----------
# v12:
* Rebase on top of v5.7-rc1.
* Fix a resource leak if proc is not mounted or if proc is simply reconfigured.
* Add few selftests.

# v11:
* After a discussion with Eric W. Biederman, the numerical values for hidepid 
  parameter have been removed from uapi.
* Remove proc_self and proc_thread_self from the pid_namespace struct.
* I took into account the comment of Kees Cook.
* Update Reviewed-by tags.

# v10:
* 'subset=pidfs' renamed to 'subset=pid' as suggested by Alexey Dobriyan.
* Include Reviewed-by tags.

# v9:
* Rebase on top of Eric W. Biederman's procfs changes.
* Add human readable values of 'hidepid' as suggested by Andy Lutomirski.

# v8:
* Started using RCU lock to clean dcache entries as suggested by Linus Torvalds.

# v7:
* 'pidonly=1' renamed to 'subset=pidfs' as suggested by Alexey Dobriyan.
* HIDEPID_* moved to uapi/ as they are user interface to mount().
  Suggested-by Alexey Dobriyan <adobriyan@gmail.com>

# v6:
* 'hidepid=' and 'gid=' mount options are moved from pid namespace to superblock.
* 'newinstance' mount option removed as suggested by Eric W. Biederman.
   Mount of procfs always creates a new instance.
* 'limit_pids' renamed to 'hidepid=3'.
* I took into account the comment of Linus Torvalds [7].
* Documentation added.

# v5:
* Fixed a bug that caused a problem with the Fedora boot.
* The 'pidonly' option is visible among the mount options.

# v2:
* Renamed mount options to 'newinstance' and 'pids='
   Suggested-by: Andy Lutomirski <luto@kernel.org>
* Fixed order of commit, Suggested-by: Andy Lutomirski <luto@kernel.org>
* Many bug fixes.

# v1:
* Removed 'unshared' mount option and replaced it with 'limit_pids'
   which is attached to the current procfs mount.
   Suggested-by Andy Lutomirski <luto@kernel.org>
* Do not fill dcache with pid entries that we can not ptrace.
* Many bug fixes.


References:
-----------
[1] https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-January/004215.html
[2] http://www.openwall.com/lists/kernel-hardening/2017/10/05/5
[3] https://lwn.net/Articles/689539/
[4] http://lxr.free-electrons.com/source/Documentation/filesystems/devpts.txt?v=3.14
[5] https://lkml.org/lkml/2017/5/2/407
[6] https://lkml.org/lkml/2017/5/3/357
[7] https://lkml.org/lkml/2018/5/11/505


Alexey Gladkov (7):
  proc: rename struct proc_fs_info to proc_fs_opts
  proc: allow to mount many instances of proc in one pid namespace
  proc: instantiate only pids that we can ptrace on 'hidepid=4' mount
    option
  proc: add option to mount only a pids subset
  docs: proc: add documentation for "hidepid=4" and "subset=pid" options
    and new mount behavior
  proc: use human-readable values for hidepid
  proc: use named enums for better readability

 Documentation/filesystems/proc.rst            |  92 +++++++++---
 fs/proc/base.c                                |  48 +++++--
 fs/proc/generic.c                             |   9 ++
 fs/proc/inode.c                               |  30 +++-
 fs/proc/root.c                                | 131 +++++++++++++-----
 fs/proc/self.c                                |   6 +-
 fs/proc/thread_self.c                         |   6 +-
 fs/proc_namespace.c                           |  14 +-
 include/linux/pid_namespace.h                 |  12 --
 include/linux/proc_fs.h                       |  30 +++-
 tools/testing/selftests/proc/.gitignore       |   2 +
 tools/testing/selftests/proc/Makefile         |   2 +
 .../selftests/proc/proc-fsconfig-hidepid.c    |  50 +++++++
 .../selftests/proc/proc-multiple-procfs.c     |  48 +++++++
 14 files changed, 384 insertions(+), 96 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-fsconfig-hidepid.c
 create mode 100644 tools/testing/selftests/proc/proc-multiple-procfs.c

-- 
2.25.3


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v12 1/7] proc: rename struct proc_fs_info to proc_fs_opts
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 2/7] proc: allow to mount many instances of proc in one pid namespace Alexey Gladkov
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/proc_namespace.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..9a8b624bc3db 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -37,23 +37,23 @@ static __poll_t mounts_poll(struct file *file, poll_table *wait)
 	return res;
 }
 
-struct proc_fs_info {
+struct proc_fs_opts {
 	int flag;
 	const char *str;
 };
 
 static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 {
-	static const struct proc_fs_info fs_info[] = {
+	static const struct proc_fs_opts fs_opts[] = {
 		{ SB_SYNCHRONOUS, ",sync" },
 		{ SB_DIRSYNC, ",dirsync" },
 		{ SB_MANDLOCK, ",mand" },
 		{ SB_LAZYTIME, ",lazytime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = fs_opts; fs_infop->flag; fs_infop++) {
 		if (sb->s_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
@@ -63,7 +63,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
 
 static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 {
-	static const struct proc_fs_info mnt_info[] = {
+	static const struct proc_fs_opts mnt_opts[] = {
 		{ MNT_NOSUID, ",nosuid" },
 		{ MNT_NODEV, ",nodev" },
 		{ MNT_NOEXEC, ",noexec" },
@@ -72,9 +72,9 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_RELATIME, ",relatime" },
 		{ 0, NULL }
 	};
-	const struct proc_fs_info *fs_infop;
+	const struct proc_fs_opts *fs_infop;
 
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
+	for (fs_infop = mnt_opts; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v12 2/7] proc: allow to mount many instances of proc in one pid namespace
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 1/7] proc: rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
  2020-04-23 11:28   ` [PATCH v13 " Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 3/7] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.

1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.

2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.

Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...

In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.

Selftest has been added to verify new behavior.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/proc/base.c                                | 31 +++++++-----
 fs/proc/inode.c                               | 11 ++---
 fs/proc/root.c                                | 49 +++++++++----------
 fs/proc/self.c                                |  6 +--
 fs/proc/thread_self.c                         |  6 +--
 include/linux/pid_namespace.h                 | 12 -----
 include/linux/proc_fs.h                       | 22 ++++++++-
 tools/testing/selftests/proc/.gitignore       |  1 +
 tools/testing/selftests/proc/Makefile         |  1 +
 .../selftests/proc/proc-multiple-procfs.c     | 48 ++++++++++++++++++
 10 files changed, 124 insertions(+), 63 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-multiple-procfs.c

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6042b646ab27..93b5d05c142c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -697,13 +697,13 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
  */
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
-	if (pid->hide_pid < hide_pid_min)
+	if (fs_info->hide_pid < hide_pid_min)
 		return true;
-	if (in_group_p(pid->pid_gid))
+	if (in_group_p(fs_info->pid_gid))
 		return true;
 	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -711,18 +711,18 @@ static bool has_pid_permissions(struct pid_namespace *pid,
 
 static int proc_pid_permission(struct inode *inode, int mask)
 {
-	struct pid_namespace *pid = proc_pid_ns(inode);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct task_struct *task;
 	bool has_perms;
 
 	task = get_proc_task(inode);
 	if (!task)
 		return -ESRCH;
-	has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+	has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
 	put_task_struct(task);
 
 	if (!has_perms) {
-		if (pid->hide_pid == HIDEPID_INVISIBLE) {
+		if (fs_info->hide_pid == HIDEPID_INVISIBLE) {
 			/*
 			 * Let's make getdents(), stat(), and open()
 			 * consistent with each other.  If a process
@@ -1897,7 +1897,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_inode(path->dentry);
-	struct pid_namespace *pid = proc_pid_ns(inode);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct task_struct *task;
 
 	generic_fillattr(inode, stat);
@@ -1907,7 +1907,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 	rcu_read_lock();
 	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 	if (task) {
-		if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
 			rcu_read_unlock();
 			/*
 			 * This doesn't prevent learning whether PID exists,
@@ -3301,6 +3301,7 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 {
 	struct task_struct *task;
 	unsigned tgid;
+	struct proc_fs_info *fs_info;
 	struct pid_namespace *ns;
 	struct dentry *result = ERR_PTR(-ENOENT);
 
@@ -3308,7 +3309,8 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 	if (tgid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
 	if (task)
@@ -3372,6 +3374,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct tgid_iter iter;
+	struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
 	struct pid_namespace *ns = proc_pid_ns(file_inode(file));
 	loff_t pos = ctx->pos;
 
@@ -3379,13 +3382,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		return 0;
 
 	if (pos == TGID_OFFSET - 2) {
-		struct inode *inode = d_inode(ns->proc_self);
+		struct inode *inode = d_inode(fs_info->proc_self);
 		if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
 			return 0;
 		ctx->pos = pos = pos + 1;
 	}
 	if (pos == TGID_OFFSET - 1) {
-		struct inode *inode = d_inode(ns->proc_thread_self);
+		struct inode *inode = d_inode(fs_info->proc_thread_self);
 		if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
 			return 0;
 		ctx->pos = pos = pos + 1;
@@ -3399,7 +3402,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		unsigned int len;
 
 		cond_resched();
-		if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
 		len = snprintf(name, sizeof(name), "%u", iter.tgid);
@@ -3599,6 +3602,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	struct task_struct *task;
 	struct task_struct *leader = get_proc_task(dir);
 	unsigned tid;
+	struct proc_fs_info *fs_info;
 	struct pid_namespace *ns;
 	struct dentry *result = ERR_PTR(-ENOENT);
 
@@ -3609,7 +3613,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	if (tid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tid, ns);
 	if (task)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index fb4cace9ea41..9c756531282a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -167,13 +167,12 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
 
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
-	struct super_block *sb = root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
 
-	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
-		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
-	if (pid->hide_pid != HIDEPID_OFF)
-		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+	if (!gid_eq(fs_info->pid_gid, GLOBAL_ROOT_GID))
+		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
+	if (fs_info->hide_pid != HIDEPID_OFF)
+		seq_printf(seq, ",hidepid=%u", fs_info->hide_pid);
 
 	return 0;
 }
diff --git a/fs/proc/root.c b/fs/proc/root.c
index cdbe9293ea55..208989274923 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -77,26 +77,31 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct super_block *s,
+static void proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
-			       struct pid_namespace *pid_ns,
 			       struct user_namespace *user_ns)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 
 	if (ctx->mask & (1 << Opt_gid))
-		pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
-		pid_ns->hide_pid = ctx->hidepid;
+		fs_info->hide_pid = ctx->hidepid;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 {
-	struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
+	struct proc_fs_context *ctx = fc->fs_private;
 	struct inode *root_inode;
+	struct proc_fs_info *fs_info;
 	int ret;
 
-	proc_apply_options(s, fc, pid_ns, current_user_ns());
+	fs_info = kzalloc(sizeof(*fs_info), GFP_KERNEL);
+	if (!fs_info)
+		return -ENOMEM;
+
+	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	proc_apply_options(fs_info, fc, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -106,6 +111,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_magic = PROC_SUPER_MAGIC;
 	s->s_op = &proc_sops;
 	s->s_time_gran = 1;
+	s->s_fs_info = fs_info;
 
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
@@ -113,7 +119,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	 * top of it
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
+
 	/* procfs dentries and inodes don't require IO to create */
 	s->s_shrink.seeks = 0;
 
@@ -140,19 +146,17 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
 	struct super_block *sb = fc->root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
 	sync_filesystem(sb);
 
-	proc_apply_options(sb, fc, pid, current_user_ns());
+	proc_apply_options(fs_info, fc, current_user_ns());
 	return 0;
 }
 
 static int proc_get_tree(struct fs_context *fc)
 {
-	struct proc_fs_context *ctx = fc->fs_private;
-
-	return get_tree_keyed(fc, proc_fill_super, ctx->pid_ns);
+	return get_tree_nodev(fc, proc_fill_super);
 }
 
 static void proc_fs_context_free(struct fs_context *fc)
@@ -188,22 +192,17 @@ static int proc_init_fs_context(struct fs_context *fc)
 
 static void proc_kill_sb(struct super_block *sb)
 {
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
-	ns = (struct pid_namespace *)sb->s_fs_info;
-	if (ns->proc_self)
-		dput(ns->proc_self);
-	if (ns->proc_thread_self)
-		dput(ns->proc_thread_self);
-	kill_anon_super(sb);
+	if (fs_info->proc_self)
+		dput(fs_info->proc_self);
 
-	/* Make the pid namespace safe for the next mount of proc */
-	ns->proc_self = NULL;
-	ns->proc_thread_self = NULL;
-	ns->pid_gid = GLOBAL_ROOT_GID;
-	ns->hide_pid = 0;
+	if (fs_info->proc_thread_self)
+		dput(fs_info->proc_thread_self);
 
-	put_pid_ns(ns);
+	kill_anon_super(sb);
+	put_pid_ns(fs_info->pid_ns);
+	kfree(fs_info);
 }
 
 static struct file_system_type proc_fs_type = {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 57c0a1047250..309301ac0136 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -36,10 +36,10 @@ static unsigned self_inum __ro_after_init;
 int proc_setup_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = proc_pid_ns(root_inode);
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 	struct dentry *self;
 	int ret = -ENOMEM;
-	
+
 	inode_lock(root_inode);
 	self = d_alloc_name(s->s_root, "self");
 	if (self) {
@@ -62,7 +62,7 @@ int proc_setup_self(struct super_block *s)
 	if (ret)
 		pr_err("proc_fill_super: can't allocate /proc/self\n");
 	else
-		ns->proc_self = self;
+		fs_info->proc_self = self;
 
 	return ret;
 }
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index f61ae53533f5..2493cbbdfa6f 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -36,7 +36,7 @@ static unsigned thread_self_inum __ro_after_init;
 int proc_setup_thread_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = proc_pid_ns(root_inode);
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 	struct dentry *thread_self;
 	int ret = -ENOMEM;
 
@@ -60,9 +60,9 @@ int proc_setup_thread_self(struct super_block *s)
 	inode_unlock(root_inode);
 
 	if (ret)
-		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
+		pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
 	else
-		ns->proc_thread_self = thread_self;
+		fs_info->proc_thread_self = thread_self;
 
 	return ret;
 }
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 4956e362e55e..5a5cb45ac57e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -17,12 +17,6 @@
 
 struct fs_pin;
 
-enum { /* definitions for pid_namespace's hide_pid field */
-	HIDEPID_OFF	  = 0,
-	HIDEPID_NO_ACCESS = 1,
-	HIDEPID_INVISIBLE = 2,
-};
-
 struct pid_namespace {
 	struct kref kref;
 	struct idr idr;
@@ -32,17 +26,11 @@ struct pid_namespace {
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
 	struct pid_namespace *parent;
-#ifdef CONFIG_PROC_FS
-	struct dentry *proc_self;
-	struct dentry *proc_thread_self;
-#endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct fs_pin *bacct;
 #endif
 	struct user_namespace *user_ns;
 	struct ucounts *ucounts;
-	kgid_t pid_gid;
-	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 } __randomize_layout;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 45c05fd9c99d..1b98a41fdd8a 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -42,6 +42,26 @@ struct proc_ops {
 	unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 } __randomize_layout;
 
+/* definitions for hide_pid field */
+enum {
+	HIDEPID_OFF	  = 0,
+	HIDEPID_NO_ACCESS = 1,
+	HIDEPID_INVISIBLE = 2,
+};
+
+struct proc_fs_info {
+	struct pid_namespace *pid_ns;
+	struct dentry *proc_self;        /* For /proc/self */
+	struct dentry *proc_thread_self; /* For /proc/thread-self */
+	kgid_t pid_gid;
+	int hide_pid;
+};
+
+static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 #ifdef CONFIG_PROC_FS
 
 typedef int (*proc_write_t)(struct file *, char *, size_t);
@@ -176,7 +196,7 @@ int open_related_ns(struct ns_common *ns,
 /* get the associated pid namespace for a file in procfs */
 static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
 {
-	return inode->i_sb->s_fs_info;
+	return proc_sb_info(inode->i_sb)->pid_ns;
 }
 
 #endif /* _LINUX_PROC_FS_H */
diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 4bca5a9327a4..126901c5d6f4 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -3,6 +3,7 @@
 /fd-002-posix-eq
 /fd-003-kthread
 /proc-loadavg-001
+/proc-multiple-procfs
 /proc-pid-vm
 /proc-self-map-files-001
 /proc-self-map-files-002
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index a8ed0f684829..bf22457256be 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -19,5 +19,6 @@ TEST_GEN_PROGS += self
 TEST_GEN_PROGS += setns-dcache
 TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
+TEST_GEN_PROGS += proc-multiple-procfs
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-multiple-procfs.c b/tools/testing/selftests/proc/proc-multiple-procfs.c
new file mode 100644
index 000000000000..ab912ad95dab
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-multiple-procfs.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright © 2020 Alexey Gladkov <gladkov.alexey@gmail.com>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+#include <assert.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+int main(void)
+{
+	struct stat proc_st1, proc_st2;
+	char procbuff[] = "/tmp/proc.XXXXXX/meminfo";
+	char procdir1[] = "/tmp/proc.XXXXXX";
+	char procdir2[] = "/tmp/proc.XXXXXX";
+
+	assert(mkdtemp(procdir1) != NULL);
+	assert(mkdtemp(procdir2) != NULL);
+
+	assert(!mount("proc", procdir1, "proc", 0, "hidepid=1"));
+	assert(!mount("proc", procdir2, "proc", 0, "hidepid=2"));
+
+	snprintf(procbuff, sizeof(procbuff), "%s/meminfo", procdir1);
+	assert(!stat(procbuff, &proc_st1));
+
+	snprintf(procbuff, sizeof(procbuff), "%s/meminfo", procdir2);
+	assert(!stat(procbuff, &proc_st2));
+
+	umount(procdir1);
+	umount(procdir2);
+
+	assert(proc_st1.st_dev != proc_st2.st_dev);
+
+	return 0;
+}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v12 3/7] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 1/7] proc: rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 2/7] proc: allow to mount many instances of proc in one pid namespace Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 4/7] proc: add option to mount only a pids subset Alexey Gladkov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

If "hidepid=4" mount option is set then do not instantiate pids that
we can not ptrace. "hidepid=4" means that procfs should only contain
pids that the caller can ptrace.

Signed-off-by: Djalal Harouni <tixxdz@gmail.com>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/proc/base.c          | 15 +++++++++++++++
 fs/proc/root.c          | 13 ++++++++++---
 include/linux/proc_fs.h |  1 +
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 93b5d05c142c..a52a91e90c25 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -701,6 +701,14 @@ static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
+	/*
+	 * If 'hidpid' mount option is set force a ptrace check,
+	 * we indicate that we are using a filesystem syscall
+	 * by passing PTRACE_MODE_READ_FSCREDS
+	 */
+	if (fs_info->hide_pid == HIDEPID_NOT_PTRACEABLE)
+		return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+
 	if (fs_info->hide_pid < hide_pid_min)
 		return true;
 	if (in_group_p(fs_info->pid_gid))
@@ -3319,7 +3327,14 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 	if (!task)
 		goto out;
 
+	/* Limit procfs to only ptraceable tasks */
+	if (fs_info->hide_pid == HIDEPID_NOT_PTRACEABLE) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS))
+			goto out_put_task;
+	}
+
 	result = proc_pid_instantiate(dentry, task, NULL);
+out_put_task:
 	put_task_struct(task);
 out:
 	return result;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 208989274923..8f23b951d685 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -47,6 +47,14 @@ static const struct fs_parameter_spec proc_fs_parameters[] = {
 	{}
 };
 
+static inline int valid_hidepid(unsigned int value)
+{
+	return (value == HIDEPID_OFF ||
+		value == HIDEPID_NO_ACCESS ||
+		value == HIDEPID_INVISIBLE ||
+		value == HIDEPID_NOT_PTRACEABLE);
+}
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -63,10 +71,9 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		break;
 
 	case Opt_hidepid:
+		if (!valid_hidepid(result.uint_32))
+			return invalf(fc, "proc: unknown value of hidepid.\n");
 		ctx->hidepid = result.uint_32;
-		if (ctx->hidepid < HIDEPID_OFF ||
-		    ctx->hidepid > HIDEPID_INVISIBLE)
-			return invalfc(fc, "hidepid value must be between 0 and 2.\n");
 		break;
 
 	default:
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 1b98a41fdd8a..5bdc117ae947 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -47,6 +47,7 @@ enum {
 	HIDEPID_OFF	  = 0,
 	HIDEPID_NO_ACCESS = 1,
 	HIDEPID_INVISIBLE = 2,
+	HIDEPID_NOT_PTRACEABLE = 4, /* Limit pids to only ptraceable pids */
 };
 
 struct proc_fs_info {
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v12 4/7] proc: add option to mount only a pids subset
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (2 preceding siblings ...)
  2020-04-19 14:10 ` [PATCH v12 3/7] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 5/7] docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior Alexey Gladkov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

This allows to hide all files and directories in the procfs that are not
related to tasks.

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/proc/generic.c       |  9 +++++++++
 fs/proc/inode.c         |  6 ++++++
 fs/proc/root.c          | 33 +++++++++++++++++++++++++++++++++
 include/linux/proc_fs.h |  7 +++++++
 4 files changed, 55 insertions(+)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 4ed6dabdf6ff..2f9fa179194d 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -269,6 +269,11 @@ struct dentry *proc_lookup_de(struct inode *dir, struct dentry *dentry,
 struct dentry *proc_lookup(struct inode *dir, struct dentry *dentry,
 		unsigned int flags)
 {
+	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
+
+	if (fs_info->pidonly == PROC_PIDONLY_ON)
+		return ERR_PTR(-ENOENT);
+
 	return proc_lookup_de(dir, dentry, PDE(dir));
 }
 
@@ -325,6 +330,10 @@ int proc_readdir_de(struct file *file, struct dir_context *ctx,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct inode *inode = file_inode(file);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
+
+	if (fs_info->pidonly == PROC_PIDONLY_ON)
+		return 1;
 
 	return proc_readdir_de(file, ctx, PDE(inode));
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 9c756531282a..0d5e68fa842f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -173,6 +173,8 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
 	if (fs_info->hide_pid != HIDEPID_OFF)
 		seq_printf(seq, ",hidepid=%u", fs_info->hide_pid);
+	if (fs_info->pidonly != PROC_PIDONLY_OFF)
+		seq_printf(seq, ",subset=pid");
 
 	return 0;
 }
@@ -463,6 +465,7 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr,
 
 static int proc_reg_open(struct inode *inode, struct file *file)
 {
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct proc_dir_entry *pde = PDE(inode);
 	int rv = 0;
 	typeof_member(struct proc_ops, proc_open) open;
@@ -476,6 +479,9 @@ static int proc_reg_open(struct inode *inode, struct file *file)
 		return rv;
 	}
 
+	if (fs_info->pidonly == PROC_PIDONLY_ON)
+		return -ENOENT;
+
 	/*
 	 * Ensure that
 	 * 1) PDE's ->release hook will be called no matter what
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 8f23b951d685..baff006a918f 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -34,16 +34,19 @@ struct proc_fs_context {
 	unsigned int		mask;
 	int			hidepid;
 	int			gid;
+	int			pidonly;
 };
 
 enum proc_param {
 	Opt_gid,
 	Opt_hidepid,
+	Opt_subset,
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
 	fsparam_u32("gid",	Opt_gid),
 	fsparam_u32("hidepid",	Opt_hidepid),
+	fsparam_string("subset",	Opt_subset),
 	{}
 };
 
@@ -55,6 +58,29 @@ static inline int valid_hidepid(unsigned int value)
 		value == HIDEPID_NOT_PTRACEABLE);
 }
 
+static int proc_parse_subset_param(struct fs_context *fc, char *value)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+
+	while (value) {
+		char *ptr = strchr(value, ',');
+
+		if (ptr != NULL)
+			*ptr++ = '\0';
+
+		if (*value != '\0') {
+			if (!strcmp(value, "pid")) {
+				ctx->pidonly = PROC_PIDONLY_ON;
+			} else {
+				return invalf(fc, "proc: unsupported subset option - %s\n", value);
+			}
+		}
+		value = ptr;
+	}
+
+	return 0;
+}
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -76,6 +102,11 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		ctx->hidepid = result.uint_32;
 		break;
 
+	case Opt_subset:
+		if (proc_parse_subset_param(fc, param->string) < 0)
+			return -EINVAL;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -94,6 +125,8 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
+	if (ctx->mask & (1 << Opt_subset))
+		fs_info->pidonly = ctx->pidonly;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 5bdc117ae947..8bc31ba5cd9c 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -50,12 +50,19 @@ enum {
 	HIDEPID_NOT_PTRACEABLE = 4, /* Limit pids to only ptraceable pids */
 };
 
+/* definitions for proc mount option pidonly */
+enum {
+	PROC_PIDONLY_OFF = 0,
+	PROC_PIDONLY_ON  = 1,
+};
+
 struct proc_fs_info {
 	struct pid_namespace *pid_ns;
 	struct dentry *proc_self;        /* For /proc/self */
 	struct dentry *proc_thread_self; /* For /proc/thread-self */
 	kgid_t pid_gid;
 	int hide_pid;
+	int pidonly;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v12 5/7] docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (3 preceding siblings ...)
  2020-04-19 14:10 ` [PATCH v12 4/7] proc: add option to mount only a pids subset Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 6/7] proc: use human-readable values for hidepid Alexey Gladkov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 Documentation/filesystems/proc.rst | 54 ++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 38b606991065..360486c7a992 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -51,6 +51,8 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
   4	Configuring procfs
   4.1	Mount options
 
+  5	Filesystem behavior
+
 Preface
 =======
 
@@ -2142,6 +2144,7 @@ The following mount options are supported:
 	=========	========================================================
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
+	subset=		Show only the specified subset of procfs.
 	=========	========================================================
 
 hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
@@ -2164,6 +2167,57 @@ information about running processes, whether some daemon runs with elevated
 privileges, whether other user runs some sensitive program, whether other users
 run any program at all, etc.
 
+hidepid=4 means that procfs should only contain /proc/<pid>/ directories
+that the caller can ptrace.
+
 gid= defines a group authorized to learn processes information otherwise
 prohibited by hidepid=.  If you use some daemon like identd which needs to learn
 information about processes information, just add identd to this group.
+
+subset=pid hides all top level files and directories in the procfs that
+are not related to tasks.
+
+5	Filesystem behavior
+----------------------------
+
+Originally, before the advent of pid namepsace, procfs was a global file
+system. It means that there was only one procfs instance in the system.
+
+When pid namespace was added, a separate procfs instance was mounted in
+each pid namespace. So, procfs mount options are global among all
+mountpoints within the same namespace.
+
+::
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+
+# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
+mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
++++ exited with 0 +++
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+proc /tmp/proc proc rw,relatime,hidepid=2 0 0
+
+and only after remounting procfs mount options will change at all
+mountpoints.
+
+# mount -o remount,hidepid=1 -t proc proc /tmp/proc
+
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=1 0 0
+proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+
+This behavior is different from the behavior of other filesystems.
+
+The new procfs behavior is more like other filesystems. Each procfs mount
+creates a new procfs instance. Mount options affect own procfs instance.
+It means that it became possible to have several procfs instances
+displaying tasks with different filtering options in one pid namespace.
+
+# mount -o hidepid=2 -t proc proc /proc
+# mount -o hidepid=1 -t proc proc /tmp/proc
+# grep ^proc /proc/mounts
+proc /proc proc rw,relatime,hidepid=2 0 0
+proc /tmp/proc proc rw,relatime,hidepid=1 0 0
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v12 6/7] proc: use human-readable values for hidepid
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (4 preceding siblings ...)
  2020-04-19 14:10 ` [PATCH v12 5/7] docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
  2020-04-19 14:10 ` [PATCH v12 7/7] proc: use named enums for better readability Alexey Gladkov
       [not found] ` <87ftcv1nqe.fsf@x220.int.ebiederm.org>
  7 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

The hidepid parameter values are becoming more and more and it becomes
difficult to remember what each new magic number means.

Backward compatibility is preserved since it is possible to specify
numerical value for the hidepid parameter. This does not break the
fsconfig since it is not possible to specify a numerical value through
it. All numeric values are converted to a string. The type
FSCONFIG_SET_BINARY cannot be used to indicate a numerical value.

Selftest has been added to verify this behavior.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 Documentation/filesystems/proc.rst            | 52 +++++++++----------
 fs/proc/inode.c                               | 15 +++++-
 fs/proc/root.c                                | 38 ++++++++++++--
 tools/testing/selftests/proc/.gitignore       |  1 +
 tools/testing/selftests/proc/Makefile         |  1 +
 .../selftests/proc/proc-fsconfig-hidepid.c    | 50 ++++++++++++++++++
 6 files changed, 126 insertions(+), 31 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-fsconfig-hidepid.c

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 360486c7a992..e2ecf248feb5 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2147,28 +2147,28 @@ The following mount options are supported:
 	subset=		Show only the specified subset of procfs.
 	=========	========================================================
 
-hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories
-(default).
-
-hidepid=1 means users may not access any /proc/<pid>/ directories but their
-own.  Sensitive files like cmdline, sched*, status are now protected against
-other users.  This makes it impossible to learn whether any user runs
-specific program (given the program doesn't reveal itself by its behaviour).
-As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users,
-poorly written programs passing sensitive information via program arguments are
-now protected against local eavesdroppers.
-
-hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other
-users.  It doesn't mean that it hides a fact whether a process with a specific
-pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"),
-but it hides process' uid and gid, which may be learned by stat()'ing
-/proc/<pid>/ otherwise.  It greatly complicates an intruder's task of gathering
-information about running processes, whether some daemon runs with elevated
-privileges, whether other user runs some sensitive program, whether other users
-run any program at all, etc.
-
-hidepid=4 means that procfs should only contain /proc/<pid>/ directories
-that the caller can ptrace.
+hidepid=off or hidepid=0 means classic mode - everybody may access all
+/proc/<pid>/ directories (default).
+
+hidepid=noaccess or hidepid=1 means users may not access any /proc/<pid>/
+directories but their own.  Sensitive files like cmdline, sched*, status are now
+protected against other users.  This makes it impossible to learn whether any
+user runs specific program (given the program doesn't reveal itself by its
+behaviour).  As an additional bonus, as /proc/<pid>/cmdline is unaccessible for
+other users, poorly written programs passing sensitive information via program
+arguments are now protected against local eavesdroppers.
+
+hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
+fully invisible to other users.  It doesn't mean that it hides a fact whether a
+process with a specific pid value exists (it can be learned by other means, e.g.
+by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+stat()'ing /proc/<pid>/ otherwise.  It greatly complicates an intruder's task of
+gathering information about running processes, whether some daemon runs with
+elevated privileges, whether other user runs some sensitive program, whether
+other users run any program at all, etc.
+
+hidepid=ptraceable or hidepid=4 means that procfs should only contain
+/proc/<pid>/ directories that the caller can ptrace.
 
 gid= defines a group authorized to learn processes information otherwise
 prohibited by hidepid=.  If you use some daemon like identd which needs to learn
@@ -2216,8 +2216,8 @@ creates a new procfs instance. Mount options affect own procfs instance.
 It means that it became possible to have several procfs instances
 displaying tasks with different filtering options in one pid namespace.
 
-# mount -o hidepid=2 -t proc proc /proc
-# mount -o hidepid=1 -t proc proc /tmp/proc
+# mount -o hidepid=invisible -t proc proc /proc
+# mount -o hidepid=noaccess -t proc proc /tmp/proc
 # grep ^proc /proc/mounts
-proc /proc proc rw,relatime,hidepid=2 0 0
-proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+proc /proc proc rw,relatime,hidepid=invisible 0 0
+proc /tmp/proc proc rw,relatime,hidepid=noaccess 0 0
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 0d5e68fa842f..cbacac2e892b 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -24,6 +24,7 @@
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/mount.h>
+#include <linux/bug.h>
 
 #include <linux/uaccess.h>
 
@@ -165,6 +166,18 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
 		deactivate_super(old_sb);
 }
 
+static inline const char *hidepid2str(int v)
+{
+	switch (v) {
+		case HIDEPID_OFF: return "off";
+		case HIDEPID_NO_ACCESS: return "noaccess";
+		case HIDEPID_INVISIBLE: return "invisible";
+		case HIDEPID_NOT_PTRACEABLE: return "ptraceable";
+	}
+	WARN_ONCE(1, "bad hide_pid value: %d\n", v);
+	return "unknown";
+}
+
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
@@ -172,7 +185,7 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root)
 	if (!gid_eq(fs_info->pid_gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
 	if (fs_info->hide_pid != HIDEPID_OFF)
-		seq_printf(seq, ",hidepid=%u", fs_info->hide_pid);
+		seq_printf(seq, ",hidepid=%s", hidepid2str(fs_info->hide_pid));
 	if (fs_info->pidonly != PROC_PIDONLY_OFF)
 		seq_printf(seq, ",subset=pid");
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index baff006a918f..288093261b7f 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -45,7 +45,7 @@ enum proc_param {
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
 	fsparam_u32("gid",	Opt_gid),
-	fsparam_u32("hidepid",	Opt_hidepid),
+	fsparam_string("hidepid",	Opt_hidepid),
 	fsparam_string("subset",	Opt_subset),
 	{}
 };
@@ -58,6 +58,37 @@ static inline int valid_hidepid(unsigned int value)
 		value == HIDEPID_NOT_PTRACEABLE);
 }
 
+static int proc_parse_hidepid_param(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct fs_parameter_spec hidepid_u32_spec = fsparam_u32("hidepid", Opt_hidepid);
+	struct fs_parse_result result;
+	int base = (unsigned long)hidepid_u32_spec.data;
+
+	if (param->type != fs_value_is_string)
+		return invalf(fc, "proc: unexpected type of hidepid value\n");
+
+	if (!kstrtouint(param->string, base, &result.uint_32)) {
+		if (!valid_hidepid(result.uint_32))
+			return invalf(fc, "proc: unknown value of hidepid - %s\n", param->string);
+		ctx->hidepid = result.uint_32;
+		return 0;
+	}
+
+	if (!strcmp(param->string, "off"))
+		ctx->hidepid = HIDEPID_OFF;
+	else if (!strcmp(param->string, "noaccess"))
+		ctx->hidepid = HIDEPID_NO_ACCESS;
+	else if (!strcmp(param->string, "invisible"))
+		ctx->hidepid = HIDEPID_INVISIBLE;
+	else if (!strcmp(param->string, "ptraceable"))
+		ctx->hidepid = HIDEPID_NOT_PTRACEABLE;
+	else
+		return invalf(fc, "proc: unknown value of hidepid - %s\n", param->string);
+
+	return 0;
+}
+
 static int proc_parse_subset_param(struct fs_context *fc, char *value)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
@@ -97,9 +128,8 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		break;
 
 	case Opt_hidepid:
-		if (!valid_hidepid(result.uint_32))
-			return invalf(fc, "proc: unknown value of hidepid.\n");
-		ctx->hidepid = result.uint_32;
+		if (proc_parse_hidepid_param(fc, param))
+			return -EINVAL;
 		break;
 
 	case Opt_subset:
diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 126901c5d6f4..bed4b5318a86 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -2,6 +2,7 @@
 /fd-001-lookup
 /fd-002-posix-eq
 /fd-003-kthread
+/proc-fsconfig-hidepid
 /proc-loadavg-001
 /proc-multiple-procfs
 /proc-pid-vm
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index bf22457256be..8be8a03d2973 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -20,5 +20,6 @@ TEST_GEN_PROGS += setns-dcache
 TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
+TEST_GEN_PROGS += proc-fsconfig-hidepid
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-fsconfig-hidepid.c b/tools/testing/selftests/proc/proc-fsconfig-hidepid.c
new file mode 100644
index 000000000000..b9af8f537185
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-fsconfig-hidepid.c
@@ -0,0 +1,50 @@
+/*
+ * Copyright © 2020 Alexey Gladkov <gladkov.alexey@gmail.com>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+#include <assert.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <linux/mount.h>
+#include <linux/unistd.h>
+
+static inline int fsopen(const char *fsname, unsigned int flags)
+{
+	return syscall(__NR_fsopen, fsname, flags);
+}
+
+static inline int fsconfig(int fd, unsigned int cmd, const char *key, const void *val, int aux)
+{
+	return syscall(__NR_fsconfig, fd, cmd, key, val, aux);
+}
+
+int main(void)
+{
+	int fsfd, ret;
+	int hidepid = 2;
+
+	assert((fsfd = fsopen("proc", 0)) != -1);
+
+	ret = fsconfig(fsfd, FSCONFIG_SET_BINARY, "hidepid", &hidepid, 0);
+	assert(ret == -1);
+	assert(errno == EINVAL);
+
+	assert(!fsconfig(fsfd, FSCONFIG_SET_STRING, "hidepid", "2", 0));
+	assert(!fsconfig(fsfd, FSCONFIG_SET_STRING, "hidepid", "invisible", 0));
+
+	assert(!close(fsfd));
+
+	return 0;
+}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v12 7/7] proc: use named enums for better readability
  2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
                   ` (5 preceding siblings ...)
  2020-04-19 14:10 ` [PATCH v12 6/7] proc: use human-readable values for hidepid Alexey Gladkov
@ 2020-04-19 14:10 ` Alexey Gladkov
       [not found] ` <87ftcv1nqe.fsf@x220.int.ebiederm.org>
  7 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-19 14:10 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/proc/base.c          | 2 +-
 fs/proc/inode.c         | 2 +-
 fs/proc/root.c          | 4 ++--
 include/linux/proc_fs.h | 8 ++++----
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index a52a91e90c25..2868bff1a142 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -699,7 +699,7 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  */
 static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
-				 int hide_pid_min)
+				 enum proc_hidepid hide_pid_min)
 {
 	/*
 	 * If 'hidpid' mount option is set force a ptrace check,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index cbacac2e892b..f40c2532c057 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -166,7 +166,7 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
 		deactivate_super(old_sb);
 }
 
-static inline const char *hidepid2str(int v)
+static inline const char *hidepid2str(enum proc_hidepid v)
 {
 	switch (v) {
 		case HIDEPID_OFF: return "off";
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 288093261b7f..ffebed1999e5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -32,9 +32,9 @@
 struct proc_fs_context {
 	struct pid_namespace	*pid_ns;
 	unsigned int		mask;
-	int			hidepid;
+	enum proc_hidepid	hidepid;
 	int			gid;
-	int			pidonly;
+	enum proc_pidonly	pidonly;
 };
 
 enum proc_param {
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 8bc31ba5cd9c..2cb424e6f36a 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -43,7 +43,7 @@ struct proc_ops {
 } __randomize_layout;
 
 /* definitions for hide_pid field */
-enum {
+enum proc_hidepid {
 	HIDEPID_OFF	  = 0,
 	HIDEPID_NO_ACCESS = 1,
 	HIDEPID_INVISIBLE = 2,
@@ -51,7 +51,7 @@ enum {
 };
 
 /* definitions for proc mount option pidonly */
-enum {
+enum proc_pidonly {
 	PROC_PIDONLY_OFF = 0,
 	PROC_PIDONLY_ON  = 1,
 };
@@ -61,8 +61,8 @@ struct proc_fs_info {
 	struct dentry *proc_self;        /* For /proc/self */
 	struct dentry *proc_thread_self; /* For /proc/thread-self */
 	kgid_t pid_gid;
-	int hide_pid;
-	int pidonly;
+	enum proc_hidepid hide_pid;
+	enum proc_pidonly pidonly;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v13 2/7] proc: allow to mount many instances of proc in one pid namespace
  2020-04-19 14:10 ` [PATCH v12 2/7] proc: allow to mount many instances of proc in one pid namespace Alexey Gladkov
@ 2020-04-23 11:28   ` Alexey Gladkov
  2020-04-23 12:16     ` Eric W. Biederman
  0 siblings, 1 reply; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-23 11:28 UTC (permalink / raw)
  To: LKML
  Cc: Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Eric W . Biederman, Greg Kroah-Hartman, Ingo Molnar,
	J . Bruce Fields, Jeff Layton, Jonathan Corbet, Kees Cook,
	Linus Torvalds, Oleg Nesterov, David Howells

This patch allows to have multiple procfs instances inside the
same pid namespace. The aim here is lightweight sandboxes, and to allow
that we have to modernize procfs internals.

1) The main aim of this work is to have on embedded systems one
supervisor for apps. Right now we have some lightweight sandbox support,
however if we create pid namespacess we have to manages all the
processes inside too, where our goal is to be able to run a bunch of
apps each one inside its own mount namespace without being able to
notice each other. We only want to use mount namespaces, and we want
procfs to behave more like a real mount point.

2) Linux Security Modules have multiple ptrace paths inside some
subsystems, however inside procfs, the implementation does not guarantee
that the ptrace() check which triggers the security_ptrace_check() hook
will always run. We have the 'hidepid' mount option that can be used to
force the ptrace_may_access() check inside has_pid_permissions() to run.
The problem is that 'hidepid' is per pid namespace and not attached to
the mount point, any remount or modification of 'hidepid' will propagate
to all other procfs mounts.

This also does not allow to support Yama LSM easily in desktop and user
sessions. Yama ptrace scope which restricts ptrace and some other
syscalls to be allowed only on inferiors, can be updated to have a
per-task context, where the context will be inherited during fork(),
clone() and preserved across execve(). If we support multiple private
procfs instances, then we may force the ptrace_may_access() on
/proc/<pids>/ to always run inside that new procfs instances. This will
allow to specifiy on user sessions if we should populate procfs with
pids that the user can ptrace or not.

By using Yama ptrace scope, some restricted users will only be able to see
inferiors inside /proc, they won't even be able to see their other
processes. Some software like Chromium, Firefox's crash handler, Wine
and others are already using Yama to restrict which processes can be
ptracable. With this change this will give the possibility to restrict
/proc/<pids>/ but more importantly this will give desktop users a
generic and usuable way to specifiy which users should see all processes
and which users can not.

Side notes:
* This covers the lack of seccomp where it is not able to parse
arguments, it is easy to install a seccomp filter on direct syscalls
that operate on pids, however /proc/<pid>/ is a Linux ABI using
filesystem syscalls. With this change LSMs should be able to analyze
open/read/write/close...

In the new patch set version I removed the 'newinstance' option
as suggested by Eric W. Biederman.

Selftest has been added to verify new behavior.

Fixed getting proc_pidns in the lock_get_status() and locks_show() directly from
the superblock, which caused a crash:

=== arm64 ===
[12140.366814] LTP: starting proc01 (proc01 -m 128)
[12149.580943] ==================================================================
[12149.589521] BUG: KASAN: out-of-bounds in pid_nr_ns+0x2c/0x90 pid_nr_ns at kernel/pid.c:456
[12149.595939] Read of size 4 at addr 1bff000bfa8c0388 by task = proc01/50298
[12149.603392] Pointer tag: [1b], memory tag: [fe]

[12149.610906] CPU: 69 PID: 50298 Comm: proc01 Tainted: G L 5.7.0-rc2-next-20200422 #6
[12149.620585] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
[12149.631074] Call trace:
[12149.634304]  dump_backtrace+0x0/0x22c
[12149.638745]  show_stack+0x28/0x34
[12149.642839]  dump_stack+0x104/0x194
[12149.647110]  print_address_description+0x70/0x3a4
[12149.652576]  __kasan_report+0x188/0x238
[12149.657169]  kasan_report+0x3c/0x58
[12149.661430]  check_memory_region+0x98/0xa0
[12149.666303]  __hwasan_load4_noabort+0x18/0x20
[12149.671431]  pid_nr_ns+0x2c/0x90
[12149.675446]  locks_translate_pid+0xf4/0x1a0
[12149.680382]  locks_show+0x68/0x110
[12149.684536]  seq_read+0x380/0x930
[12149.688604]  pde_read+0x5c/0x78
[12149.692498]  proc_reg_read+0x74/0xc0
[12149.696813]  __vfs_read+0x84/0x1d0
[12149.700939]  vfs_read+0xec/0x124
[12149.704889]  ksys_read+0xb0/0x120
[12149.708927]  __arm64_sys_read+0x54/0x88
[12149.713485]  do_el0_svc+0x128/0x1dc
[12149.717697]  el0_sync_handler+0x150/0x250
[12149.722428]  el0_sync+0x164/0x180

[12149.728672] Allocated by task 1:
[12149.732624]  __kasan_kmalloc+0x124/0x188
[12149.737269]  kasan_kmalloc+0x10/0x18
[12149.741568]  kmem_cache_alloc_trace+0x2e4/0x3d4
[12149.746820]  proc_fill_super+0x48/0x1fc
[12149.751377]  vfs_get_super+0xcc/0x170
[12149.755760]  get_tree_nodev+0x28/0x34
[12149.760143]  proc_get_tree+0x24/0x30
[12149.764439]  vfs_get_tree+0x54/0x158
[12149.768736]  do_mount+0x80c/0xaf0
[12149.772774]  __arm64_sys_mount+0xe0/0x18c
[12149.777504]  do_el0_svc+0x128/0x1dc
[12149.781715]  el0_sync_handler+0x150/0x250
[12149.786445]  el0_sync+0x164/0x180

Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/locks.c                                    |  4 +-
 fs/proc/base.c                                | 31 +++++++-----
 fs/proc/inode.c                               | 11 ++---
 fs/proc/root.c                                | 49 +++++++++----------
 fs/proc/self.c                                |  6 +--
 fs/proc/thread_self.c                         |  6 +--
 include/linux/pid_namespace.h                 | 12 -----
 include/linux/proc_fs.h                       | 22 ++++++++-
 tools/testing/selftests/proc/.gitignore       |  1 +
 tools/testing/selftests/proc/Makefile         |  1 +
 .../selftests/proc/proc-multiple-procfs.c     | 48 ++++++++++++++++++
 11 files changed, 126 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-multiple-procfs.c

diff --git a/fs/locks.c b/fs/locks.c
index b8a31c1c4fff..399c5dbb72c4 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2823,7 +2823,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 {
 	struct inode *inode = NULL;
 	unsigned int fl_pid;
-	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file));
 
 	fl_pid = locks_translate_pid(fl, proc_pidns);
 	/*
@@ -2901,7 +2901,7 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
-	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file));
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6042b646ab27..93b5d05c142c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -697,13 +697,13 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr)
  * May current process learn task's sched/cmdline info (for hide_pid_min=1)
  * or euid/egid (for hide_pid_min=2)?
  */
-static bool has_pid_permissions(struct pid_namespace *pid,
+static bool has_pid_permissions(struct proc_fs_info *fs_info,
 				 struct task_struct *task,
 				 int hide_pid_min)
 {
-	if (pid->hide_pid < hide_pid_min)
+	if (fs_info->hide_pid < hide_pid_min)
 		return true;
-	if (in_group_p(pid->pid_gid))
+	if (in_group_p(fs_info->pid_gid))
 		return true;
 	return ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
 }
@@ -711,18 +711,18 @@ static bool has_pid_permissions(struct pid_namespace *pid,
 
 static int proc_pid_permission(struct inode *inode, int mask)
 {
-	struct pid_namespace *pid = proc_pid_ns(inode);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct task_struct *task;
 	bool has_perms;
 
 	task = get_proc_task(inode);
 	if (!task)
 		return -ESRCH;
-	has_perms = has_pid_permissions(pid, task, HIDEPID_NO_ACCESS);
+	has_perms = has_pid_permissions(fs_info, task, HIDEPID_NO_ACCESS);
 	put_task_struct(task);
 
 	if (!has_perms) {
-		if (pid->hide_pid == HIDEPID_INVISIBLE) {
+		if (fs_info->hide_pid == HIDEPID_INVISIBLE) {
 			/*
 			 * Let's make getdents(), stat(), and open()
 			 * consistent with each other.  If a process
@@ -1897,7 +1897,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 		u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_inode(path->dentry);
-	struct pid_namespace *pid = proc_pid_ns(inode);
+	struct proc_fs_info *fs_info = proc_sb_info(inode->i_sb);
 	struct task_struct *task;
 
 	generic_fillattr(inode, stat);
@@ -1907,7 +1907,7 @@ int pid_getattr(const struct path *path, struct kstat *stat,
 	rcu_read_lock();
 	task = pid_task(proc_pid(inode), PIDTYPE_PID);
 	if (task) {
-		if (!has_pid_permissions(pid, task, HIDEPID_INVISIBLE)) {
+		if (!has_pid_permissions(fs_info, task, HIDEPID_INVISIBLE)) {
 			rcu_read_unlock();
 			/*
 			 * This doesn't prevent learning whether PID exists,
@@ -3301,6 +3301,7 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 {
 	struct task_struct *task;
 	unsigned tgid;
+	struct proc_fs_info *fs_info;
 	struct pid_namespace *ns;
 	struct dentry *result = ERR_PTR(-ENOENT);
 
@@ -3308,7 +3309,8 @@ struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
 	if (tgid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tgid, ns);
 	if (task)
@@ -3372,6 +3374,7 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 {
 	struct tgid_iter iter;
+	struct proc_fs_info *fs_info = proc_sb_info(file_inode(file)->i_sb);
 	struct pid_namespace *ns = proc_pid_ns(file_inode(file));
 	loff_t pos = ctx->pos;
 
@@ -3379,13 +3382,13 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		return 0;
 
 	if (pos == TGID_OFFSET - 2) {
-		struct inode *inode = d_inode(ns->proc_self);
+		struct inode *inode = d_inode(fs_info->proc_self);
 		if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
 			return 0;
 		ctx->pos = pos = pos + 1;
 	}
 	if (pos == TGID_OFFSET - 1) {
-		struct inode *inode = d_inode(ns->proc_thread_self);
+		struct inode *inode = d_inode(fs_info->proc_thread_self);
 		if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
 			return 0;
 		ctx->pos = pos = pos + 1;
@@ -3399,7 +3402,7 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 		unsigned int len;
 
 		cond_resched();
-		if (!has_pid_permissions(ns, iter.task, HIDEPID_INVISIBLE))
+		if (!has_pid_permissions(fs_info, iter.task, HIDEPID_INVISIBLE))
 			continue;
 
 		len = snprintf(name, sizeof(name), "%u", iter.tgid);
@@ -3599,6 +3602,7 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	struct task_struct *task;
 	struct task_struct *leader = get_proc_task(dir);
 	unsigned tid;
+	struct proc_fs_info *fs_info;
 	struct pid_namespace *ns;
 	struct dentry *result = ERR_PTR(-ENOENT);
 
@@ -3609,7 +3613,8 @@ static struct dentry *proc_task_lookup(struct inode *dir, struct dentry * dentry
 	if (tid == ~0U)
 		goto out;
 
-	ns = dentry->d_sb->s_fs_info;
+	fs_info = proc_sb_info(dentry->d_sb);
+	ns = fs_info->pid_ns;
 	rcu_read_lock();
 	task = find_task_by_pid_ns(tid, ns);
 	if (task)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index fb4cace9ea41..9c756531282a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -167,13 +167,12 @@ void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock
 
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
-	struct super_block *sb = root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(root->d_sb);
 
-	if (!gid_eq(pid->pid_gid, GLOBAL_ROOT_GID))
-		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, pid->pid_gid));
-	if (pid->hide_pid != HIDEPID_OFF)
-		seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+	if (!gid_eq(fs_info->pid_gid, GLOBAL_ROOT_GID))
+		seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, fs_info->pid_gid));
+	if (fs_info->hide_pid != HIDEPID_OFF)
+		seq_printf(seq, ",hidepid=%u", fs_info->hide_pid);
 
 	return 0;
 }
diff --git a/fs/proc/root.c b/fs/proc/root.c
index cdbe9293ea55..208989274923 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -77,26 +77,31 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct super_block *s,
+static void proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
-			       struct pid_namespace *pid_ns,
 			       struct user_namespace *user_ns)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 
 	if (ctx->mask & (1 << Opt_gid))
-		pid_ns->pid_gid = make_kgid(user_ns, ctx->gid);
+		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
-		pid_ns->hide_pid = ctx->hidepid;
+		fs_info->hide_pid = ctx->hidepid;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 {
-	struct pid_namespace *pid_ns = get_pid_ns(s->s_fs_info);
+	struct proc_fs_context *ctx = fc->fs_private;
 	struct inode *root_inode;
+	struct proc_fs_info *fs_info;
 	int ret;
 
-	proc_apply_options(s, fc, pid_ns, current_user_ns());
+	fs_info = kzalloc(sizeof(*fs_info), GFP_KERNEL);
+	if (!fs_info)
+		return -ENOMEM;
+
+	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	proc_apply_options(fs_info, fc, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -106,6 +111,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_magic = PROC_SUPER_MAGIC;
 	s->s_op = &proc_sops;
 	s->s_time_gran = 1;
+	s->s_fs_info = fs_info;
 
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
@@ -113,7 +119,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	 * top of it
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
-	
+
 	/* procfs dentries and inodes don't require IO to create */
 	s->s_shrink.seeks = 0;
 
@@ -140,19 +146,17 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
 	struct super_block *sb = fc->root->d_sb;
-	struct pid_namespace *pid = sb->s_fs_info;
+	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
 	sync_filesystem(sb);
 
-	proc_apply_options(sb, fc, pid, current_user_ns());
+	proc_apply_options(fs_info, fc, current_user_ns());
 	return 0;
 }
 
 static int proc_get_tree(struct fs_context *fc)
 {
-	struct proc_fs_context *ctx = fc->fs_private;
-
-	return get_tree_keyed(fc, proc_fill_super, ctx->pid_ns);
+	return get_tree_nodev(fc, proc_fill_super);
 }
 
 static void proc_fs_context_free(struct fs_context *fc)
@@ -188,22 +192,17 @@ static int proc_init_fs_context(struct fs_context *fc)
 
 static void proc_kill_sb(struct super_block *sb)
 {
-	struct pid_namespace *ns;
+	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
-	ns = (struct pid_namespace *)sb->s_fs_info;
-	if (ns->proc_self)
-		dput(ns->proc_self);
-	if (ns->proc_thread_self)
-		dput(ns->proc_thread_self);
-	kill_anon_super(sb);
+	if (fs_info->proc_self)
+		dput(fs_info->proc_self);
 
-	/* Make the pid namespace safe for the next mount of proc */
-	ns->proc_self = NULL;
-	ns->proc_thread_self = NULL;
-	ns->pid_gid = GLOBAL_ROOT_GID;
-	ns->hide_pid = 0;
+	if (fs_info->proc_thread_self)
+		dput(fs_info->proc_thread_self);
 
-	put_pid_ns(ns);
+	kill_anon_super(sb);
+	put_pid_ns(fs_info->pid_ns);
+	kfree(fs_info);
 }
 
 static struct file_system_type proc_fs_type = {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index 57c0a1047250..309301ac0136 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -36,10 +36,10 @@ static unsigned self_inum __ro_after_init;
 int proc_setup_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = proc_pid_ns(root_inode);
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 	struct dentry *self;
 	int ret = -ENOMEM;
-	
+
 	inode_lock(root_inode);
 	self = d_alloc_name(s->s_root, "self");
 	if (self) {
@@ -62,7 +62,7 @@ int proc_setup_self(struct super_block *s)
 	if (ret)
 		pr_err("proc_fill_super: can't allocate /proc/self\n");
 	else
-		ns->proc_self = self;
+		fs_info->proc_self = self;
 
 	return ret;
 }
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index f61ae53533f5..2493cbbdfa6f 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -36,7 +36,7 @@ static unsigned thread_self_inum __ro_after_init;
 int proc_setup_thread_self(struct super_block *s)
 {
 	struct inode *root_inode = d_inode(s->s_root);
-	struct pid_namespace *ns = proc_pid_ns(root_inode);
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 	struct dentry *thread_self;
 	int ret = -ENOMEM;
 
@@ -60,9 +60,9 @@ int proc_setup_thread_self(struct super_block *s)
 	inode_unlock(root_inode);
 
 	if (ret)
-		pr_err("proc_fill_super: can't allocate /proc/thread_self\n");
+		pr_err("proc_fill_super: can't allocate /proc/thread-self\n");
 	else
-		ns->proc_thread_self = thread_self;
+		fs_info->proc_thread_self = thread_self;
 
 	return ret;
 }
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 4956e362e55e..5a5cb45ac57e 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -17,12 +17,6 @@
 
 struct fs_pin;
 
-enum { /* definitions for pid_namespace's hide_pid field */
-	HIDEPID_OFF	  = 0,
-	HIDEPID_NO_ACCESS = 1,
-	HIDEPID_INVISIBLE = 2,
-};
-
 struct pid_namespace {
 	struct kref kref;
 	struct idr idr;
@@ -32,17 +26,11 @@ struct pid_namespace {
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
 	struct pid_namespace *parent;
-#ifdef CONFIG_PROC_FS
-	struct dentry *proc_self;
-	struct dentry *proc_thread_self;
-#endif
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct fs_pin *bacct;
 #endif
 	struct user_namespace *user_ns;
 	struct ucounts *ucounts;
-	kgid_t pid_gid;
-	int hide_pid;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 } __randomize_layout;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 45c05fd9c99d..1b98a41fdd8a 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -42,6 +42,26 @@ struct proc_ops {
 	unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 } __randomize_layout;
 
+/* definitions for hide_pid field */
+enum {
+	HIDEPID_OFF	  = 0,
+	HIDEPID_NO_ACCESS = 1,
+	HIDEPID_INVISIBLE = 2,
+};
+
+struct proc_fs_info {
+	struct pid_namespace *pid_ns;
+	struct dentry *proc_self;        /* For /proc/self */
+	struct dentry *proc_thread_self; /* For /proc/thread-self */
+	kgid_t pid_gid;
+	int hide_pid;
+};
+
+static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
 #ifdef CONFIG_PROC_FS
 
 typedef int (*proc_write_t)(struct file *, char *, size_t);
@@ -176,7 +196,7 @@ int open_related_ns(struct ns_common *ns,
 /* get the associated pid namespace for a file in procfs */
 static inline struct pid_namespace *proc_pid_ns(const struct inode *inode)
 {
-	return inode->i_sb->s_fs_info;
+	return proc_sb_info(inode->i_sb)->pid_ns;
 }
 
 #endif /* _LINUX_PROC_FS_H */
diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 4bca5a9327a4..126901c5d6f4 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -3,6 +3,7 @@
 /fd-002-posix-eq
 /fd-003-kthread
 /proc-loadavg-001
+/proc-multiple-procfs
 /proc-pid-vm
 /proc-self-map-files-001
 /proc-self-map-files-002
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index a8ed0f684829..bf22457256be 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -19,5 +19,6 @@ TEST_GEN_PROGS += self
 TEST_GEN_PROGS += setns-dcache
 TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
+TEST_GEN_PROGS += proc-multiple-procfs
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-multiple-procfs.c b/tools/testing/selftests/proc/proc-multiple-procfs.c
new file mode 100644
index 000000000000..ab912ad95dab
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-multiple-procfs.c
@@ -0,0 +1,48 @@
+/*
+ * Copyright © 2020 Alexey Gladkov <gladkov.alexey@gmail.com>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+#include <assert.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+int main(void)
+{
+	struct stat proc_st1, proc_st2;
+	char procbuff[] = "/tmp/proc.XXXXXX/meminfo";
+	char procdir1[] = "/tmp/proc.XXXXXX";
+	char procdir2[] = "/tmp/proc.XXXXXX";
+
+	assert(mkdtemp(procdir1) != NULL);
+	assert(mkdtemp(procdir2) != NULL);
+
+	assert(!mount("proc", procdir1, "proc", 0, "hidepid=1"));
+	assert(!mount("proc", procdir2, "proc", 0, "hidepid=2"));
+
+	snprintf(procbuff, sizeof(procbuff), "%s/meminfo", procdir1);
+	assert(!stat(procbuff, &proc_st1));
+
+	snprintf(procbuff, sizeof(procbuff), "%s/meminfo", procdir2);
+	assert(!stat(procbuff, &proc_st2));
+
+	umount(procdir1);
+	umount(procdir2);
+
+	assert(proc_st1.st_dev != proc_st2.st_dev);
+
+	return 0;
+}
-- 
2.25.3


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v13 2/7] proc: allow to mount many instances of proc in one pid namespace
  2020-04-23 11:28   ` [PATCH v13 " Alexey Gladkov
@ 2020-04-23 12:16     ` Eric W. Biederman
  2020-04-23 20:01       ` Alexey Gladkov
  0 siblings, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-23 12:16 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Alexey Gladkov, Andrew Morton, Andy Lutomirski,
	Daniel Micay, Djalal Harouni, Dmitry V . Levin,
	Greg Kroah-Hartman, Ingo Molnar, J . Bruce Fields, Jeff Layton,
	Jonathan Corbet, Kees Cook, Linus Torvalds, Oleg Nesterov,
	David Howells


I took a quick look and there is at least one other use in security/tomoyo/realpath.c:

static char *tomoyo_get_local_path(struct dentry *dentry, char * const buffer,
				   const int buflen)
{
	struct super_block *sb = dentry->d_sb;
	char *pos = tomoyo_get_dentry_path(dentry, buffer, buflen);

	if (IS_ERR(pos))
		return pos;
	/* Convert from $PID to self if $PID is current thread. */
	if (sb->s_magic == PROC_SUPER_MAGIC && *pos == '/') {
		char *ep;
		const pid_t pid = (pid_t) simple_strtoul(pos + 1, &ep, 10);

		if (*ep == '/' && pid && pid ==
		    task_tgid_nr_ns(current, sb->s_fs_info)) {
			pos = ep - 5;
			if (pos < buffer)
				goto out;
			memmove(pos, "/self", 5);
		}
		goto prepend_filesystem_name;
	}

Can you make the fixes to locks.c and tomoyo a couple of standalone
fixes that should be inserted before your patch?

On the odd chance there is a typo they will bisect better, as well
as just keeping this patch and it's description from expanding in size.
So that things are small enough for people to really look at and review.

The fix itself looks fine.

Thank you,
Eric


Alexey Gladkov <gladkov.alexey@gmail.com> writes:

> Fixed getting proc_pidns in the lock_get_status() and locks_show() directly from
> the superblock, which caused a crash:
>
> === arm64 ===
> [12140.366814] LTP: starting proc01 (proc01 -m 128)
> [12149.580943] ==================================================================
> [12149.589521] BUG: KASAN: out-of-bounds in pid_nr_ns+0x2c/0x90 pid_nr_ns at kernel/pid.c:456
> [12149.595939] Read of size 4 at addr 1bff000bfa8c0388 by task = proc01/50298
> [12149.603392] Pointer tag: [1b], memory tag: [fe]
>
> [12149.610906] CPU: 69 PID: 50298 Comm: proc01 Tainted: G L 5.7.0-rc2-next-20200422 #6
> [12149.620585] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
> [12149.631074] Call trace:
> [12149.634304]  dump_backtrace+0x0/0x22c
> [12149.638745]  show_stack+0x28/0x34
> [12149.642839]  dump_stack+0x104/0x194
> [12149.647110]  print_address_description+0x70/0x3a4
> [12149.652576]  __kasan_report+0x188/0x238
> [12149.657169]  kasan_report+0x3c/0x58
> [12149.661430]  check_memory_region+0x98/0xa0
> [12149.666303]  __hwasan_load4_noabort+0x18/0x20
> [12149.671431]  pid_nr_ns+0x2c/0x90
> [12149.675446]  locks_translate_pid+0xf4/0x1a0
> [12149.680382]  locks_show+0x68/0x110
> [12149.684536]  seq_read+0x380/0x930
> [12149.688604]  pde_read+0x5c/0x78
> [12149.692498]  proc_reg_read+0x74/0xc0
> [12149.696813]  __vfs_read+0x84/0x1d0
> [12149.700939]  vfs_read+0xec/0x124
> [12149.704889]  ksys_read+0xb0/0x120
> [12149.708927]  __arm64_sys_read+0x54/0x88
> [12149.713485]  do_el0_svc+0x128/0x1dc
> [12149.717697]  el0_sync_handler+0x150/0x250
> [12149.722428]  el0_sync+0x164/0x180
>
> [12149.728672] Allocated by task 1:
> [12149.732624]  __kasan_kmalloc+0x124/0x188
> [12149.737269]  kasan_kmalloc+0x10/0x18
> [12149.741568]  kmem_cache_alloc_trace+0x2e4/0x3d4
> [12149.746820]  proc_fill_super+0x48/0x1fc
> [12149.751377]  vfs_get_super+0xcc/0x170
> [12149.755760]  get_tree_nodev+0x28/0x34
> [12149.760143]  proc_get_tree+0x24/0x30
> [12149.764439]  vfs_get_tree+0x54/0x158
> [12149.768736]  do_mount+0x80c/0xaf0
> [12149.772774]  __arm64_sys_mount+0xe0/0x18c
> [12149.777504]  do_el0_svc+0x128/0x1dc
> [12149.781715]  el0_sync_handler+0x150/0x250
> [12149.786445]  el0_sync+0x164/0x180

> diff --git a/fs/locks.c b/fs/locks.c
> index b8a31c1c4fff..399c5dbb72c4 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2823,7 +2823,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>  {
>  	struct inode *inode = NULL;
>  	unsigned int fl_pid;
> -	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> +	struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file));
>  
>  	fl_pid = locks_translate_pid(fl, proc_pidns);
>  	/*
> @@ -2901,7 +2901,7 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> -	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> +	struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file));
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/2] proc: Calling proc_flush_task exactly once per task
       [not found] ` <87ftcv1nqe.fsf@x220.int.ebiederm.org>
@ 2020-04-23 17:54   ` Oleg Nesterov
  2020-04-23 19:38     ` Eric W. Biederman
  2020-04-23 19:39   ` [PATCH v2 1/2] proc: Use PIDTYPE_TGID in next_tgid Eric W. Biederman
  2020-04-23 19:39   ` [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
  2 siblings, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2020-04-23 17:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds

On 04/22, Eric W. Biederman wrote:
>
> Eric W. Biederman (2):
>       proc: Use PIDTYPE_TGID in next_tgid
>       proc: Ensure we see the exit of each process tid exactly once
>
>  fs/exec.c           |  5 +----
>  fs/proc/base.c      | 16 ++--------------
>  include/linux/pid.h |  1 +
>  kernel/pid.c        | 16 ++++++++++++++++
>  4 files changed, 20 insertions(+), 18 deletions(-)
>
> ---
> Oleg if these look good I will add these onto my branch of proc changes
> that includes Alexey's changes.

Eric, sorry, where can I find these 2 patches?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 0/2] proc: Calling proc_flush_task exactly once per task
  2020-04-23 17:54   ` [PATCH v2 0/2] proc: Calling proc_flush_task exactly once per task Oleg Nesterov
@ 2020-04-23 19:38     ` Eric W. Biederman
  0 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-23 19:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/22, Eric W. Biederman wrote:
>>
>> Eric W. Biederman (2):
>>       proc: Use PIDTYPE_TGID in next_tgid
>>       proc: Ensure we see the exit of each process tid exactly once
>>
>>  fs/exec.c           |  5 +----
>>  fs/proc/base.c      | 16 ++--------------
>>  include/linux/pid.h |  1 +
>>  kernel/pid.c        | 16 ++++++++++++++++
>>  4 files changed, 20 insertions(+), 18 deletions(-)
>>
>> ---
>> Oleg if these look good I will add these onto my branch of proc changes
>> that includes Alexey's changes.
>
> Eric, sorry, where can I find these 2 patches?

Did I not post them?  Apologies.  I will post them now.

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v2 1/2] proc: Use PIDTYPE_TGID in next_tgid
       [not found] ` <87ftcv1nqe.fsf@x220.int.ebiederm.org>
  2020-04-23 17:54   ` [PATCH v2 0/2] proc: Calling proc_flush_task exactly once per task Oleg Nesterov
@ 2020-04-23 19:39   ` Eric W. Biederman
  2020-04-24 17:29     ` Oleg Nesterov
  2020-04-23 19:39   ` [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
  2 siblings, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-23 19:39 UTC (permalink / raw)
  To: LKML
  Cc: Linux FS Devel, Alexey Dobriyan, Alexey Gladkov, Andrew Morton,
	Oleg Nesterov, Alexey Gladkov, Linus Torvalds


Combine the pid_task and thes test has_group_leader_pid into a single
dereference by using pid_task(PIDTYPE_TGID).

This makes the code simpler and proof against needing to even think
about any shenanigans that de_thread might get up to.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/base.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2868bff1a142..a48b4d4056a9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3360,20 +3360,8 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
 	pid = find_ge_pid(iter.tgid, ns);
 	if (pid) {
 		iter.tgid = pid_nr_ns(pid, ns);
-		iter.task = pid_task(pid, PIDTYPE_PID);
-		/* What we to know is if the pid we have find is the
-		 * pid of a thread_group_leader.  Testing for task
-		 * being a thread_group_leader is the obvious thing
-		 * todo but there is a window when it fails, due to
-		 * the pid transfer logic in de_thread.
-		 *
-		 * So we perform the straight forward test of seeing
-		 * if the pid we have found is the pid of a thread
-		 * group leader, and don't worry if the task we have
-		 * found doesn't happen to be a thread group leader.
-		 * As we don't care in the case of readdir.
-		 */
-		if (!iter.task || !has_group_leader_pid(iter.task)) {
+		iter.task = pid_task(pid, PIDTYPE_TGID);
+		if (!iter.task) {
 			iter.tgid += 1;
 			goto retry;
 		}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
       [not found] ` <87ftcv1nqe.fsf@x220.int.ebiederm.org>
  2020-04-23 17:54   ` [PATCH v2 0/2] proc: Calling proc_flush_task exactly once per task Oleg Nesterov
  2020-04-23 19:39   ` [PATCH v2 1/2] proc: Use PIDTYPE_TGID in next_tgid Eric W. Biederman
@ 2020-04-23 19:39   ` Eric W. Biederman
  2020-04-23 20:28     ` Linus Torvalds
  2020-04-24 17:39     ` Oleg Nesterov
  2 siblings, 2 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-23 19:39 UTC (permalink / raw)
  To: LKML
  Cc: Linux FS Devel, Alexey Dobriyan, Alexey Gladkov, Andrew Morton,
	Oleg Nesterov, Alexey Gladkov, Linus Torvalds


When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
   has_group_leader_pid is called, and reording some of the assignments
   should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
>	/*
>	 * This can only happen if the caller is de_thread().
>	 * FIXME: this is the temporary hack, we should teach
>	 * posix-cpu-timers to handle this case correctly.
>	 */
>	if (unlikely(has_group_leader_pid(tsk)))
>		posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/exec.c           |  5 +----
 include/linux/pid.h |  1 +
 kernel/pid.c        | 16 ++++++++++++++++
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..9b60f927afd7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,11 +1186,8 @@ static int de_thread(struct task_struct *tsk)
 
 		/* Become a process group leader with the old leader's pid.
 		 * The old leader becomes a thread of the this thread group.
-		 * Note: The old leader also uses this pid until release_task
-		 *       is called.  Odd but simple and correct.
 		 */
-		tsk->pid = leader->pid;
-		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		exchange_tids(tsk, leader);
 		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..2159ffca63fc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -102,6 +102,7 @@ extern void attach_pid(struct task_struct *task, enum pid_type);
 extern void detach_pid(struct task_struct *task, enum pid_type);
 extern void change_pid(struct task_struct *task, enum pid_type,
 			struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..4ece32d8791a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -363,6 +363,22 @@ void change_pid(struct task_struct *task, enum pid_type type,
 	attach_pid(task, type);
 }
 
+void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
+{
+	/* pid_links[PIDTYPE_PID].next is always NULL */
+	struct pid *npid = READ_ONCE(ntask->thread_pid);
+	struct pid *opid = READ_ONCE(otask->thread_pid);
+
+	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(ntask->thread_pid, opid);
+	rcu_assign_pointer(otask->thread_pid, npid);
+	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(ntask->pid, pid_nr(opid));
+	WRITE_ONCE(otask->pid, pid_nr(npid));
+}
+
 /* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v13 2/7] proc: allow to mount many instances of proc in one pid namespace
  2020-04-23 12:16     ` Eric W. Biederman
@ 2020-04-23 20:01       ` Alexey Gladkov
  0 siblings, 0 replies; 54+ messages in thread
From: Alexey Gladkov @ 2020-04-23 20:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Kernel Hardening, Linux API, Linux FS Devel,
	Linux Security Module, Akinobu Mita, Alexander Viro,
	Alexey Dobriyan, Andrew Morton, Andy Lutomirski, Daniel Micay,
	Djalal Harouni, Dmitry V . Levin, Greg Kroah-Hartman,
	Ingo Molnar, J . Bruce Fields, Jeff Layton, Jonathan Corbet,
	Kees Cook, Linus Torvalds, Oleg Nesterov, David Howells

On Thu, Apr 23, 2020 at 07:16:07AM -0500, Eric W. Biederman wrote:
> 
> I took a quick look and there is at least one other use in security/tomoyo/realpath.c:
> 
> static char *tomoyo_get_local_path(struct dentry *dentry, char * const buffer,
> 				   const int buflen)
> {
> 	struct super_block *sb = dentry->d_sb;
> 	char *pos = tomoyo_get_dentry_path(dentry, buffer, buflen);
> 
> 	if (IS_ERR(pos))
> 		return pos;
> 	/* Convert from $PID to self if $PID is current thread. */
> 	if (sb->s_magic == PROC_SUPER_MAGIC && *pos == '/') {
> 		char *ep;
> 		const pid_t pid = (pid_t) simple_strtoul(pos + 1, &ep, 10);
> 
> 		if (*ep == '/' && pid && pid ==
> 		    task_tgid_nr_ns(current, sb->s_fs_info)) {
> 			pos = ep - 5;
> 			if (pos < buffer)
> 				goto out;
> 			memmove(pos, "/self", 5);
> 		}
> 		goto prepend_filesystem_name;
> 	}

Ooops. I missed this one. I thought I found all such cases.

> Can you make the fixes to locks.c and tomoyo a couple of standalone
> fixes that should be inserted before your patch?

Sure.

> On the odd chance there is a typo they will bisect better, as well
> as just keeping this patch and it's description from expanding in size.
> So that things are small enough for people to really look at and review.
> 
> The fix itself looks fine.
> 
> Thank you,
> Eric
> 
> 
> Alexey Gladkov <gladkov.alexey@gmail.com> writes:
> 
> > Fixed getting proc_pidns in the lock_get_status() and locks_show() directly from
> > the superblock, which caused a crash:
> >
> > === arm64 ===
> > [12140.366814] LTP: starting proc01 (proc01 -m 128)
> > [12149.580943] ==================================================================
> > [12149.589521] BUG: KASAN: out-of-bounds in pid_nr_ns+0x2c/0x90 pid_nr_ns at kernel/pid.c:456
> > [12149.595939] Read of size 4 at addr 1bff000bfa8c0388 by task = proc01/50298
> > [12149.603392] Pointer tag: [1b], memory tag: [fe]
> >
> > [12149.610906] CPU: 69 PID: 50298 Comm: proc01 Tainted: G L 5.7.0-rc2-next-20200422 #6
> > [12149.620585] Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
> > [12149.631074] Call trace:
> > [12149.634304]  dump_backtrace+0x0/0x22c
> > [12149.638745]  show_stack+0x28/0x34
> > [12149.642839]  dump_stack+0x104/0x194
> > [12149.647110]  print_address_description+0x70/0x3a4
> > [12149.652576]  __kasan_report+0x188/0x238
> > [12149.657169]  kasan_report+0x3c/0x58
> > [12149.661430]  check_memory_region+0x98/0xa0
> > [12149.666303]  __hwasan_load4_noabort+0x18/0x20
> > [12149.671431]  pid_nr_ns+0x2c/0x90
> > [12149.675446]  locks_translate_pid+0xf4/0x1a0
> > [12149.680382]  locks_show+0x68/0x110
> > [12149.684536]  seq_read+0x380/0x930
> > [12149.688604]  pde_read+0x5c/0x78
> > [12149.692498]  proc_reg_read+0x74/0xc0
> > [12149.696813]  __vfs_read+0x84/0x1d0
> > [12149.700939]  vfs_read+0xec/0x124
> > [12149.704889]  ksys_read+0xb0/0x120
> > [12149.708927]  __arm64_sys_read+0x54/0x88
> > [12149.713485]  do_el0_svc+0x128/0x1dc
> > [12149.717697]  el0_sync_handler+0x150/0x250
> > [12149.722428]  el0_sync+0x164/0x180
> >
> > [12149.728672] Allocated by task 1:
> > [12149.732624]  __kasan_kmalloc+0x124/0x188
> > [12149.737269]  kasan_kmalloc+0x10/0x18
> > [12149.741568]  kmem_cache_alloc_trace+0x2e4/0x3d4
> > [12149.746820]  proc_fill_super+0x48/0x1fc
> > [12149.751377]  vfs_get_super+0xcc/0x170
> > [12149.755760]  get_tree_nodev+0x28/0x34
> > [12149.760143]  proc_get_tree+0x24/0x30
> > [12149.764439]  vfs_get_tree+0x54/0x158
> > [12149.768736]  do_mount+0x80c/0xaf0
> > [12149.772774]  __arm64_sys_mount+0xe0/0x18c
> > [12149.777504]  do_el0_svc+0x128/0x1dc
> > [12149.781715]  el0_sync_handler+0x150/0x250
> > [12149.786445]  el0_sync+0x164/0x180
> 
> > diff --git a/fs/locks.c b/fs/locks.c
> > index b8a31c1c4fff..399c5dbb72c4 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -2823,7 +2823,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> >  {
> >  	struct inode *inode = NULL;
> >  	unsigned int fl_pid;
> > -	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> > +	struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file));
> >  
> >  	fl_pid = locks_translate_pid(fl, proc_pidns);
> >  	/*
> > @@ -2901,7 +2901,7 @@ static int locks_show(struct seq_file *f, void *v)
> >  {
> >  	struct locks_iterator *iter = f->private;
> >  	struct file_lock *fl, *bfl;
> > -	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> > +	struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file));
> >  
> >  	fl = hlist_entry(v, struct file_lock, fl_link);
> >  
> 
> Eric
> 

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-23 19:39   ` [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
@ 2020-04-23 20:28     ` Linus Torvalds
  2020-04-24  3:33       ` Eric W. Biederman
  2020-04-24 17:39     ` Oleg Nesterov
  1 sibling, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2020-04-23 20:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Oleg Nesterov, Alexey Gladkov

On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> +       /* pid_links[PIDTYPE_PID].next is always NULL */
> +       struct pid *npid = READ_ONCE(ntask->thread_pid);
> +       struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> +       rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> +       rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> +       rcu_assign_pointer(ntask->thread_pid, opid);
> +       rcu_assign_pointer(otask->thread_pid, npid);
> +       WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
> +       WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
> +       WRITE_ONCE(ntask->pid, pid_nr(opid));
> +       WRITE_ONCE(otask->pid, pid_nr(npid));
> +}

This function is _very_ hard to read as written.

It really wants a helper function to do the swapping per hlist_head
and hlist_node, I think. And "opid/npid" is very hard to see, and the
naming doesn't make much sense (if it's an "exchange", then why is it
"old/new" - they're symmetric).

At least something like

        struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID;
        struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID;
        struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID;
        struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID;

        struct hlist_node *old_first_node = old_pid_hlist->first;
        struct hlist_node *new_first_node = new_pid_hlist->first;

and then trying to group up the first/pprev/thread_pid/pid  accesses
so that you them together, and using a helper function that does the
whole switch, so that you'd have

        /* Move new node to old hlist, and update thread_pid/pid fields */
        insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node);
        rcu_assign_pointer(ntask->thread_pid, opid);
        WRITE_ONCE(ntask->pid, pid_nr(opid));

        /* Move old new to new hlist, and update thread_pid/pid fields */
        insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node);
        rcu_assign_pointer(otask->thread_pid, npid);
        WRITE_ONCE(otask->pid, pid_nr(npid));

or something roughly like that.

(And the above still uses "old/new", which as mentioned sounds wrong
to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I
did this in my MUA, so I could have gotten the names and types wrong
etc).

I think that would make it look at least _slightly_ less like random
line noise and easier to follow.

But maybe even a rcu_hlist_swap() helper? We have one for regular
lists. Do we really have to do it all written out, not do it with a
"remove and reinsert" model?

                Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-23 20:28     ` Linus Torvalds
@ 2020-04-24  3:33       ` Eric W. Biederman
  2020-04-24 18:02         ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-24  3:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Oleg Nesterov, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +       /* pid_links[PIDTYPE_PID].next is always NULL */
>> +       struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +       struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +       rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +       rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +       rcu_assign_pointer(ntask->thread_pid, opid);
>> +       rcu_assign_pointer(otask->thread_pid, npid);
>> +       WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
>> +       WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
>> +       WRITE_ONCE(ntask->pid, pid_nr(opid));
>> +       WRITE_ONCE(otask->pid, pid_nr(npid));
>> +}
>
> This function is _very_ hard to read as written.
>
> It really wants a helper function to do the swapping per hlist_head
> and hlist_node, I think. And "opid/npid" is very hard to see, and the
> naming doesn't make much sense (if it's an "exchange", then why is it
> "old/new" - they're symmetric).
>
> At least something like
>
>         struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID;
>         struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID;
>         struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID;
>         struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID;
>
>         struct hlist_node *old_first_node = old_pid_hlist->first;
>         struct hlist_node *new_first_node = new_pid_hlist->first;
>
> and then trying to group up the first/pprev/thread_pid/pid  accesses
> so that you them together, and using a helper function that does the
> whole switch, so that you'd have
>
>         /* Move new node to old hlist, and update thread_pid/pid fields */
>         insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node);
>         rcu_assign_pointer(ntask->thread_pid, opid);
>         WRITE_ONCE(ntask->pid, pid_nr(opid));
>
>         /* Move old new to new hlist, and update thread_pid/pid fields */
>         insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node);
>         rcu_assign_pointer(otask->thread_pid, npid);
>         WRITE_ONCE(otask->pid, pid_nr(npid));
>
> or something roughly like that.
>
> (And the above still uses "old/new", which as mentioned sounds wrong
> to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I
> did this in my MUA, so I could have gotten the names and types wrong
> etc).
>
> I think that would make it look at least _slightly_ less like random
> line noise and easier to follow.
>
> But maybe even a rcu_hlist_swap() helper? We have one for regular
> lists. Do we really have to do it all written out, not do it with a
> "remove and reinsert" model?

At one point my brain I had forgetten that xchg can not take two memory
arguments and had hoped to be able to provide stronger guarnatees than I
can.  Which is where I think the structure of exchange_pids came from.

I do agree the clearer we can write things, the easier it is for
someone else to come along and follow.

We can not use a remove and reinser model because that does break rcu
accesses, and complicates everything else.  With a swap model we have
the struct pids pointer at either of the tasks that are swapped but
never at nothing.  With a remove/reinsert model we have to deal the
addittional possibility of the pids not pointing at a thread at all
which can result in things like signals not being delivered at all.

I played with it a bit and the best I have been able to come up is:

	void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
	{
		struct hlist_node **lpprev = left->pprev;
		struct hlist_node **rpprev = right->pprev;
	
		rcu_assign_pointer(*lpprev, right);
		rcu_assign_pointer(*rpprev, left);
		WRITE_ONCE(left->pprev,  rpprev);
		WRITE_ONCE(right->pprev, lpprev);
	}
	
	void exchange_tids(struct task_struct *left, struct task_struct *right)
	{
		struct hlist_node *lnode = &left->pid_links[PIDTYPE_PID];
		struct hlist_node *rnode = &right->pid_links[PIDTYPE_PID];
		struct pid *lpid, *rpid;
	
		/* Replace the single entry tid lists with each other */
		hlist_swap_before_rcu(lnode, rnode);

		/* Swap thread_pid */
		rpid = left->thread_pid;
		lpid = right->thread_pid;
		rcu_assign_pointer(left->thread_pid, lpid);
		rcu_assign_pointer(right->thread_pid, rpid);

                /* Swap the cached pid value */
		WRITE_ONCE(left->pid, pid_nr(lpid));
		WRITE_ONCE(right->pid, pid_nr(rpid));
	}

hlists because they are not doubly linked can legitimately swap their
beginnings or their tails.  Something that regular lists can not,
and I think that is exactly the general purpose semantic I want.

Does that look a little more readable?

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 1/2] proc: Use PIDTYPE_TGID in next_tgid
  2020-04-23 19:39   ` [PATCH v2 1/2] proc: Use PIDTYPE_TGID in next_tgid Eric W. Biederman
@ 2020-04-24 17:29     ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2020-04-24 17:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds

On 04/23, Eric W. Biederman wrote:
>
> @@ -3360,20 +3360,8 @@ static struct tgid_iter next_tgid(struct pid_namespace *ns, struct tgid_iter ite
>  	pid = find_ge_pid(iter.tgid, ns);
>  	if (pid) {
>  		iter.tgid = pid_nr_ns(pid, ns);
> -		iter.task = pid_task(pid, PIDTYPE_PID);
> -		/* What we to know is if the pid we have find is the
> -		 * pid of a thread_group_leader.  Testing for task
> -		 * being a thread_group_leader is the obvious thing
> -		 * todo but there is a window when it fails, due to
> -		 * the pid transfer logic in de_thread.
> -		 *
> -		 * So we perform the straight forward test of seeing
> -		 * if the pid we have found is the pid of a thread
> -		 * group leader, and don't worry if the task we have
> -		 * found doesn't happen to be a thread group leader.
> -		 * As we don't care in the case of readdir.
> -		 */
> -		if (!iter.task || !has_group_leader_pid(iter.task)) {
> +		iter.task = pid_task(pid, PIDTYPE_TGID);
> +		if (!iter.task) {
>  			iter.tgid += 1;
>  			goto retry;
>  		}

Acked-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-23 19:39   ` [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
  2020-04-23 20:28     ` Linus Torvalds
@ 2020-04-24 17:39     ` Oleg Nesterov
  2020-04-24 18:10       ` Eric W. Biederman
                         ` (2 more replies)
  1 sibling, 3 replies; 54+ messages in thread
From: Oleg Nesterov @ 2020-04-24 17:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds

On 04/23, Eric W. Biederman wrote:
>
> When the thread group leader changes during exec and the old leaders
> thread is reaped proc_flush_pid

This is off-topic, but let me mention this before I forget...

Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean
that in this case release_task() leaks thread_pid.

> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> +	/* pid_links[PIDTYPE_PID].next is always NULL */
> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
> +	struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(ntask->thread_pid, opid);
> +	rcu_assign_pointer(otask->thread_pid, npid);
> +	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
> +	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
> +	WRITE_ONCE(ntask->pid, pid_nr(opid));
> +	WRITE_ONCE(otask->pid, pid_nr(npid));
> +}

Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last
user of has_group_leader_pid().

I think that we should change lookup_task() to return "struct *pid", this
should simplify the code... Note that none of its callers needs task_struct.

And, instead of thread_group_leader/has_group_leader_pid checks we should
use pid_has_task(TGID).

After that, this patch should kill has_group_leader_pid().

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-24  3:33       ` Eric W. Biederman
@ 2020-04-24 18:02         ` Linus Torvalds
  2020-04-24 18:46           ` Linus Torvalds
  2020-04-24 19:51           ` Eric W. Biederman
  0 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2020-04-24 18:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Oleg Nesterov, Alexey Gladkov

On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> At one point my brain I had forgetten that xchg can not take two memory
> arguments and had hoped to be able to provide stronger guarnatees than I
> can.  Which is where I think the structure of exchange_pids came from.

Note that even if we were to have a "exchange two memory locations
atomically" instruction (and we don't - even a "double cmpxchg" is
actually just a double-_sized_ one, not a two different locations
one), I'm not convinced it makes sense.

There's no way to _walk_ two lists atomically. Any user will only ever
walk one or the other, so it's not sensible to try to make the two
list updates be atomic.

And if a user for some reason walks both, the walking itself will
obviously then be racy - it does one or the other first, and can see
either the old state, or the new state - or see _neither_ (ie if you
walk it twice, you might see neither task, or you might see both, just
depending on order or walk).

> I do agree the clearer we can write things, the easier it is for
> someone else to come along and follow.

Your alternate write of the function seems a bit more readable to me,
even if the main effect might be just that it was split up a bit and
added a few comments and whitespace.

So I'm more happier with that one. That said:

> We can not use a remove and reinser model because that does break rcu
> accesses, and complicates everything else.  With a swap model we have
> the struct pids pointer at either of the tasks that are swapped but
> never at nothing.

I'm not suggesting removing the pid entirely - like making task->pid
be NULL. I'm literally suggesting just doing the RCU list operations
as "remove and re-insert".

And that shouldn't break anything, for the same reason that an atomic
exchange doesn't make sense: you can only ever walk one of the lists
at a time. And regardless of how you walk it, you might not see the
new state (or the old state) reliably.

Put another way:

>         void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>         {
>                 struct hlist_node **lpprev = left->pprev;
>                 struct hlist_node **rpprev = right->pprev;
>
>                 rcu_assign_pointer(*lpprev, right);
>                 rcu_assign_pointer(*rpprev, left);

These are the only two assignments that matter for anything that walks
the list (the pprev ones are for things that change the list, and they
have to have exclusions in place).

And those two writes cannot be atomic anyway, so you fundamentally
will always be in the situation that a walker can miss one of the
tasks.

Which is why I think it would be ok to just do the RCU list swap as a
"remove left, remove right, add left, add right" operation. It doesn't
seem fundamentally different to a walker than the "switch left/right"
operation, and it seems much simpler.

Is there something I'm missing?

But I'm *not* suggesting that we change these simple parts to be
"remove thread_pid or pid pointer, and then insert a new one":

>                 /* Swap thread_pid */
>                 rpid = left->thread_pid;
>                 lpid = right->thread_pid;
>                 rcu_assign_pointer(left->thread_pid, lpid);
>                 rcu_assign_pointer(right->thread_pid, rpid);
>
>                 /* Swap the cached pid value */
>                 WRITE_ONCE(left->pid, pid_nr(lpid));
>                 WRITE_ONCE(right->pid, pid_nr(rpid));
>         }

because I agree that for things that don't _walk_ the list, but just
look up "thread_pid" vs "pid" atomically but asynchronously, we
obviously need to get one or the other, not some kind of "empty"
state.

> Does that look a little more readable?

Regardless, I find your new version at least a lot more readable, so
I'm ok with it.

It looks like Oleg found an independent issue, though.

                  Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-24 17:39     ` Oleg Nesterov
@ 2020-04-24 18:10       ` Eric W. Biederman
  2020-04-24 20:50       ` [PATCH] proc: Put thread_pid in release_task not proc_flush_pid Eric W. Biederman
       [not found]       ` <87mu6ymkea.fsf_-_@x220.int.ebiederm.org>
  2 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-24 18:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/23, Eric W. Biederman wrote:
>>
>> When the thread group leader changes during exec and the old leaders
>> thread is reaped proc_flush_pid
>
> This is off-topic, but let me mention this before I forget...
>
> Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean
> that in this case release_task() leaks thread_pid.

Good catch.  Wow.  I seem to be introducing almost as many bugs as I am
fixing right now.  Ouch.

>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +	/* pid_links[PIDTYPE_PID].next is always NULL */
>> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +	struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(ntask->thread_pid, opid);
>> +	rcu_assign_pointer(otask->thread_pid, npid);
>> +	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
>> +	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
>> +	WRITE_ONCE(ntask->pid, pid_nr(opid));
>> +	WRITE_ONCE(otask->pid, pid_nr(npid));
>> +}
>
> Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last
> user of has_group_leader_pid().
>
> I think that we should change lookup_task() to return "struct *pid", this
> should simplify the code... Note that none of its callers needs task_struct.
>
> And, instead of thread_group_leader/has_group_leader_pid checks we should
> use pid_has_task(TGID).

Somehow I thought we could get away without fiddling with that right
now, but on second glance I can see the races.

I played with this earlier and I agree returning a struct pid *
is desirable.  I will see if I can track down the patches I was
playing with as that definitely needs to get fixed first.

> After that, this patch should kill has_group_leader_pid().
>
> What do you think?

I agree completely.  has_group_leader_pid is the same as
thread_group_leader after this so should be removed.  Especially as it
won't have any users.

There are several other potential cleanups as well.  Such as not
using a hlist for PIDTYPE_PID.  Which would allow us to run the hlists
through struct signal_struct instead.  I think that would clean things
up but that touches so many things it may just be pointless code churn.

Just for mentioning I am thinking we should rename PIDTYPE_PID to
PIDTYPE_TID just to create a distance in peoples minds between
the kernel concepts and the user space concepts.

Eric


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-24 18:02         ` Linus Torvalds
@ 2020-04-24 18:46           ` Linus Torvalds
  2020-04-24 19:51           ` Eric W. Biederman
  1 sibling, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2020-04-24 18:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Oleg Nesterov, Alexey Gladkov

On Fri, Apr 24, 2020 at 11:02 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  [..] even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one

Historical accuracy side note: the 68020 actually had a CAS2 that was
"two different locations".

Maybe somebody else did too.

            Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-24 18:02         ` Linus Torvalds
  2020-04-24 18:46           ` Linus Torvalds
@ 2020-04-24 19:51           ` Eric W. Biederman
  2020-04-24 20:10             ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-24 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Oleg Nesterov, Alexey Gladkov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> At one point my brain I had forgetten that xchg can not take two memory
>> arguments and had hoped to be able to provide stronger guarnatees than I
>> can.  Which is where I think the structure of exchange_pids came from.
>
> Note that even if we were to have a "exchange two memory locations
> atomically" instruction (and we don't - even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one), I'm not convinced it makes sense.
>
> There's no way to _walk_ two lists atomically. Any user will only ever
> walk one or the other, so it's not sensible to try to make the two
> list updates be atomic.
>
> And if a user for some reason walks both, the walking itself will
> obviously then be racy - it does one or the other first, and can see
> either the old state, or the new state - or see _neither_ (ie if you
> walk it twice, you might see neither task, or you might see both, just
> depending on order or walk).
>
>> I do agree the clearer we can write things, the easier it is for
>> someone else to come along and follow.
>
> Your alternate write of the function seems a bit more readable to me,
> even if the main effect might be just that it was split up a bit and
> added a few comments and whitespace.
>
> So I'm more happier with that one. That said:
>
>> We can not use a remove and reinser model because that does break rcu
>> accesses, and complicates everything else.  With a swap model we have
>> the struct pids pointer at either of the tasks that are swapped but
>> never at nothing.
>
> I'm not suggesting removing the pid entirely - like making task->pid
> be NULL. I'm literally suggesting just doing the RCU list operations
> as "remove and re-insert".
>
> And that shouldn't break anything, for the same reason that an atomic
> exchange doesn't make sense: you can only ever walk one of the lists
> at a time. And regardless of how you walk it, you might not see the
> new state (or the old state) reliably.
>
> Put another way:
>
>>         void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>>         {
>>                 struct hlist_node **lpprev = left->pprev;
>>                 struct hlist_node **rpprev = right->pprev;
>>
>>                 rcu_assign_pointer(*lpprev, right);
>>                 rcu_assign_pointer(*rpprev, left);
>
> These are the only two assignments that matter for anything that walks
> the list (the pprev ones are for things that change the list, and they
> have to have exclusions in place).
>
> And those two writes cannot be atomic anyway, so you fundamentally
> will always be in the situation that a walker can miss one of the
> tasks.
>
> Which is why I think it would be ok to just do the RCU list swap as a
> "remove left, remove right, add left, add right" operation. It doesn't
> seem fundamentally different to a walker than the "switch left/right"
> operation, and it seems much simpler.
>
> Is there something I'm missing?


The problem with

remove
remove
add
add
is:

A lookup that hit between the remove and the add could return nothing.

The function kill_pid_info does everything it can to handle this case
today does:

int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
{
	int error = -ESRCH;
	struct task_struct *p;

	for (;;) {
		rcu_read_lock();
		p = pid_task(pid, PIDTYPE_PID);
		if (p)
			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
		rcu_read_unlock();
		if (likely(!p || error != -ESRCH))
			return error;

		/*
		 * The task was unhashed in between, try again.  If it
		 * is dead, pid_task() will return NULL, if we race with
		 * de_thread() it will find the new leader.
		 */
	}
}

Now kill_pid_info is signalling the entire task and is just using
PIDTYPE_PID to find a thread in the task.

With the remove then add model there will be a point where pid_task
will return nothing, because ever so briefly the lists will be
empty.

However with an actually swap we will find a task and kill_pid_info
will work.  It pathloglical cases lock_task_sighand might have to loop
and we would need to find the new task that has the given pid.  But
kill_pid_info is guaranteed to work with swaps and will fail with
remove add.


> But I'm *not* suggesting that we change these simple parts to be
> "remove thread_pid or pid pointer, and then insert a new one":
>
>>                 /* Swap thread_pid */
>>                 rpid = left->thread_pid;
>>                 lpid = right->thread_pid;
>>                 rcu_assign_pointer(left->thread_pid, lpid);
>>                 rcu_assign_pointer(right->thread_pid, rpid);
>>
>>                 /* Swap the cached pid value */
>>                 WRITE_ONCE(left->pid, pid_nr(lpid));
>>                 WRITE_ONCE(right->pid, pid_nr(rpid));
>>         }
>
> because I agree that for things that don't _walk_ the list, but just
> look up "thread_pid" vs "pid" atomically but asynchronously, we
> obviously need to get one or the other, not some kind of "empty"
> state.

For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
pointers to the appropriate task.  Only for PIDTYPE_PGID and PIDTYPE_SID
do these become lists in practice.

That not-really-a-list status allows for signel delivery to indivdual
processes to happen in rcu context.  Which is where we would get into
trouble with add/remove.

Since signals are guaranteed to be delivered to the entire session
or the entire process group all of the list walking happens under
the tasklist_lock currently.  Which really keeps list walking from
being a concern.

>> Does that look a little more readable?
>
> Regardless, I find your new version at least a lot more readable, so
> I'm ok with it.

Good. Then I will finish cleaning it up and go with that version.

> It looks like Oleg found an independent issue, though.

Yes, and I will definitely work through those.

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-24 19:51           ` Eric W. Biederman
@ 2020-04-24 20:10             ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2020-04-24 20:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Oleg Nesterov, Alexey Gladkov

On Fri, Apr 24, 2020 at 12:54 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> The problem with
>
> remove
> remove
> add
> add
> is:
>
> A lookup that hit between the remove and the add could return nothing.

Argh. Because that thing doesn't actually _search_ the list, it just
wants to pick any (first) entry. So it doesn't actually care what it
gets, just that it gets something.

Ok, I see.

> For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
> pointers to the appropriate task.  Only for PIDTYPE_PGID and PIDTYPE_SID
> do these become lists in practice.

Yeah, that's what confused me. Ok, please add a comment about the
"single-entry list" issue, and I'm happy.

            Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH] proc: Put thread_pid in release_task not proc_flush_pid
  2020-04-24 17:39     ` Oleg Nesterov
  2020-04-24 18:10       ` Eric W. Biederman
@ 2020-04-24 20:50       ` Eric W. Biederman
       [not found]       ` <87mu6ymkea.fsf_-_@x220.int.ebiederm.org>
  2 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-24 20:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds


Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.

Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.

When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.

Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/base.c | 1 -
 kernel/exit.c  | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6042b646ab27..42f43c7b9669 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3274,7 +3274,6 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
 void proc_flush_pid(struct pid *pid)
 {
 	proc_invalidate_siblings_dcache(&pid->inodes, &pid->lock);
-	put_pid(pid);
 }
 
 static struct dentry *proc_pid_instantiate(struct dentry * dentry,
diff --git a/kernel/exit.c b/kernel/exit.c
index 389a88cb3081..ce2a75bc0ade 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -219,6 +219,7 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	proc_flush_pid(thread_pid);
+	put_pid(thread_pid);
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task
       [not found]         ` <87blnemj5t.fsf_-_@x220.int.ebiederm.org>
@ 2020-04-26 17:22           ` Oleg Nesterov
  2020-04-27 11:51             ` Eric W. Biederman
  2020-04-27 10:32           ` Thomas Gleixner
  1 sibling, 1 reply; 54+ messages in thread
From: Oleg Nesterov @ 2020-04-26 17:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds, Thomas Gleixner,
	Paul E. McKenney

Eric,

I am sick today and can't read the code, but I feel this patch is not
right ... please correct me.

So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD
is false we roughly have

	task = pid_task(pid, PIDTYPE_TGID);			// lookup_task()

	/* WINDOW */

	timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID)	// posix_cpu_timer_create()

Now suppose that we race with mt-exec and this "task" is the old leader;
it can be release_task()'ed in the WINDOW above and then get_task_pid()
will return NULL.

That is why I suggested to change lookup_task() to return "struct pid*"
to eliminate the pid -> task -> pid transition.

Apart from the same_thread_group() check for the "thread" case we do not
need task_struct at all, lookup_task() can do

	if (thread) {
		p = pid_task(pid, PIDTYPE_PID);
		if (p && !same_thread_group(p, current))
			pid = NULL;
	} else {
		... gettime check ...

		if (!pid_has_task(pid, PIDTYPE_TGID))
			pid = NULL;
	}

	return pid;

No?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu
       [not found]         ` <875zdmmj4y.fsf_-_@x220.int.ebiederm.org>
@ 2020-04-26 17:40           ` Linus Torvalds
  2020-04-27 14:28             ` Eric W. Biederman
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2020-04-26 17:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> To support this add hlist_swap_before_rcu.  An hlist primitive that
> will allow swaping the leading sections of two tasks.  For exchanging
> the task pids it will just be swapping the hlist_heads of two single
> entry lists.  But the functionality is more general.

So I have no problems with the series now - the code is much more
understandable. Partly because of the split-up, partly because of the
comments, and partly because you explained the special case and why it
was a valid thing to do...

However, I did start thinking about this case again.

I still don't think the "swap entry" macro is necessarily useful in
_general_ - any time it's an actual individual entry, that swap macro
doesn't really work.

So the only reason it works here is because you're actually swapping
the whole list.

But that, in turn, shouldn't be using that "first node" model at all,
it should use the hlist_head. That would have made it a lot more
obvious what is actually going on to me.

Now, the comment very much talks about the head case, but the code
still looks like it's swapping a non-head thing.

I guess the code technically _works_ with "swap two list ends", but
does that actually really make sense as an operation?

So I no longer hate how this patch looks, but I wonder if we should
just make the whole "this node is the *first* node" a bit more
explicit in both the caller and in the swapping code.

It could be as simple as replacing just the conceptual types and
names, so instead of some "pnode1" double-indirect node pointer, we'd
have

        struct hlist_head *left_head = container_of(left->pprev,
struct hlist_head, first);
        struct hlist_head *right_head = container_of(right->pprev,
struct hlist_head, first);

and then the code would do

        rcu_assign_pointer(right_head->first, left);
        rcu_assign_pointer(left_head->first, right);
        WRITE_ONCE(left->pprev,  &right_head->first);
        WRITE_ONCE(right->pprev,  &left_head->first);

which should generate the exact same code, but makes it clear that
what we're doing is switching the whole hlist when given the first
entries.

Doesn't that make what it actually does a lot more understandable? The
*pnode1/pnode2 games are somewhat opaque, but with that type and name
change and using "container_of()", the code now fairly naturally reads
as "oh, we're changing the first pointers in the list heads, and
making the nodes point back to them" .

Again - the current function _works_ with swapping two hlists in the
middle (not two entries - it swaps the whole list starting at that
entry!), so your current patch is in some ways "more generic". I'm
just suggesting that the generic case doesn't make much sense, and
that the "we know the first entries, swap the lists" actually is what
the real use is, and writing it as such makes the code easier to
understand.

But I'm not going to insist on this, so this is more an RFC. Maybe
people disagree, and/or have an actual use case for that "break two
hlists in the middle, swap the ends" that I find unlikely...

(NOTE: My "convert to hlist_head" code _works_ for that case too
because the code generation is the same! But it would be really really
confusing for that code to be used for anything but the first entry).

                       Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 1/6] posix-cpu-timers: Always call __get_task_for_clock holding rcu_read_lock
       [not found]         ` <87h7x6mj6h.fsf_-_@x220.int.ebiederm.org>
@ 2020-04-27  9:43           ` Thomas Gleixner
  2020-04-27 11:53             ` Eric W. Biederman
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2020-04-27  9:43 UTC (permalink / raw)
  To: Eric W. Biederman, LKML
  Cc: Linux FS Devel, Alexey Dobriyan, Alexey Gladkov, Andrew Morton,
	Alexey Gladkov, Linus Torvalds, Oleg Nesterov, Paul E. McKenney

ebiederm@xmission.com (Eric W. Biederman) writes:

> This allows the getref flag to be removed and the callers can
> than take a task reference if needed.

That changelog lacks any form of information why this should be
changed. I can see the point vs. patch 2, but pretty please put coherent
explanations into each patch.

> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/time/posix-cpu-timers.c | 41 +++++++++++++++++-----------------
>  1 file changed, 21 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
> index 2fd3b3fa68bf..eba41c70f0f0 100644
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -86,36 +86,34 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>  }
>  
>  static struct task_struct *__get_task_for_clock(const clockid_t clock,
> -						bool getref, bool gettime)
> +						bool gettime)
>  {
>  	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
>  	const pid_t pid = CPUCLOCK_PID(clock);
> -	struct task_struct *p;
>  
>  	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
>  		return NULL;
>  
> -	rcu_read_lock();
> -	p = lookup_task(pid, thread, gettime);
> -	if (p && getref)
> -		get_task_struct(p);
> -	rcu_read_unlock();
> -	return p;
> +	return lookup_task(pid, thread, gettime);
>  }
>  
>  static inline struct task_struct *get_task_for_clock(const clockid_t clock)
>  {
> -	return __get_task_for_clock(clock, true, false);
> +	return __get_task_for_clock(clock, false);
>  }
>  
>  static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
>  {
> -	return __get_task_for_clock(clock, true, true);
> +	return __get_task_for_clock(clock, true);
>  }
>  
>  static inline int validate_clock_permissions(const clockid_t clock)
>  {
> -	return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL;
> +	int ret;

New line between declarations and code please.

> +	rcu_read_lock();
> +	ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
> +	rcu_read_unlock();
> +	return ret;
>  }

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task
       [not found]         ` <87blnemj5t.fsf_-_@x220.int.ebiederm.org>
  2020-04-26 17:22           ` [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task Oleg Nesterov
@ 2020-04-27 10:32           ` Thomas Gleixner
  2020-04-27 19:46             ` Eric W. Biederman
  1 sibling, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2020-04-27 10:32 UTC (permalink / raw)
  To: Eric W. Biederman, LKML
  Cc: Linux FS Devel, Alexey Dobriyan, Alexey Gladkov, Andrew Morton,
	Alexey Gladkov, Linus Torvalds, Oleg Nesterov, Paul E. McKenney

ebiederm@xmission.com (Eric W. Biederman) writes:
> Using pid_task(find_vpid(N), PIDTYPE_TGID) guarantees that if a task
> is found it is at that moment the thread group leader.  Which removes
> the need for the follow on test has_group_leader_pid.
>
> I have reorganized the rest of the code in lookup_task for clarity,
> and created a common return for most of the code.

Sorry, it's way harder to read than the very explicit exits which were
there before.

> The special case for clock_gettime with "pid == gettid" is not my
> favorite.  I strongly suspect it isn't used as gettid is such a pain,
> and passing 0 is much easier.  Still it is easier to keep this special
> case than to do the reasarch that will show it isn't used.

It might be not your favorite, but when I refactored the code I learned
the hard way that one of the test suites has assumptions that
clock_gettime(PROCESS) works from any task of a group and not just for
the group leader. Sure we could fix the test suite, but test code tends
to be copied ...

>  /*
>   * Functions for validating access to tasks.
>   */
> -static struct task_struct *lookup_task(const pid_t pid, bool thread,
> +static struct task_struct *lookup_task(const pid_t which_pid, bool thread,
>  				       bool gettime)
>  {
>  	struct task_struct *p;
> +	struct pid *pid;
>  
>  	/*
>  	 * If the encoded PID is 0, then the timer is targeted at current
>  	 * or the process to which current belongs.
>  	 */
> -	if (!pid)
> +	if (!which_pid)
>  		return thread ? current : current->group_leader;
>  
> -	p = find_task_by_vpid(pid);
> -	if (!p)
> -		return p;
> -
> -	if (thread)
> -		return same_thread_group(p, current) ? p : NULL;
> -
> -	if (gettime) {
> +	pid = find_vpid(which_pid);
> +	if (thread) {
> +		p = pid_task(pid, PIDTYPE_PID);
> +		if (p && !same_thread_group(p, current))
> +			p = NULL;
> +	} else {
>  		/*
>  		 * For clock_gettime(PROCESS) the task does not need to be
>  		 * the actual group leader. tsk->sighand gives
> @@ -76,13 +75,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>  		 * reference on it and store the task pointer until the
>  		 * timer is destroyed.

Btw, this comment is wrong since

     55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task
  2020-04-26 17:22           ` [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task Oleg Nesterov
@ 2020-04-27 11:51             ` Eric W. Biederman
  2020-04-28 18:03               ` Oleg Nesterov
  0 siblings, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-27 11:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds, Thomas Gleixner,
	Paul E. McKenney

Oleg Nesterov <oleg@redhat.com> writes:

> Eric,
>
> I am sick today and can't read the code, but I feel this patch is not
> right ... please correct me.


> So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD
> is false we roughly have
>
> 	task = pid_task(pid, PIDTYPE_TGID);			// lookup_task()
>
> 	/* WINDOW */
>
> 	timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID)	// posix_cpu_timer_create()
>
> Now suppose that we race with mt-exec and this "task" is the old leader;
> it can be release_task()'ed in the WINDOW above and then get_task_pid()
> will return NULL.

Except it is asking for PIDTYPE_TGID.

task->signal even if it is freed (which it won't be in a mt-exec)
is valid until after an rcu window.

release_task()
   put_task_struct_rcu_user()
      call_rcu(..., delayed_put_task_struct())
... rcu delay ...
delayed_put_task_struct()
   put_task_struct()
      __put_task_struct()
         put_signal_struct()
            free_signal_struct()

Which means that task->signal->pids[PIDTYPE_TGID] will remain valid even
across mt-exec.

Further the only change I have introduced is to perform this work under
rcu_read_lock vs taking a reference to task_struct.  As the reference to
task_struct does not prevent release_task, the situation with respect
to races in the rest of the code does not change.

Hmm....

If the case instead is:
> 	timer->it.cpu.pid = get_task_pid(task, PIDTYPE_PID)	// posix_cpu_timer_create()

Which can also happen for threads in the same thread group.
I have to agree that we can wind up with a NULL pid.

And that is a brand new bug, because we didn't use to use pids.
Sigh.


> That is why I suggested to change lookup_task() to return "struct pid*"
> to eliminate the pid -> task -> pid transition.

Yes.  I have to agree.  Getting rid of the pid -> task -> pid transition
looks important to close bugs like that.

> Apart from the same_thread_group() check for the "thread" case we do not
> need task_struct at all, lookup_task() can do
>
> 	if (thread) {
> 		p = pid_task(pid, PIDTYPE_PID);
> 		if (p && !same_thread_group(p, current))
> 			pid = NULL;
> 	} else {
> 		... gettime check ...
>
> 		if (!pid_has_task(pid, PIDTYPE_TGID))
> 			pid = NULL;
> 	}
>
> 	return pid;
>
> No?

There is also the posix_cpu_clock_get, where we immediately use the
clock instead of create something we can use later.

I want to say the gettime case is another reason to go through the whole
transition but the code can just as easily say "pid = task_tgid(current)"
as it can "p = current";

Eric


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 1/6] posix-cpu-timers: Always call __get_task_for_clock holding rcu_read_lock
  2020-04-27  9:43           ` [PATCH v3 1/6] posix-cpu-timers: Always call __get_task_for_clock holding rcu_read_lock Thomas Gleixner
@ 2020-04-27 11:53             ` Eric W. Biederman
  0 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-27 11:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds, Oleg Nesterov,
	Paul E. McKenney

Thomas Gleixner <tglx@linutronix.de> writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> This allows the getref flag to be removed and the callers can
>> than take a task reference if needed.
>
> That changelog lacks any form of information why this should be
> changed. I can see the point vs. patch 2, but pretty please put coherent
> explanations into each patch.

Well excess flags bad.  But in this case I can do even better.

The code no longer takes a reference on task_struct so this patch
removes unnecessary code.

I will see if I can say that better.



>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/time/posix-cpu-timers.c | 41 +++++++++++++++++-----------------
>>  1 file changed, 21 insertions(+), 20 deletions(-)
>>
>> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
>> index 2fd3b3fa68bf..eba41c70f0f0 100644
>> --- a/kernel/time/posix-cpu-timers.c
>> +++ b/kernel/time/posix-cpu-timers.c
>> @@ -86,36 +86,34 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>>  }
>>  
>>  static struct task_struct *__get_task_for_clock(const clockid_t clock,
>> -						bool getref, bool gettime)
>> +						bool gettime)
>>  {
>>  	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
>>  	const pid_t pid = CPUCLOCK_PID(clock);
>> -	struct task_struct *p;
>>  
>>  	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
>>  		return NULL;
>>  
>> -	rcu_read_lock();
>> -	p = lookup_task(pid, thread, gettime);
>> -	if (p && getref)
>> -		get_task_struct(p);
>> -	rcu_read_unlock();
>> -	return p;
>> +	return lookup_task(pid, thread, gettime);
>>  }
>>  
>>  static inline struct task_struct *get_task_for_clock(const clockid_t clock)
>>  {
>> -	return __get_task_for_clock(clock, true, false);
>> +	return __get_task_for_clock(clock, false);
>>  }
>>  
>>  static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
>>  {
>> -	return __get_task_for_clock(clock, true, true);
>> +	return __get_task_for_clock(clock, true);
>>  }
>>  
>>  static inline int validate_clock_permissions(const clockid_t clock)
>>  {
>> -	return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL;
>> +	int ret;
>
> New line between declarations and code please.
>
>> +	rcu_read_lock();
>> +	ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
>> +	rcu_read_unlock();
>> +	return ret;
>>  }
>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu
  2020-04-26 17:40           ` [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu Linus Torvalds
@ 2020-04-27 14:28             ` Eric W. Biederman
  2020-04-27 20:27               ` Linus Torvalds
  0 siblings, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-27 14:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> To support this add hlist_swap_before_rcu.  An hlist primitive that
>> will allow swaping the leading sections of two tasks.  For exchanging
>> the task pids it will just be swapping the hlist_heads of two single
>> entry lists.  But the functionality is more general.
>
> So I have no problems with the series now - the code is much more
> understandable. Partly because of the split-up, partly because of the
> comments, and partly because you explained the special case and why it
> was a valid thing to do...
>
> However, I did start thinking about this case again.
>
> I still don't think the "swap entry" macro is necessarily useful in
> _general_ - any time it's an actual individual entry, that swap macro
> doesn't really work.

But it isn't a "swap entry" macro/function.  I did not even attempt
to make it a "swap entry" function.

I made a chop two lists into two and swap the pieces function.

> So the only reason it works here is because you're actually swapping
> the whole list.
>
> But that, in turn, shouldn't be using that "first node" model at all,
> it should use the hlist_head. That would have made it a lot more
> obvious what is actually going on to me.
>
> Now, the comment very much talks about the head case, but the code
> still looks like it's swapping a non-head thing.
>
> I guess the code technically _works_ with "swap two list ends", but
> does that actually really make sense as an operation?

As an operation yes.  Will anyone else want that operation I don't know.

> So I no longer hate how this patch looks, but I wonder if we should
> just make the whole "this node is the *first* node" a bit more
> explicit in both the caller and in the swapping code.
>
> It could be as simple as replacing just the conceptual types and
> names, so instead of some "pnode1" double-indirect node pointer, we'd
> have
>
>         struct hlist_head *left_head = container_of(left->pprev,
> struct hlist_head, first);
>         struct hlist_head *right_head = container_of(right->pprev,
> struct hlist_head, first);
>
> and then the code would do
>
>         rcu_assign_pointer(right_head->first, left);
>         rcu_assign_pointer(left_head->first, right);
>         WRITE_ONCE(left->pprev,  &right_head->first);
>         WRITE_ONCE(right->pprev,  &left_head->first);
>
> which should generate the exact same code, but makes it clear that
> what we're doing is switching the whole hlist when given the first
> entries.
>
> Doesn't that make what it actually does a lot more understandable?

Understandable is a bit subjective. I think having a well defined hlist
operation I can call makes things more understandable.

I think the getting the list head as:
"head = &task->thread_pid->tasks[PIDTYPE_PID];" is more understandable
and less risky than container_of.

My concern and probably unreasonbable as this is a slow path
with getting the list heads after looking up the pid is that it seems
to add a wait for an additional cache line to load before anything can
happen.

The only way I really know to make this code much more understandable is
to remove the lists entirely for this case.  But that is a much larger
change and it is not clear that it makes the kernel code overall better.
I stared at that for a while and it is an interesting follow on but not
something I want or we even can do before exchange_tids is in place.

> The
> *pnode1/pnode2 games are somewhat opaque, but with that type and name
> change and using "container_of()", the code now fairly naturally reads
> as "oh, we're changing the first pointers in the list heads, and
> making the nodes point back to them" .
>
> Again - the current function _works_ with swapping two hlists in the
> middle (not two entries - it swaps the whole list starting at that
> entry!), so your current patch is in some ways "more generic". I'm
> just suggesting that the generic case doesn't make much sense, and
> that the "we know the first entries, swap the lists" actually is what
> the real use is, and writing it as such makes the code easier to
> understand.

Yep.  That is waht I designed it to do.  I sort of went the other
direction when writing this.  I could start with the list heads and swap
the rest of the lists and get the same code.  But it looked like it
would be a little slower to find the hlist_heads, and I couldn't think
of a good name for the function.  So I figured if I was writing a
fucntion for this case I would write one that was convinient.

For understandability that is my real challenge what is a good name
that people can read and understand what is happening for this swapping
function.

> But I'm not going to insist on this, so this is more an RFC. Maybe
> people disagree, and/or have an actual use case for that "break two
> hlists in the middle, swap the ends" that I find unlikely...
>
> (NOTE: My "convert to hlist_head" code _works_ for that case too
> because the code generation is the same! But it would be really really
> confusing for that code to be used for anything but the first entry).

Yes.

I am open to improvements.  Especially in the naming.

Would hlists_swap_heads_rcu be noticably better?

Eric


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task
  2020-04-27 10:32           ` Thomas Gleixner
@ 2020-04-27 19:46             ` Eric W. Biederman
  0 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-27 19:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds, Oleg Nesterov,
	Paul E. McKenney

Thomas Gleixner <tglx@linutronix.de> writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>> Using pid_task(find_vpid(N), PIDTYPE_TGID) guarantees that if a task
>> is found it is at that moment the thread group leader.  Which removes
>> the need for the follow on test has_group_leader_pid.
>>
>> I have reorganized the rest of the code in lookup_task for clarity,
>> and created a common return for most of the code.
>
> Sorry, it's way harder to read than the very explicit exits which were
> there before.

My biggest gripe is the gettime and the ordinary !thread case should
be sharing code and they are not.  I know historically why they don't
but for all practical purposes has_group_leader_pid and
thread_group_leader are the same test.


>> The special case for clock_gettime with "pid == gettid" is not my
>> favorite.  I strongly suspect it isn't used as gettid is such a pain,
>> and passing 0 is much easier.  Still it is easier to keep this special
>> case than to do the reasarch that will show it isn't used.
>
> It might be not your favorite, but when I refactored the code I learned
> the hard way that one of the test suites has assumptions that
> clock_gettime(PROCESS) works from any task of a group and not just for
> the group leader. Sure we could fix the test suite, but test code tends
> to be copied ...

Do you know which test suite?
It would be nice to see such surprising code with my own eyes.

I completely agree that clock_gettime(PROCESS) should work for any task
of a group.  I would think anything with a constant like that would just
be passing in 0, which is trivial.  Looking up your threadid seems like
extra work.

Mostly my complaint is that the gettime subcase is an awkward special
case.  Added in 33ab0fec3352 ("posix-timers: Consolidate
posix_cpu_clock_get()") and the only justification for changing the
userspace ABI was that it made things less awkward to combine to
branches of code.

>>  /*
>>   * Functions for validating access to tasks.
>>   */
>> -static struct task_struct *lookup_task(const pid_t pid, bool thread,
>> +static struct task_struct *lookup_task(const pid_t which_pid, bool thread,
>>  				       bool gettime)
>>  {
>>  	struct task_struct *p;
>> +	struct pid *pid;
>>  
>>  	/*
>>  	 * If the encoded PID is 0, then the timer is targeted at current
>>  	 * or the process to which current belongs.
>>  	 */
>> -	if (!pid)
>> +	if (!which_pid)
>>  		return thread ? current : current->group_leader;
>>  
>> -	p = find_task_by_vpid(pid);
>> -	if (!p)
>> -		return p;
>> -
>> -	if (thread)
>> -		return same_thread_group(p, current) ? p : NULL;
>> -
>> -	if (gettime) {
>> +	pid = find_vpid(which_pid);
>> +	if (thread) {
>> +		p = pid_task(pid, PIDTYPE_PID);
>> +		if (p && !same_thread_group(p, current))
>> +			p = NULL;
>> +	} else {
>>  		/*
>>  		 * For clock_gettime(PROCESS) the task does not need to be
>>  		 * the actual group leader. tsk->sighand gives
>> @@ -76,13 +75,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>>  		 * reference on it and store the task pointer until the
>>  		 * timer is destroyed.
>
> Btw, this comment is wrong since
>
>      55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")

It is definitely stale.  It continues to describe the motive for
limiting ourselves to a thread_group_leader.

I am cooking up a patch to tweak that comment, and get replace of
has_group_leader_pid.  Since it is unnecessary I just want to decouple
that work from this patchset.

Eric



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3] proc: Ensure we see the exit of each process tid exactly
       [not found]         ` <87r1w8ete7.fsf@x220.int.ebiederm.org>
@ 2020-04-27 20:23           ` Eric W. Biederman
  0 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-27 20:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds, Thomas Gleixner

ebiederm@xmission.com (Eric W. Biederman) writes:
> Take a good look over the code base and see if I can see any more
> issues like was found in next_tgid and repost the core patches.

Oleg,

I am reading through kernel/ptrace.c and I am seeing a lot of:

	spin_lock(&child->sighand->siglock);

In places where I don't see anything guaranteeing the child is stopped
such as ptrace_freeze_traced.  Are all of those places safe or
do some of the need to be transformed into lock_task_sighand in
case the process is current running?

I might just be reading the code to quickly and missing what keeps the
code from executing exec and changing sighand.

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu
  2020-04-27 14:28             ` Eric W. Biederman
@ 2020-04-27 20:27               ` Linus Torvalds
  2020-04-28 12:16                 ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
  0 siblings, 1 reply; 54+ messages in thread
From: Linus Torvalds @ 2020-04-27 20:27 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

On Mon, Apr 27, 2020 at 7:32 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Would hlists_swap_heads_rcu be noticably better?

Edited out most of the rest because I think we're in agreement and
it's not a huge deal.

And yes, I think it might be nice to just call it "swap_heads()" and
make the naming match what the user wants, so that people who don't
care about the implementation can just guess from the function name.

But I also think that by now it's mostly bikeshedding, and I'm
probably also now biased by the fact that I have looked at that code
and read your explanations so it's all fresh in my mind.

A year from now, when I've forgotten the details, who knows which part
I'd find confusing? ;)

               Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-27 20:27               ` Linus Torvalds
@ 2020-04-28 12:16                 ` Eric W. Biederman
  2020-04-28 12:18                   ` [PATCH v4 1/2] rculist: Add hlists_swap_heads_rcu Eric W. Biederman
                                     ` (3 more replies)
  0 siblings, 4 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 12:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney


In the work to remove proc_mnt I noticed that we were calling
proc_flush_task now proc_flush_pid possibly multiple times for the same
pid because of how de_thread works.

This is a bare minimal patchset to sort out de_thread, by introducing
exchange_tids and the helper of exchange_tids hlists_swap_heads_rcu.

The actual call of exchange_tids should be slowpath so I have
prioritized readability over getting every last drop of performance.

I have also read through a bunch of the code to see if I could find
anything that would be affected by this change.  Users of
has_group_leader_pid were a good canidates.  But I also looked at other
cases that might have a pid->task->pid transition.  I ignored other
sources of races with de_thread and exec as those are preexisting.

I found a close call with send_signals user of task_active_pid_ns, but
all pids of a thread group are guaranteeds to be in the same pid
namespace so there is not a problem.

I found a few pieces of debugging code that do:

	task = pid_task(pid, PIDTYPE_PID);
        if (task) {
        	printk("%u\n", task->pid);
        }

But I can't see how we care if it happens at the wrong moment that
task->pid might not match pid_nr(pid);

Similarly because the code in posix-cpu-timers goes pid->task->pid it
feels like there should be a problem.  But as the code that works with
PIDTYPE_PID is only available within the thread group, and as de_thread
kills all of the other threads before it makes any changes of this
kind the race can not happen.

In short I don't think this change will introduce any regressions.

Eric W. Biederman (2):
      rculist: Add hlists_swap_heads_rcu
      proc: Ensure we see the exit of each process tid exactly once

 fs/exec.c               |  5 +----
 include/linux/pid.h     |  1 +
 include/linux/rculist.h | 21 +++++++++++++++++++++
 kernel/pid.c            | 19 +++++++++++++++++++
 4 files changed, 42 insertions(+), 4 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v4 1/2] rculist: Add hlists_swap_heads_rcu
  2020-04-28 12:16                 ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
@ 2020-04-28 12:18                   ` Eric W. Biederman
  2020-04-28 12:19                   ` [PATCH v4 2/2] proc: Ensure we see the exit of each process tid exactly once Eric W. Biederman
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 12:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney


Using the struct pid to refer to two tasks in de_thread was a clever
idea and ultimately too clever, as it has lead to proc_flush_task
being called inconsistently.

To support rectifying this add hlists_swap_heads_rcu.  An hlist
primitive that just swaps the hlist heads of two lists.  This is
exactly what is needed for exchanging the pids of two tasks.

Only consideration of correctness of the code has been given,
as the caller is expected to be a slowpath.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/rculist.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 8214cdc715f2..67867e0d4cec 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -506,6 +506,27 @@ static inline void hlist_replace_rcu(struct hlist_node *old,
 	WRITE_ONCE(old->pprev, LIST_POISON2);
 }
 
+/**
+ * hlists_swap_heads_rcu - swap the lists the hlist heads point to
+ * @left:  The hlist head on the left
+ * @right: The hlist head on the right
+ *
+ * The lists start out as [@left  ][node1 ... ] and
+                          [@right ][node2 ... ]
+ * The lists end up as    [@left  ][node2 ... ]
+ *                        [@right ][node1 ... ]
+ */
+static inline void hlists_swap_heads_rcu(struct hlist_head *left, struct hlist_head *right)
+{
+	struct hlist_node *node1 = left->first;
+	struct hlist_node *node2 = right->first;
+
+	rcu_assign_pointer(left->first, node2);
+	rcu_assign_pointer(right->first, node1);
+	WRITE_ONCE(node2->pprev, &left->first);
+	WRITE_ONCE(node1->pprev, &right->first);
+}
+
 /*
  * return the first or the next element in an RCU protected hlist
  */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 2/2] proc: Ensure we see the exit of each process tid exactly once
  2020-04-28 12:16                 ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
  2020-04-28 12:18                   ` [PATCH v4 1/2] rculist: Add hlists_swap_heads_rcu Eric W. Biederman
@ 2020-04-28 12:19                   ` Eric W. Biederman
  2020-04-28 16:53                   ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Linus Torvalds
  2020-04-28 18:05                   ` Oleg Nesterov
  3 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 12:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney


When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
   has_group_leader_pid is called, and reording some of the assignments
   should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
>	/*
>	 * This can only happen if the caller is de_thread().
>	 * FIXME: this is the temporary hack, we should teach
>	 * posix-cpu-timers to handle this case correctly.
>	 */
>	if (unlikely(has_group_leader_pid(tsk)))
>		posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/exec.c           |  5 +----
 include/linux/pid.h |  1 +
 kernel/pid.c        | 19 +++++++++++++++++++
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..9b60f927afd7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,11 +1186,8 @@ static int de_thread(struct task_struct *tsk)
 
 		/* Become a process group leader with the old leader's pid.
 		 * The old leader becomes a thread of the this thread group.
-		 * Note: The old leader also uses this pid until release_task
-		 *       is called.  Odd but simple and correct.
 		 */
-		tsk->pid = leader->pid;
-		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		exchange_tids(tsk, leader);
 		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..2159ffca63fc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -102,6 +102,7 @@ extern void attach_pid(struct task_struct *task, enum pid_type);
 extern void detach_pid(struct task_struct *task, enum pid_type);
 extern void change_pid(struct task_struct *task, enum pid_type,
 			struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..6d5d0a5bda82 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -363,6 +363,25 @@ void change_pid(struct task_struct *task, enum pid_type type,
 	attach_pid(task, type);
 }
 
+void exchange_tids(struct task_struct *left, struct task_struct *right)
+{
+	struct pid *pid1 = left->thread_pid;
+	struct pid *pid2 = right->thread_pid;
+	struct hlist_head *head1 = &pid1->tasks[PIDTYPE_PID];
+	struct hlist_head *head2 = &pid2->tasks[PIDTYPE_PID];
+
+	/* Swap the single entry tid lists */
+	hlists_swap_heads_rcu(head1, head2);
+
+	/* Swap the per task_struct pid */
+	rcu_assign_pointer(left->thread_pid, pid2);
+	rcu_assign_pointer(right->thread_pid, pid1);
+
+	/* Swap the cached value */
+	WRITE_ONCE(left->pid, pid_nr(pid2));
+	WRITE_ONCE(right->pid, pid_nr(pid1));
+}
+
 /* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-28 12:16                 ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
  2020-04-28 12:18                   ` [PATCH v4 1/2] rculist: Add hlists_swap_heads_rcu Eric W. Biederman
  2020-04-28 12:19                   ` [PATCH v4 2/2] proc: Ensure we see the exit of each process tid exactly once Eric W. Biederman
@ 2020-04-28 16:53                   ` Linus Torvalds
  2020-04-28 17:55                     ` Eric W. Biederman
  2020-04-28 18:55                     ` Eric W. Biederman
  2020-04-28 18:05                   ` Oleg Nesterov
  3 siblings, 2 replies; 54+ messages in thread
From: Linus Torvalds @ 2020-04-28 16:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

On Tue, Apr 28, 2020 at 5:20 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> In short I don't think this change will introduce any regressions.

I think the series looks fine, but I also think the long explanation
(that I snipped in this reply) in the cover letter should be there in
the kernel tree.

So if you send me this as a single pull request, with that explanation
(either in the email or in the signed tag - although you don't seem to
use tags normally - so that we have that extra commentary for
posterity, that sounds good.

That said, this fix seems to not matter for normal operation, so
unless it's holding up something important, maybe it's 5.8 material?

                Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-28 16:53                   ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Linus Torvalds
@ 2020-04-28 17:55                     ` Eric W. Biederman
  2020-04-28 18:55                     ` Eric W. Biederman
  1 sibling, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 17:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Apr 28, 2020 at 5:20 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> In short I don't think this change will introduce any regressions.
>
> I think the series looks fine, but I also think the long explanation
> (that I snipped in this reply) in the cover letter should be there in
> the kernel tree.

When I have been adding patchsets like this to my tree I have been doing
merge --no-ff so I can create a place for explanations like this, and I
will do the same with this.

I already have Alexey Gladkov's proc changes, and my next_tgid cleanup
on a branch of proc changes in my tree already.

> So if you send me this as a single pull request, with that explanation
> (either in the email or in the signed tag - although you don't seem to
> use tags normally - so that we have that extra commentary for
> posterity, that sounds good.

I hope you don't mind if I combind this with some other proc changes.
If you do mind I will put this on a separate topic branch.

Right now it just seems easier for me to keep track of if I keep my
number of topics limited.

> That said, this fix seems to not matter for normal operation, so
> unless it's holding up something important, maybe it's 5.8 material?

Yes, this is 5.8 material.

I am just aiming to get review before I put in linux-next, and later
send it to your for merging.  I should have mentioned that in the cover
letter.

I am noticing that removing technical debt without adding more technical
debt is quite a challenge.

Eric



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task
  2020-04-27 11:51             ` Eric W. Biederman
@ 2020-04-28 18:03               ` Oleg Nesterov
  0 siblings, 0 replies; 54+ messages in thread
From: Oleg Nesterov @ 2020-04-28 18:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Linus Torvalds, Thomas Gleixner,
	Paul E. McKenney

On 04/27, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Eric,
> >
> > I am sick today and can't read the code, but I feel this patch is not
> > right ... please correct me.
>
>
> > So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD
> > is false we roughly have
> >
> > 	task = pid_task(pid, PIDTYPE_TGID);			// lookup_task()
> >
> > 	/* WINDOW */
> >
> > 	timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID)	// posix_cpu_timer_create()
> >
> > Now suppose that we race with mt-exec and this "task" is the old leader;
> > it can be release_task()'ed in the WINDOW above and then get_task_pid()
> > will return NULL.
>
> Except it is asking for PIDTYPE_TGID.
>
> task->signal

Ah yes, I knew I missed something...

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-28 12:16                 ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
                                     ` (2 preceding siblings ...)
  2020-04-28 16:53                   ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Linus Torvalds
@ 2020-04-28 18:05                   ` Oleg Nesterov
  2020-04-28 18:54                     ` Eric W. Biederman
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
  3 siblings, 2 replies; 54+ messages in thread
From: Oleg Nesterov @ 2020-04-28 18:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, LKML, Linux FS Devel, Alexey Dobriyan,
	Alexey Gladkov, Andrew Morton, Alexey Gladkov, Thomas Gleixner,
	Paul E. McKenney

On 04/28, Eric W. Biederman wrote:
>
> In short I don't think this change will introduce any regressions.
>
> Eric W. Biederman (2):
>       rculist: Add hlists_swap_heads_rcu
>       proc: Ensure we see the exit of each process tid exactly once

Eric, sorry, I got lost.

Both changes look good to me, feel free to add my ack, but I presume
this is on top of next_tgid/lookup_task changes ? If yes, why did not
you remove has_group_leader_pid?

Oleg.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-28 18:05                   ` Oleg Nesterov
@ 2020-04-28 18:54                     ` Eric W. Biederman
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
  1 sibling, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 18:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, LKML, Linux FS Devel, Alexey Dobriyan,
	Alexey Gladkov, Andrew Morton, Alexey Gladkov, Thomas Gleixner,
	Paul E. McKenney

Oleg Nesterov <oleg@redhat.com> writes:

> On 04/28, Eric W. Biederman wrote:
>>
>> In short I don't think this change will introduce any regressions.
>>
>> Eric W. Biederman (2):
>>       rculist: Add hlists_swap_heads_rcu
>>       proc: Ensure we see the exit of each process tid exactly once
>
> Eric, sorry, I got lost.
>
> Both changes look good to me, feel free to add my ack, but I presume
> this is on top of next_tgid/lookup_task changes ? If yes, why did not
> you remove has_group_leader_pid?

On top of next_tgid.

Upon a close examination there are not any current bugs in
posix-cpu-timers nor is there anything that exchange_tids
will make worse.

I am preparing a follow on patchset to kill has_group_leader_pid.

I am preparing a further follow on patchset to see if I can get that
code to start returning pids, because that is cheaper and clearer.


I pushed those changes a little farther out so I could maintain focus on
what I am accomplishing.

Adding exchange_tids was difficult because I had to audit pretty much
all of the pid use in the kernel to see if the small change in behavior
would make anything worse.   The rest of the changes should be simpler
and more localized so I hope they go faster.

Eric


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-28 16:53                   ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Linus Torvalds
  2020-04-28 17:55                     ` Eric W. Biederman
@ 2020-04-28 18:55                     ` Eric W. Biederman
  2020-04-28 19:36                       ` Linus Torvalds
  1 sibling, 1 reply; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Apr 28, 2020 at 5:20 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> In short I don't think this change will introduce any regressions.
>
> I think the series looks fine.

Mind if I translate that into

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
on the patches?

Eric

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly
  2020-04-28 18:55                     ` Eric W. Biederman
@ 2020-04-28 19:36                       ` Linus Torvalds
  0 siblings, 0 replies; 54+ messages in thread
From: Linus Torvalds @ 2020-04-28 19:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: LKML, Linux FS Devel, Alexey Dobriyan, Alexey Gladkov,
	Andrew Morton, Alexey Gladkov, Oleg Nesterov, Thomas Gleixner,
	Paul E. McKenney

On Tue, Apr 28, 2020 at 11:59 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> >
> > I think the series looks fine.
>
> Mind if I translate that into
>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> on the patches?

Sure, go right ahead.

            Linus

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v1 0/4] signal: Removing has_group_leader_pid
  2020-04-28 18:05                   ` Oleg Nesterov
  2020-04-28 18:54                     ` Eric W. Biederman
@ 2020-04-28 21:39                     ` Eric W. Biederman
  2020-04-28 21:45                       ` [PATCH v1 1/4] posix-cpu-timer: Tidy up group_leader logic in lookup_task Eric W. Biederman
                                         ` (4 more replies)
  1 sibling, 5 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 21:39 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


With de_thread now calling exchange_tids has_group_leader_pid no longer
makes any sense and is equivalent to calling thread_group_leader.

As there are only 2 remaining users of has_group_leader_pid let's
update the code and get rid of has_group_leader_pid.

There is one extra patch to lookup_task that performs that unifies
to code paths that become identical when has_group_leader_pid went
away.

Eric W. Biederman (4):
      posix-cpu-timer: Tidy up group_leader logic in lookup_task
      posix-cpu-timer:  Unify the now redundant code in lookup_task
      exec: Remove BUG_ON(has_group_leader_pid)
      signal: Remove has_group_leader_pid

 fs/exec.c                      |  1 -
 include/linux/sched/signal.h   | 11 -----------
 kernel/time/posix-cpu-timers.c | 21 ++++++++-------------
 3 files changed, 8 insertions(+), 25 deletions(-)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v1 1/4] posix-cpu-timer: Tidy up group_leader logic in lookup_task
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
@ 2020-04-28 21:45                       ` Eric W. Biederman
  2020-04-28 21:48                       ` [PATCH 2/4] posix-cpu-timer: Unify the now redundant code " Eric W. Biederman
                                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 21:45 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


Replace has_group_leader_pid with thread_group_leader.  Years ago Oleg
suggested changing thread_group_leader to has_group_leader_pid to handle
races.  Looking at the code then and now I don't see how it ever helped.
Especially as then the code really did need to be the
thread_group_leader.

Today it doesn't make a difference if thread_group_leader races with
de_thread as the task returned from lookup_task in the non-thread case
is just used to find values in task->signal.

Since the races with de_thread have never been handled revert
has_group_header_pid to thread_group_leader for clarity.

Update the comment in lookup_task to remove implementation details that
are no longer true and to mention task->signal instead of task->sighand,
as the relevant cpu timer details are all in task->signal.

Ref: 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")
Ref: c0deae8c9587 ("posix-cpu-timers: Rcu_read_lock/unlock protect find_task_by_vpid call")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/time/posix-cpu-timers.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 2fd3b3fa68bf..e4051e417bcb 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -69,12 +69,8 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
 	if (gettime) {
 		/*
 		 * For clock_gettime(PROCESS) the task does not need to be
-		 * the actual group leader. tsk->sighand gives
+		 * the actual group leader. task->signal gives
 		 * access to the group's clock.
-		 *
-		 * Timers need the group leader because they take a
-		 * reference on it and store the task pointer until the
-		 * timer is destroyed.
 		 */
 		return (p == current || thread_group_leader(p)) ? p : NULL;
 	}
@@ -82,7 +78,7 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
 	/*
 	 * For processes require that p is group leader.
 	 */
-	return has_group_leader_pid(p) ? p : NULL;
+	return thread_group_leader(p) ? p : NULL;
 }
 
 static struct task_struct *__get_task_for_clock(const clockid_t clock,
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH 2/4] posix-cpu-timer: Unify the now redundant code in lookup_task
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
  2020-04-28 21:45                       ` [PATCH v1 1/4] posix-cpu-timer: Tidy up group_leader logic in lookup_task Eric W. Biederman
@ 2020-04-28 21:48                       ` Eric W. Biederman
  2020-04-28 21:53                       ` [PATCH v1 3/4] exec: Remove BUG_ON(has_group_leader_pid) Eric W. Biederman
                                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 21:48 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


Now that both !thread paths through lookup_task call
thread_group_leader, unify them into the single test at the end of
lookup_task.
    
This unification just makes it clear what is happening in the gettime
special case of lookup_task.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/time/posix-cpu-timers.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index e4051e417bcb..b7f972fb115e 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -66,14 +66,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
 	if (thread)
 		return same_thread_group(p, current) ? p : NULL;
 
-	if (gettime) {
-		/*
-		 * For clock_gettime(PROCESS) the task does not need to be
-		 * the actual group leader. task->signal gives
-		 * access to the group's clock.
-		 */
-		return (p == current || thread_group_leader(p)) ? p : NULL;
-	}
+	/*
+	 * For clock_gettime(PROCESS) the task does not need to be
+	 * the actual group leader. task->signal gives
+	 * access to the group's clock.
+	 */
+	if (gettime && (p == current))
+		return p;
 
 	/*
 	 * For processes require that p is group leader.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v1 3/4] exec: Remove BUG_ON(has_group_leader_pid)
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
  2020-04-28 21:45                       ` [PATCH v1 1/4] posix-cpu-timer: Tidy up group_leader logic in lookup_task Eric W. Biederman
  2020-04-28 21:48                       ` [PATCH 2/4] posix-cpu-timer: Unify the now redundant code " Eric W. Biederman
@ 2020-04-28 21:53                       ` Eric W. Biederman
  2020-04-28 21:56                       ` [PATCH v4 4/4] signal: Remove has_group_leader_pid Eric W. Biederman
  2020-04-30 11:54                       ` [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup Eric W. Biederman
  4 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 21:53 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


With the introduction of exchange_tids thread_group_leader and
has_group_leader_pid have become equivalent.  Further at this point in the
code a thread group has exactly two threads, the previous thread_group_leader
that is waiting to be reaped and tsk.  So we know it is impossible for tsk to
be the thread_group_leader.

This is also the last user of has_group_leader_pid so removing this check
will allow has_group_leader_pid to be removed.

So remove the "BUG_ON(has_group_leader_pid)" that will never fire.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 9b60f927afd7..6ab1c19d84fa 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1176,7 +1176,6 @@ static int de_thread(struct task_struct *tsk)
 		tsk->start_boottime = leader->start_boottime;
 
 		BUG_ON(!same_thread_group(leader, tsk));
-		BUG_ON(has_group_leader_pid(tsk));
 		/*
 		 * An exec() starts a new thread group with the
 		 * TGID of the previous thread group. Rehash the
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 4/4] signal: Remove has_group_leader_pid
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
                                         ` (2 preceding siblings ...)
  2020-04-28 21:53                       ` [PATCH v1 3/4] exec: Remove BUG_ON(has_group_leader_pid) Eric W. Biederman
@ 2020-04-28 21:56                       ` Eric W. Biederman
  2020-04-30 11:54                       ` [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup Eric W. Biederman
  4 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-28 21:56 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


After the introduction of exchange_tids has_group_leader_pid is
equivalent to thread_group_leader.  After the last couple of cleanups
has_group_leader_pid has no more callers.

So remove the now unused and redundant has_group_leader_pid.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/sched/signal.h | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 3e5b090c16d4..0ee5e696c5d8 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -654,17 +654,6 @@ static inline bool thread_group_leader(struct task_struct *p)
 	return p->exit_signal >= 0;
 }
 
-/* Do to the insanities of de_thread it is possible for a process
- * to have the pid of the thread group leader without actually being
- * the thread group leader.  For iteration through the pids in proc
- * all we care about is that we have a task with the appropriate
- * pid, we don't actually care if we have the right task.
- */
-static inline bool has_group_leader_pid(struct task_struct *p)
-{
-	return task_pid(p) == task_tgid(p);
-}
-
 static inline
 bool same_thread_group(struct task_struct *p1, struct task_struct *p2)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup
  2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
                                         ` (3 preceding siblings ...)
  2020-04-28 21:56                       ` [PATCH v4 4/4] signal: Remove has_group_leader_pid Eric W. Biederman
@ 2020-04-30 11:54                       ` Eric W. Biederman
  2020-04-30 11:55                         ` [PATCH v1 1/3] posix-cpu-timers: Extend rcu_read_lock removing task_struct references Eric W. Biederman
                                           ` (2 more replies)
  4 siblings, 3 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-30 11:54 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


The current posix-cpu-timer code uses pids when holding persistent
references in timers.  However the lookups from clockid_t still return
tasks that need to be converted into pids for use.

This results in usage being pid->task->pid and that can race with
release_task and de_thread.  This can lead to some not wrong but
surprising results.  Surprising enough that Oleg and I both thought
there were some bugs in the code for a while.

This set of changes modifies the code to just lookup, verify, and return
pids from the clockid_t lookups to remove those potentialy troublesome
races.

Eric W. Biederman (3):
      posix-cpu-timers: Extend rcu_read_lock removing task_struct references
      posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
      posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock

 kernel/time/posix-cpu-timers.c | 102 ++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 57 deletions(-)

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

---

The changes can also be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git proc-testing


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v1 1/3] posix-cpu-timers: Extend rcu_read_lock removing task_struct references
  2020-04-30 11:54                       ` [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup Eric W. Biederman
@ 2020-04-30 11:55                         ` Eric W. Biederman
  2020-04-30 11:56                         ` [PATCH v1 2/3] posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type Eric W. Biederman
  2020-04-30 11:56                         ` [PATCH v1 3/3] posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock Eric W. Biederman
  2 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-30 11:55 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


Now that the code stores of pid references it is no longer necessary
or desirable to take a reference on task_struct in __get_task_for_clock.

Instead extend the scope of rcu_read_lock and remove the reference
counting on struct task_struct entirely.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/time/posix-cpu-timers.c | 43 ++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index b7f972fb115e..91996dd089a4 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -81,36 +81,36 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
 }
 
 static struct task_struct *__get_task_for_clock(const clockid_t clock,
-						bool getref, bool gettime)
+						bool gettime)
 {
 	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
 	const pid_t pid = CPUCLOCK_PID(clock);
-	struct task_struct *p;
 
 	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
 		return NULL;
 
-	rcu_read_lock();
-	p = lookup_task(pid, thread, gettime);
-	if (p && getref)
-		get_task_struct(p);
-	rcu_read_unlock();
-	return p;
+	return lookup_task(pid, thread, gettime);
 }
 
 static inline struct task_struct *get_task_for_clock(const clockid_t clock)
 {
-	return __get_task_for_clock(clock, true, false);
+	return __get_task_for_clock(clock, false);
 }
 
 static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
 {
-	return __get_task_for_clock(clock, true, true);
+	return __get_task_for_clock(clock, true);
 }
 
 static inline int validate_clock_permissions(const clockid_t clock)
 {
-	return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL;
+	int ret;
+
+	rcu_read_lock();
+	ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static inline enum pid_type cpu_timer_pid_type(struct k_itimer *timer)
@@ -368,15 +368,18 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
 	struct task_struct *tsk;
 	u64 t;
 
+	rcu_read_lock();
 	tsk = get_task_for_clock_get(clock);
-	if (!tsk)
+	if (!tsk) {
+		rcu_read_unlock();
 		return -EINVAL;
+	}
 
 	if (CPUCLOCK_PERTHREAD(clock))
 		t = cpu_clock_sample(clkid, tsk);
 	else
 		t = cpu_clock_sample_group(clkid, tsk, false);
-	put_task_struct(tsk);
+	rcu_read_unlock();
 
 	*tp = ns_to_timespec64(t);
 	return 0;
@@ -389,19 +392,19 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
  */
 static int posix_cpu_timer_create(struct k_itimer *new_timer)
 {
-	struct task_struct *p = get_task_for_clock(new_timer->it_clock);
+	struct task_struct *p;
 
-	if (!p)
+	rcu_read_lock();
+	p = get_task_for_clock(new_timer->it_clock);
+	if (!p) {
+		rcu_read_unlock();
 		return -EINVAL;
+	}
 
 	new_timer->kclock = &clock_posix_cpu;
 	timerqueue_init(&new_timer->it.cpu.node);
 	new_timer->it.cpu.pid = get_task_pid(p, cpu_timer_pid_type(new_timer));
-	/*
-	 * get_task_for_clock() took a reference on @p. Drop it as the timer
-	 * holds a reference on the pid of @p.
-	 */
-	put_task_struct(p);
+	rcu_read_unlock();
 	return 0;
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v1 2/3] posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
  2020-04-30 11:54                       ` [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup Eric W. Biederman
  2020-04-30 11:55                         ` [PATCH v1 1/3] posix-cpu-timers: Extend rcu_read_lock removing task_struct references Eric W. Biederman
@ 2020-04-30 11:56                         ` Eric W. Biederman
  2020-04-30 11:56                         ` [PATCH v1 3/3] posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock Eric W. Biederman
  2 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-30 11:56 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


Taking a clock and returning a pid_type is a more general and
a superset of taking a timer and returning a pid_type.

Perform this generalization so that future changes may use
this code on clocks as well as timers.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/time/posix-cpu-timers.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 91996dd089a4..42f673974d71 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -113,14 +113,14 @@ static inline int validate_clock_permissions(const clockid_t clock)
 	return ret;
 }
 
-static inline enum pid_type cpu_timer_pid_type(struct k_itimer *timer)
+static inline enum pid_type clock_pid_type(const clockid_t clock)
 {
-	return CPUCLOCK_PERTHREAD(timer->it_clock) ? PIDTYPE_PID : PIDTYPE_TGID;
+	return CPUCLOCK_PERTHREAD(clock) ? PIDTYPE_PID : PIDTYPE_TGID;
 }
 
 static inline struct task_struct *cpu_timer_task_rcu(struct k_itimer *timer)
 {
-	return pid_task(timer->it.cpu.pid, cpu_timer_pid_type(timer));
+	return pid_task(timer->it.cpu.pid, clock_pid_type(timer->it_clock));
 }
 
 /*
@@ -403,7 +403,7 @@ static int posix_cpu_timer_create(struct k_itimer *new_timer)
 
 	new_timer->kclock = &clock_posix_cpu;
 	timerqueue_init(&new_timer->it.cpu.node);
-	new_timer->it.cpu.pid = get_task_pid(p, cpu_timer_pid_type(new_timer));
+	new_timer->it.cpu.pid = get_task_pid(p, clock_pid_type(new_timer->it_clock));
 	rcu_read_unlock();
 	return 0;
 }
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v1 3/3] posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
  2020-04-30 11:54                       ` [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup Eric W. Biederman
  2020-04-30 11:55                         ` [PATCH v1 1/3] posix-cpu-timers: Extend rcu_read_lock removing task_struct references Eric W. Biederman
  2020-04-30 11:56                         ` [PATCH v1 2/3] posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type Eric W. Biederman
@ 2020-04-30 11:56                         ` Eric W. Biederman
  2 siblings, 0 replies; 54+ messages in thread
From: Eric W. Biederman @ 2020-04-30 11:56 UTC (permalink / raw)
  To: LKML; +Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner


Now that the codes store references to pids instead of referendes to
tasks.  Looking up a task for a clock instead of looking up a struct
pid makes the code more difficult to verify it is correct than
necessary.

In posix_cpu_timers_create get_task_pid can race with release_task for
threads and return a NULL pid.  As put_pid and cpu_timer_task_rcu
handle NULL pids just fine the code works without problems but it is
an extra case to consider and keep in mind while verifying and
modifying the code.

There are races with de_thread to consider that only don't apply
because thread clocks are only allowed for threads in the same
thread_group.

So instead of leaving a burden for people making modification to the
code in the future return a rcu protected struct pid for the clock
instead.

The logic for __get_task_for_pid and lookup_task has been folded into
the new function pid_for_clock with the only change being the logic
has been modified from working on a task to working on a pid that
will be returned.

In posix_cpu_clock_get instead of calling pid_for_clock checking the
result and then calling pid_task to get the task.  The result of
pid_for_clock is fed directly into pid_task.  This is safe because
pid_task handles NULL pids.  As such an extra error check was
unnecessary.

Instead of hiding the flag that enables the special clock_gettime
handling, I have made the 3 callers just pass the flag in themselves.
That is less code and seems just as simple to work with as the
wrapper functions.

Historically the clock_gettime special case of allowing a process
clock to be found by the thread id did not even exist [33ab0fec3352]
but Thomas Gleixner reports that he has found code that uses that
functionality [55e8c8eb2c7b].

Link: https://lkml.kernel.org/r/87zhaxqkwa.fsf@nanos.tec.linutronix.de/
Ref: 33ab0fec3352 ("posix-timers: Consolidate posix_cpu_clock_get()")
Ref: 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/time/posix-cpu-timers.c | 75 ++++++++++++++--------------------
 1 file changed, 30 insertions(+), 45 deletions(-)

diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 42f673974d71..165117996ea0 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -47,59 +47,44 @@ void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new)
 /*
  * Functions for validating access to tasks.
  */
-static struct task_struct *lookup_task(const pid_t pid, bool thread,
-				       bool gettime)
+static struct pid *pid_for_clock(const clockid_t clock, bool gettime)
 {
-	struct task_struct *p;
+	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
+	const pid_t upid = CPUCLOCK_PID(clock);
+	struct pid *pid;
+
+	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
+		return NULL;
 
 	/*
 	 * If the encoded PID is 0, then the timer is targeted at current
 	 * or the process to which current belongs.
 	 */
-	if (!pid)
-		return thread ? current : current->group_leader;
+	if (upid == 0)
+		return thread ? task_pid(current) : task_tgid(current);
 
-	p = find_task_by_vpid(pid);
-	if (!p)
-		return p;
+	pid = find_vpid(upid);
+	if (!pid)
+		return NULL;
 
-	if (thread)
-		return same_thread_group(p, current) ? p : NULL;
+	if (thread) {
+		struct task_struct *tsk = pid_task(pid, PIDTYPE_PID);
+		return (tsk && same_thread_group(tsk, current)) ? pid : NULL;
+	}
 
 	/*
-	 * For clock_gettime(PROCESS) the task does not need to be
-	 * the actual group leader. task->signal gives
-	 * access to the group's clock.
+	 * For clock_gettime(PROCESS) allow finding the process by
+	 * with the pid of the current task.  The code needs the tgid
+	 * of the process so that pid_task(pid, PIDTYPE_TGID) can be
+	 * used to find the process.
 	 */
-	if (gettime && (p == current))
-		return p;
+	if (gettime && (pid == task_pid(current)))
+		return task_tgid(current);
 
 	/*
-	 * For processes require that p is group leader.
+	 * For processes require that pid identifies a process.
 	 */
-	return thread_group_leader(p) ? p : NULL;
-}
-
-static struct task_struct *__get_task_for_clock(const clockid_t clock,
-						bool gettime)
-{
-	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
-	const pid_t pid = CPUCLOCK_PID(clock);
-
-	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
-		return NULL;
-
-	return lookup_task(pid, thread, gettime);
-}
-
-static inline struct task_struct *get_task_for_clock(const clockid_t clock)
-{
-	return __get_task_for_clock(clock, false);
-}
-
-static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
-{
-	return __get_task_for_clock(clock, true);
+	return pid_has_task(pid, PIDTYPE_TGID) ? pid : NULL;
 }
 
 static inline int validate_clock_permissions(const clockid_t clock)
@@ -107,7 +92,7 @@ static inline int validate_clock_permissions(const clockid_t clock)
 	int ret;
 
 	rcu_read_lock();
-	ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
+	ret = pid_for_clock(clock, false) ? 0 : -EINVAL;
 	rcu_read_unlock();
 
 	return ret;
@@ -369,7 +354,7 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
 	u64 t;
 
 	rcu_read_lock();
-	tsk = get_task_for_clock_get(clock);
+	tsk = pid_task(pid_for_clock(clock, true), clock_pid_type(clock));
 	if (!tsk) {
 		rcu_read_unlock();
 		return -EINVAL;
@@ -392,18 +377,18 @@ static int posix_cpu_clock_get(const clockid_t clock, struct timespec64 *tp)
  */
 static int posix_cpu_timer_create(struct k_itimer *new_timer)
 {
-	struct task_struct *p;
+	struct pid *pid;
 
 	rcu_read_lock();
-	p = get_task_for_clock(new_timer->it_clock);
-	if (!p) {
+	pid = pid_for_clock(new_timer->it_clock, false);
+	if (!pid) {
 		rcu_read_unlock();
 		return -EINVAL;
 	}
 
 	new_timer->kclock = &clock_posix_cpu;
 	timerqueue_init(&new_timer->it.cpu.node);
-	new_timer->it.cpu.pid = get_task_pid(p, clock_pid_type(new_timer->it_clock));
+	new_timer->it.cpu.pid = get_pid(pid);
 	rcu_read_unlock();
 	return 0;
 }
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2020-04-30 12:00 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-19 14:10 [PATCH v12 0/7] proc: modernize proc to support multiple private instances Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 1/7] proc: rename struct proc_fs_info to proc_fs_opts Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 2/7] proc: allow to mount many instances of proc in one pid namespace Alexey Gladkov
2020-04-23 11:28   ` [PATCH v13 " Alexey Gladkov
2020-04-23 12:16     ` Eric W. Biederman
2020-04-23 20:01       ` Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 3/7] proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 4/7] proc: add option to mount only a pids subset Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 5/7] docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 6/7] proc: use human-readable values for hidepid Alexey Gladkov
2020-04-19 14:10 ` [PATCH v12 7/7] proc: use named enums for better readability Alexey Gladkov
     [not found] ` <87ftcv1nqe.fsf@x220.int.ebiederm.org>
2020-04-23 17:54   ` [PATCH v2 0/2] proc: Calling proc_flush_task exactly once per task Oleg Nesterov
2020-04-23 19:38     ` Eric W. Biederman
2020-04-23 19:39   ` [PATCH v2 1/2] proc: Use PIDTYPE_TGID in next_tgid Eric W. Biederman
2020-04-24 17:29     ` Oleg Nesterov
2020-04-23 19:39   ` [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
2020-04-23 20:28     ` Linus Torvalds
2020-04-24  3:33       ` Eric W. Biederman
2020-04-24 18:02         ` Linus Torvalds
2020-04-24 18:46           ` Linus Torvalds
2020-04-24 19:51           ` Eric W. Biederman
2020-04-24 20:10             ` Linus Torvalds
2020-04-24 17:39     ` Oleg Nesterov
2020-04-24 18:10       ` Eric W. Biederman
2020-04-24 20:50       ` [PATCH] proc: Put thread_pid in release_task not proc_flush_pid Eric W. Biederman
     [not found]       ` <87mu6ymkea.fsf_-_@x220.int.ebiederm.org>
     [not found]         ` <87blnemj5t.fsf_-_@x220.int.ebiederm.org>
2020-04-26 17:22           ` [PATCH v3 2/6] posix-cpu-timers: Use PIDTYPE_TGID to simplify the logic in lookup_task Oleg Nesterov
2020-04-27 11:51             ` Eric W. Biederman
2020-04-28 18:03               ` Oleg Nesterov
2020-04-27 10:32           ` Thomas Gleixner
2020-04-27 19:46             ` Eric W. Biederman
     [not found]         ` <875zdmmj4y.fsf_-_@x220.int.ebiederm.org>
2020-04-26 17:40           ` [PATCH v3 3/6] rculist: Add hlist_swap_before_rcu Linus Torvalds
2020-04-27 14:28             ` Eric W. Biederman
2020-04-27 20:27               ` Linus Torvalds
2020-04-28 12:16                 ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman
2020-04-28 12:18                   ` [PATCH v4 1/2] rculist: Add hlists_swap_heads_rcu Eric W. Biederman
2020-04-28 12:19                   ` [PATCH v4 2/2] proc: Ensure we see the exit of each process tid exactly once Eric W. Biederman
2020-04-28 16:53                   ` [PATCH v4 0/2] proc: Ensure we see the exit of each process tid exactly Linus Torvalds
2020-04-28 17:55                     ` Eric W. Biederman
2020-04-28 18:55                     ` Eric W. Biederman
2020-04-28 19:36                       ` Linus Torvalds
2020-04-28 18:05                   ` Oleg Nesterov
2020-04-28 18:54                     ` Eric W. Biederman
2020-04-28 21:39                     ` [PATCH v1 0/4] signal: Removing has_group_leader_pid Eric W. Biederman
2020-04-28 21:45                       ` [PATCH v1 1/4] posix-cpu-timer: Tidy up group_leader logic in lookup_task Eric W. Biederman
2020-04-28 21:48                       ` [PATCH 2/4] posix-cpu-timer: Unify the now redundant code " Eric W. Biederman
2020-04-28 21:53                       ` [PATCH v1 3/4] exec: Remove BUG_ON(has_group_leader_pid) Eric W. Biederman
2020-04-28 21:56                       ` [PATCH v4 4/4] signal: Remove has_group_leader_pid Eric W. Biederman
2020-04-30 11:54                       ` [PATCH v1 0/3] posix-cpu-timers: Use pids not tasks in lookup Eric W. Biederman
2020-04-30 11:55                         ` [PATCH v1 1/3] posix-cpu-timers: Extend rcu_read_lock removing task_struct references Eric W. Biederman
2020-04-30 11:56                         ` [PATCH v1 2/3] posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type Eric W. Biederman
2020-04-30 11:56                         ` [PATCH v1 3/3] posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock Eric W. Biederman
     [not found]         ` <87h7x6mj6h.fsf_-_@x220.int.ebiederm.org>
2020-04-27  9:43           ` [PATCH v3 1/6] posix-cpu-timers: Always call __get_task_for_clock holding rcu_read_lock Thomas Gleixner
2020-04-27 11:53             ` Eric W. Biederman
     [not found]         ` <87r1w8ete7.fsf@x220.int.ebiederm.org>
2020-04-27 20:23           ` [PATCH v3] proc: Ensure we see the exit of each process tid exactly Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).