linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
@ 2020-12-03 18:31 Wen Yang
  2020-12-03 18:31 ` [PATCH 01/10] clone: add CLONE_PIDFD Wen Yang
                   ` (10 more replies)
  0 siblings, 11 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Wen Yang, Pavel Emelyanov,
	Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage,
	Eric W. Biederman, stable

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 12247 bytes --]

The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they 
should be deleted when the process exits. 

Suppose the following race appears: 

release_task                 dput 
-> proc_flush_task 
                             -> dentry->d_op->d_delete(dentry) 
-> __exit_signal 
                             -> dentry->d_lockref.count--  and return. 

In the proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal(pid has not been hashed), d_delete returns false, so
this dentry still cannot be deleted.

This dentry will always be cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
these dentries can only be deleted when drop_caches is manually triggered.

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc/<pid>/ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
   proc_flush_task and __exit_signal in release_task() [2]

The test process is as follows:

A, terminal #1

Turn on the debug switch:
echo 1> /proc/sys/vm/dentry_debug_trace

Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash


B, terminal #2

Find the pid of the unshare process:

# pstree -p | grep unshare
           | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)


Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8


C, terminal #3

Execute the opendir program, it will always open the /proc/818/ns/ directory:

# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup

D, go back to terminal #2

Turn on the debugging switches to construct the race:
# echo 818> /proc/sys/vm/dentry_debug_pid
# echo 1> /proc/sys/vm/dentry_debug_delay

Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818

Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876

Then turn off these debugging switches to allow the 818 process to exit:
# echo 0> /proc/sys/vm/dentry_debug_delay
# echo 0> /proc/sys/vm/dentry_debug_pid

Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still
cached:
# dmesg | grep ffff8802a3999548
…
[565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached


It could also be verified via the crash tool:

crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  ffff8802bea7b528
  d_flags = 0x2800cc
  d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk"
  d_inode = 0xffff8802b38c2010
  d_lockref = {
    {
      lock_count = 0x0,
      {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  },
                  {
                    locked = 0x0,
                    pending = 0x0
                  },
                  {
                    locked_pending = 0x0,
                    tail = 0x0
                  }
                }
              }
            }
          }
        },
        count = 0x0
      }
    }
  }
crash> kmem  ffff8802bea7b528
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
ffff8802dd5f5900      192      23663     26130    871    16k  dentry
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  ffffea000afa9e00  ffff8802bea78000     0     30         25     5
  FREE / [ALLOCATED]
  [ffff8802bea7b520]

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea000afa9ec0 2bea7b000 dead000000000400        0  0 2fffff80000000
crash>

This series of patches is to fix this issue.

Regards,
Wen

Alexey Dobriyan (1):
  proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
  proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
  clone: add CLONE_PIDFD

Eric W. Biederman (6):
  proc: Better ownership of files for non-dumpable tasks in user
    namespaces
  proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
  proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
  proc: Clear the pieces of proc_inode that proc_evict_inode cares about
  proc: Use d_invalidate in proc_prune_siblings_dcache
  proc: Use a list of inodes to flush from proc

Joel Fernandes (Google) (1):
  pidfd: add polling support

 fs/proc/base.c             | 242 ++++++++++++++++++++-------------------------
 fs/proc/fd.c               |  20 +---
 fs/proc/inode.c            |  67 ++++++++++++-
 fs/proc/internal.h         |  22 ++---
 fs/proc/namespaces.c       |   3 +-
 fs/proc/proc_sysctl.c      |  45 ++-------
 fs/proc/self.c             |   6 +-
 fs/proc/thread_self.c      |   5 +-
 include/linux/pid.h        |   5 +
 include/linux/proc_fs.h    |   4 +-
 include/uapi/linux/sched.h |   1 +
 kernel/exit.c              |   5 +-
 kernel/fork.c              | 145 ++++++++++++++++++++++++++-
 kernel/pid.c               |   3 +
 kernel/signal.c            |  11 +++
 security/selinux/hooks.c   |   1 +
 16 files changed, 357 insertions(+), 228 deletions(-)

[1] A test program to open the directory (/proc/<pid>/ns)
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <errno.h>

int main(int argc, char *argv[])
{
	DIR *dip;
	struct dirent *dit;

	if (argc < 2) {
		printf("Usage :%s <directory>\n", argv[0]);
		return -1;
	}

	if ((dip = opendir(argv[1])) == NULL) {
		perror("opendir");
		return -1;
	}

	printf("pid: %d\n", getpid());
	while((dit = readdir (dip)) != NULL) {
		printf("%s\n", dit->d_name);
	}

	while (1)
		sleep (1);

	return 0;
}

[2] Adding some debugging switches to the kernel, also adding a delay between
    proc_flush_task and __exit_signal in release_task():

diff --git a/fs/dcache.c b/fs/dcache.c
index 05bad55..fafad37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -84,6 +84,9 @@
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

+int sysctl_dentry_debug_trace __read_mostly = 0;
+EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
+
 __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

 EXPORT_SYMBOL(rename_lock);
@@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry)
 	return 0;
 }

+#define DENTRY_DEBUG_TRACE(dentry, keywords)                            \
+do {                                                                    \
+	if (sysctl_dentry_debug_trace)                                   \
+		printk("XXX %s:%d "                                      \
+                	"dentry=%s/%p, flag=%x, cnt=%d, inode=%p, "      \
+                	"pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, "  \
+			"keywords: %s\n",                                \
+			__func__, __LINE__,                              \
+			dentry->d_name.name,                             \
+			dentry,                                          \
+			dentry->d_flags,                                 \
+			dentry->d_lockref.count,                         \
+			dentry->d_inode,                                 \
+			dentry->d_parent->d_name.name,                   \
+			dentry->d_parent,                                \
+			dentry->d_parent->d_flags,                       \
+			dentry->d_parent->d_lockref.count,               \
+			dentry->d_parent->d_inode,                       \
+			keywords);                                       \
+} while (0)

 /*
  * This is dput
@@ -804,6 +827,8 @@ void dput(struct dentry *dentry)

 	WARN_ON(d_in_lookup(dentry));

+	DENTRY_DEBUG_TRACE(dentry, "be checked");
+
 	/* Unreachable? Get rid of it */
 	if (unlikely(d_unhashed(dentry)))
 		goto kill_it;
@@ -812,8 +837,10 @@ void dput(struct dentry *dentry)
 		goto kill_it;

 	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
-		if (dentry->d_op->d_delete(dentry))
+		if (dentry->d_op->d_delete(dentry)) {
+			DENTRY_DEBUG_TRACE(dentry, "be killed");
 			goto kill_it;
+		}
 	}

 	if (!(dentry->d_flags & DCACHE_REFERENCED))
@@ -822,6 +849,9 @@ void dput(struct dentry *dentry)

 	dentry->d_lockref.count--;
 	spin_unlock(&dentry->d_lock);
+
+	DENTRY_DEBUG_TRACE(dentry, "be cached");
+
 	return;

 kill_it:
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b9e4183..419a409 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task)
 	}
 }

+extern int sysctl_dentry_debug_trace;
+
 static int proc_pid_instantiate(struct inode *dir,
 				   struct dentry * dentry,
 				   struct task_struct *task, const void *ptr)
@@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
 	d_set_d_op(dentry, &pid_dentry_operations);

 	d_add(dentry, inode);
+
+	if (sysctl_dentry_debug_trace)
+		printk("XXX %s:%d pid=%d tid=%d  entry=%s/%p\n",
+			__func__, __LINE__, task->pid, task->tgid,
+			dentry->d_name.name, dentry);
+
 	/* Close the race of the process dying before we return the dentry */
 	if (pid_revalidate(dentry, 0))
 		return 0;
diff --git a/kernel/exit.c b/kernel/exit.c
index 27f4168..2b3e1b6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -55,6 +55,8 @@
 #include <linux/shm.h>
 #include <linux/kcov.h>

+#include <linux/delay.h>
+
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/pgtable.h>
@@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }

+int sysctl_dentry_debug_delay __read_mostly = 0;
+int sysctl_dentry_debug_pid __read_mostly = 0;

 void release_task(struct task_struct *p)
 {
@@ -178,6 +182,11 @@ void release_task(struct task_struct *p)

 	proc_flush_task(p);

+	if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
+		while (sysctl_dentry_debug_delay)
+			mdelay(1);
+	}
+
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
 	__exit_signal(p);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 513e6da..27f1395 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 static int max_extfrag_threshold = 1000;
 #endif

+extern int sysctl_dentry_debug_trace;
+extern int sysctl_dentry_debug_delay;
+extern int sysctl_dentry_debug_pid;
+
 static struct ctl_table kern_table[] = {
 	{
 		.procname	= "sched_child_runs_first",
@@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "dentry_debug_trace",
+		.data		= &sysctl_dentry_debug_trace,
+		.maxlen		= sizeof(sysctl_dentry_debug_trace),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "dentry_debug_delay",
+		.data		= &sysctl_dentry_debug_delay,
+		.maxlen		= sizeof(sysctl_dentry_debug_delay),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "dentry_debug_pid",
+		.data		= &sysctl_dentry_debug_pid,
+		.maxlen		= sizeof(sysctl_dentry_debug_pid),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
 #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
 	{
 		.procname	= "legacy_va_layout",


Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 01/10] clone: add CLONE_PIDFD
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
@ 2020-12-03 18:31 ` Wen Yang
  2021-01-04 13:03   ` Greg Kroah-Hartman
  2020-12-03 18:31 ` [PATCH 02/10] pidfd: add polling support Wen Yang
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Christian Brauner, Linus Torvalds,
	Jann Horn, Oleg Nesterov, Arnd Bergmann, Eric W. Biederman,
	Kees Cook, Thomas Gleixner, David Howells,
	Michael Kerrisk (man-pages),
	Andy Lutomirsky, Andrew Morton, Aleksa Sarai, Al Viro, stable,
	Wen Yang

From: Christian Brauner <christian@brauner.io>

[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> # 4.9.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 include/linux/pid.h        |   1 +
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c              | 119 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 117 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 97b745d..7599a78 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,7 @@ struct pid_link
 	struct hlist_node node;
 	struct pid *pid;
 };
+extern const struct file_operations pidfd_fops;
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..ed6e31d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
 #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
 #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
 #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
 #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index b64efec..076297a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,7 +11,22 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include <linux/anon_inodes.h>
 #include <linux/slab.h>
+#if 0
+#include <linux/sched/autogroup.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/coredump.h>
+#include <linux/sched/user.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/sched/stat.h>
+#include <linux/sched/task.h>
+#include <linux/sched/task_stack.h>
+#include <linux/sched/cputime.h>
+#include <linux/seq_file.h>
+#include <linux/rtmutex.h>
+>>>>>>> b3e58382... clone: add CLONE_PIDFD
+#endif
 #include <linux/init.h>
 #include <linux/unistd.h>
 #include <linux/module.h>
@@ -1460,6 +1475,58 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
 	 task->pids[type].pid = pid;
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid_namespace *ns = file_inode(m->file)->i_sb->s_fs_info;
+	struct pid *pid = f->private_data;
+
+	seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+	seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid:  struct pid that the pidfd will reference
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set.
+ *
+ * Note, that this function can only be called after the fd table has
+ * been unshared to avoid leaking the pidfd to the new process.
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ *         On error, a negative errno number will be returned.
+ */
+static int pidfd_create(struct pid *pid)
+{
+	int fd;
+
+	fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
+			      O_RDWR | O_CLOEXEC);
+	if (fd < 0)
+		put_pid(pid);
+
+	return fd;
+}
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1472,13 +1539,14 @@ static __latent_entropy struct task_struct *copy_process(
 					unsigned long clone_flags,
 					unsigned long stack_start,
 					unsigned long stack_size,
+					int __user *parent_tidptr,
 					int __user *child_tidptr,
 					struct pid *pid,
 					int trace,
 					unsigned long tls,
 					int node)
 {
-	int retval;
+	int pidfd = -1, retval;
 	struct task_struct *p;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -1526,6 +1594,30 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = security_task_create(clone_flags);
 	if (retval)
 		goto fork_out;
+	if (clone_flags & CLONE_PIDFD) {
+		int reserved;
+
+		/*
+		 * - CLONE_PARENT_SETTID is useless for pidfds and also
+		 *   parent_tidptr is used to return pidfds.
+		 * - CLONE_DETACHED is blocked so that we can potentially
+		 *   reuse it later for CLONE_PIDFD.
+		 * - CLONE_THREAD is blocked until someone really needs it.
+		 */
+		if (clone_flags &
+		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+			return ERR_PTR(-EINVAL);
+
+		/*
+		 * Verify that parent_tidptr is sane so we can potentially
+		 * reuse it later.
+		 */
+		if (get_user(reserved, parent_tidptr))
+			return ERR_PTR(-EFAULT);
+
+		if (reserved != 0)
+			return ERR_PTR(-EINVAL);
+	}
 
 	retval = -ENOMEM;
 	p = dup_task_struct(current, node);
@@ -1703,6 +1795,22 @@ static __latent_entropy struct task_struct *copy_process(
 		}
 	}
 
+	/*
+	 * This has to happen after we've potentially unshared the file
+	 * descriptor table (so that the pidfd doesn't leak into the child
+	 * if the fd table isn't shared).
+	 */
+	if (clone_flags & CLONE_PIDFD) {
+		retval = pidfd_create(pid);
+		if (retval < 0)
+			goto bad_fork_free_pid;
+
+		pidfd = retval;
+		retval = put_user(pidfd, parent_tidptr);
+		if (retval)
+			goto bad_fork_put_pidfd;
+	}
+
 #ifdef CONFIG_BLOCK
 	p->plug = NULL;
 #endif
@@ -1758,7 +1866,7 @@ static __latent_entropy struct task_struct *copy_process(
 	 */
 	retval = cgroup_can_fork(p);
 	if (retval)
-		goto bad_fork_free_pid;
+		goto bad_fork_put_pidfd;
 
 	/*
 	 * From this point on we must avoid any synchronous user-space
@@ -1869,6 +1977,9 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	cgroup_cancel_fork(p);
+bad_fork_put_pidfd:
+	if (clone_flags & CLONE_PIDFD)
+		__close_fd(current->files, pidfd);
 bad_fork_free_pid:
 	threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
@@ -1928,7 +2039,7 @@ static inline void init_idle_pids(struct pid_link *links)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0,
+	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
 			    cpu_to_node(cpu));
 	if (!IS_ERR(task)) {
 		init_idle_pids(task->pids);
@@ -1973,7 +2084,7 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size,
+	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
 			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
 	add_latent_entropy();
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 02/10] pidfd: add polling support
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
  2020-12-03 18:31 ` [PATCH 01/10] clone: add CLONE_PIDFD Wen Yang
@ 2020-12-03 18:31 ` Wen Yang
  2020-12-03 18:31 ` [PATCH 03/10] proc: Pass file mode to proc_pid_make_inode Wen Yang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Joel Fernandes (Google),
	Andy Lutomirski, Steven Rostedt, Daniel Colascione, Jann Horn,
	Tim Murray, Jonathan Kowalski, Linus Torvalds, Al Viro,
	Kees Cook, David Howells, Oleg Nesterov, kernel-team,
	Christian Brauner, stable, Wen Yang

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>

[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, since the wrong PID is now possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exists in race-free and fast way, and allows the LMK to do other things
(such as by polling on other fds) while awaiting the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be option in this case because the task can
   be reaped and destroyed before the poll returns.

2. By including the struct pid for the waitqueue means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@android.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Co-developed-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: <stable@vger.kernel.org> # 4.9.x: b3e5838: clone: add CLONE_PIDFD
Cc: <stable@vger.kernel.org> # 4.9.x
(pidfd: fix up cherry-pick conflicts for b53b0b9d9a61)
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 include/linux/pid.h |  3 +++
 kernel/fork.c       | 26 ++++++++++++++++++++++++++
 kernel/pid.c        |  2 ++
 kernel/signal.c     | 11 +++++++++++
 4 files changed, 42 insertions(+)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7599a78..f5552ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -2,6 +2,7 @@
 #define _LINUX_PID_H
 
 #include <linux/rcupdate.h>
+#include <linux/wait.h>
 
 enum pid_type
 {
@@ -62,6 +63,8 @@ struct pid
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
+	/* wait queue for pidfd notifications */
+	wait_queue_head_t wait_pidfd;
 	struct rcu_head rcu;
 	struct upid numbers[1];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 076297a..ac57f91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1495,8 +1495,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 }
 #endif
 
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct task_struct *task;
+	struct pid *pid = file->private_data;
+	int poll_flags = 0;
+
+	poll_wait(file, &pid->wait_pidfd, pts);
+
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	/*
+	 * Inform pollers only when the whole thread group exits.
+	 * If the thread group leader exits before all other threads in the
+	 * group, then poll(2) should block, similar to the wait(2) family.
+	 */
+	if (!task || (task->exit_state && thread_group_empty(task)))
+		poll_flags = POLLIN | POLLRDNORM;
+	rcu_read_unlock();
+
+	return poll_flags;
+}
+
 const struct file_operations pidfd_fops = {
 	.release = pidfd_release,
+	.poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
 	.show_fdinfo = pidfd_show_fdinfo,
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index fa704f8..e605398 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -333,6 +333,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	for (type = 0; type < PIDTYPE_MAX; ++type)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
 
+	init_waitqueue_head(&pid->wait_pidfd);
+
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
 	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index bedca16..053de87a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1632,6 +1632,14 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
 	return ret;
 }
 
+static void do_notify_pidfd(struct task_struct *task)
+{
+	struct pid *pid;
+
+	pid = task_pid(task);
+	wake_up_all(&pid->wait_pidfd);
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1655,6 +1663,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
+	/* Wake up all pidfd waiters */
+	do_notify_pidfd(tsk);
+
 	if (sig != SIGCHLD) {
 		/*
 		 * This is only possible if parent == real_parent.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 03/10] proc: Pass file mode to proc_pid_make_inode
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
  2020-12-03 18:31 ` [PATCH 01/10] clone: add CLONE_PIDFD Wen Yang
  2020-12-03 18:31 ` [PATCH 02/10] pidfd: add polling support Wen Yang
@ 2020-12-03 18:31 ` Wen Yang
  2020-12-03 18:31 ` [PATCH 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces Wen Yang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Andreas Gruenbacher, Paul Moore,
	stable, Wen Yang

From: Andreas Gruenbacher <agruenba@redhat.com>

[ Upstream commit db978da8fa1d0819b210c137d31a339149b88875 ]

Pass the file mode of the proc inode to be created to
proc_pid_make_inode.  In proc_pid_make_inode, initialize inode->i_mode
before calling security_task_to_inode.  This allows selinux to set
isec->sclass right away without introducing "half-initialized" inode
security structs.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/base.c           | 23 +++++++++--------------
 fs/proc/fd.c             |  6 ++----
 fs/proc/internal.h       |  2 +-
 fs/proc/namespaces.c     |  3 +--
 security/selinux/hooks.c |  1 +
 5 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b9e4183..ee2e0ec 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1676,7 +1676,8 @@ static int proc_pid_readlink(struct dentry * dentry, char __user * buffer, int b
 
 /* building an inode */
 
-struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task)
+struct inode *proc_pid_make_inode(struct super_block * sb,
+				  struct task_struct *task, umode_t mode)
 {
 	struct inode * inode;
 	struct proc_inode *ei;
@@ -1690,6 +1691,7 @@ struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *t
 
 	/* Common stuff */
 	ei = PROC_I(inode);
+	inode->i_mode = mode;
 	inode->i_ino = get_next_ino();
 	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
 	inode->i_op = &proc_def_inode_operations;
@@ -2041,7 +2043,9 @@ struct map_files_info {
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK |
+				    ((mode & FMODE_READ ) ? S_IRUSR : 0) |
+				    ((mode & FMODE_WRITE) ? S_IWUSR : 0));
 	if (!inode)
 		return -ENOENT;
 
@@ -2050,12 +2054,6 @@ struct map_files_info {
 
 	inode->i_op = &proc_map_files_link_inode_operations;
 	inode->i_size = 64;
-	inode->i_mode = S_IFLNK;
-
-	if (mode & FMODE_READ)
-		inode->i_mode |= S_IRUSR;
-	if (mode & FMODE_WRITE)
-		inode->i_mode |= S_IWUSR;
 
 	d_set_d_op(dentry, &tid_map_files_dentry_operations);
 	d_add(dentry, inode);
@@ -2409,12 +2407,11 @@ static int proc_pident_instantiate(struct inode *dir,
 	struct inode *inode;
 	struct proc_inode *ei;
 
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, p->mode);
 	if (!inode)
 		goto out;
 
 	ei = PROC_I(inode);
-	inode->i_mode = p->mode;
 	if (S_ISDIR(inode->i_mode))
 		set_nlink(inode, 2);	/* Use getattr to fix if necessary */
 	if (p->iop)
@@ -3096,11 +3093,10 @@ static int proc_pid_instantiate(struct inode *dir,
 {
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
 	if (!inode)
 		goto out;
 
-	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
 	inode->i_op = &proc_tgid_base_inode_operations;
 	inode->i_fop = &proc_tgid_base_operations;
 	inode->i_flags|=S_IMMUTABLE;
@@ -3391,11 +3387,10 @@ static int proc_task_instantiate(struct inode *dir,
 	struct dentry *dentry, struct task_struct *task, const void *ptr)
 {
 	struct inode *inode;
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
 
 	if (!inode)
 		goto out;
-	inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
 	inode->i_op = &proc_tid_base_inode_operations;
 	inode->i_fop = &proc_tid_base_operations;
 	inode->i_flags|=S_IMMUTABLE;
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index d21dafe..4274f83 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -183,14 +183,13 @@ static int proc_fd_link(struct dentry *dentry, struct path *path)
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK);
 	if (!inode)
 		goto out;
 
 	ei = PROC_I(inode);
 	ei->fd = fd;
 
-	inode->i_mode = S_IFLNK;
 	inode->i_op = &proc_pid_link_inode_operations;
 	inode->i_size = 64;
 
@@ -322,14 +321,13 @@ int proc_fd_permission(struct inode *inode, int mask)
 	struct proc_inode *ei;
 	struct inode *inode;
 
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, S_IFREG | S_IRUSR);
 	if (!inode)
 		goto out;
 
 	ei = PROC_I(inode);
 	ei->fd = fd;
 
-	inode->i_mode = S_IFREG | S_IRUSR;
 	inode->i_fop = &proc_fdinfo_file_operations;
 
 	d_set_d_op(dentry, &tid_fd_dentry_operations);
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c0bdece..5bc057b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -163,7 +163,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 extern const struct dentry_operations pid_dentry_operations;
 extern int pid_getattr(struct vfsmount *, struct dentry *, struct kstat *);
 extern int proc_setattr(struct dentry *, struct iattr *);
-extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *);
+extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
 extern int pid_revalidate(struct dentry *, unsigned int);
 extern int pid_delete_dentry(const struct dentry *);
 extern int proc_pid_readdir(struct file *, struct dir_context *);
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 51b8b0a..766f0c6 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -92,12 +92,11 @@ static int proc_ns_instantiate(struct inode *dir,
 	struct inode *inode;
 	struct proc_inode *ei;
 
-	inode = proc_pid_make_inode(dir->i_sb, task);
+	inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK | S_IRWXUGO);
 	if (!inode)
 		goto out;
 
 	ei = PROC_I(inode);
-	inode->i_mode = S_IFLNK|S_IRWXUGO;
 	inode->i_op = &proc_ns_link_inode_operations;
 	ei->ns_ops = ns_ops;
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index a2b63a6..98b5f40 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3967,6 +3967,7 @@ static void selinux_task_to_inode(struct task_struct *p,
 	struct inode_security_struct *isec = inode->i_security;
 	u32 sid = task_sid(p);
 
+	isec->sclass = inode_mode_to_security_class(inode->i_mode);
 	isec->sid = sid;
 	isec->initialized = LABEL_INITIALIZED;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (2 preceding siblings ...)
  2020-12-03 18:31 ` [PATCH 03/10] proc: Pass file mode to proc_pid_make_inode Wen Yang
@ 2020-12-03 18:31 ` Wen Yang
  2020-12-03 18:31 ` [PATCH 05/10] proc: use %u for pid printing and slightly less stack Wen Yang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Eric W. Biederman, stable, Wen Yang

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit 68eb94f16227336a5773b83ecfa8290f1d6b78ce ]

Instead of making the files owned by the GLOBAL_ROOT_USER.  Make
non-dumpable files whose mm has always lived in a user namespace owned
by the user namespace root.  This allows the container root to have
things work as expected in a container.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/base.c     | 102 ++++++++++++++++++++++++++++++-----------------------
 fs/proc/fd.c       |  12 +------
 fs/proc/internal.h |  16 ++-------
 3 files changed, 61 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ee2e0ec..5bfdb61 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1676,12 +1676,63 @@ static int proc_pid_readlink(struct dentry * dentry, char __user * buffer, int b
 
 /* building an inode */
 
+void task_dump_owner(struct task_struct *task, mode_t mode,
+		     kuid_t *ruid, kgid_t *rgid)
+{
+	/* Depending on the state of dumpable compute who should own a
+	 * proc file for a task.
+	 */
+	const struct cred *cred;
+	kuid_t uid;
+	kgid_t gid;
+
+	/* Default to the tasks effective ownership */
+	rcu_read_lock();
+	cred = __task_cred(task);
+	uid = cred->euid;
+	gid = cred->egid;
+	rcu_read_unlock();
+
+	/*
+	 * Before the /proc/pid/status file was created the only way to read
+	 * the effective uid of a /process was to stat /proc/pid.  Reading
+	 * /proc/pid/status is slow enough that procps and other packages
+	 * kept stating /proc/pid.  To keep the rules in /proc simple I have
+	 * made this apply to all per process world readable and executable
+	 * directories.
+	 */
+	if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) {
+		struct mm_struct *mm;
+		task_lock(task);
+		mm = task->mm;
+		/* Make non-dumpable tasks owned by some root */
+		if (mm) {
+			if (get_dumpable(mm) != SUID_DUMP_USER) {
+				struct user_namespace *user_ns = mm->user_ns;
+
+				uid = make_kuid(user_ns, 0);
+				if (!uid_valid(uid))
+					uid = GLOBAL_ROOT_UID;
+
+				gid = make_kgid(user_ns, 0);
+				if (!gid_valid(gid))
+					gid = GLOBAL_ROOT_GID;
+			}
+		} else {
+			uid = GLOBAL_ROOT_UID;
+			gid = GLOBAL_ROOT_GID;
+		}
+		task_unlock(task);
+	}
+	*ruid = uid;
+	*rgid = gid;
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
 				  struct task_struct *task, umode_t mode)
 {
 	struct inode * inode;
 	struct proc_inode *ei;
-	const struct cred *cred;
 
 	/* We need a new inode */
 
@@ -1703,13 +1754,7 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
 	if (!ei->pid)
 		goto out_unlock;
 
-	if (task_dumpable(task)) {
-		rcu_read_lock();
-		cred = __task_cred(task);
-		inode->i_uid = cred->euid;
-		inode->i_gid = cred->egid;
-		rcu_read_unlock();
-	}
+	task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
 	security_task_to_inode(task, inode);
 
 out:
@@ -1724,7 +1769,6 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 {
 	struct inode *inode = d_inode(dentry);
 	struct task_struct *task;
-	const struct cred *cred;
 	struct pid_namespace *pid = dentry->d_sb->s_fs_info;
 
 	generic_fillattr(inode, stat);
@@ -1742,12 +1786,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 			 */
 			return -ENOENT;
 		}
-		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
-		    task_dumpable(task)) {
-			cred = __task_cred(task);
-			stat->uid = cred->euid;
-			stat->gid = cred->egid;
-		}
+		task_dump_owner(task, inode->i_mode, &stat->uid, &stat->gid);
 	}
 	rcu_read_unlock();
 	return 0;
@@ -1763,18 +1802,11 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
  * Rewrite the inode's ownerships here because the owning task may have
  * performed a setuid(), etc.
  *
- * Before the /proc/pid/status file was created the only way to read
- * the effective uid of a /process was to stat /proc/pid.  Reading
- * /proc/pid/status is slow enough that procps and other packages
- * kept stating /proc/pid.  To keep the rules in /proc simple I have
- * made this apply to all per process world readable and executable
- * directories.
  */
 int pid_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct inode *inode;
 	struct task_struct *task;
-	const struct cred *cred;
 
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
@@ -1783,17 +1815,8 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
 	task = get_proc_task(inode);
 
 	if (task) {
-		if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
-		    task_dumpable(task)) {
-			rcu_read_lock();
-			cred = __task_cred(task);
-			inode->i_uid = cred->euid;
-			inode->i_gid = cred->egid;
-			rcu_read_unlock();
-		} else {
-			inode->i_uid = GLOBAL_ROOT_UID;
-			inode->i_gid = GLOBAL_ROOT_GID;
-		}
+		task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid);
+
 		inode->i_mode &= ~(S_ISUID | S_ISGID);
 		security_task_to_inode(task, inode);
 		put_task_struct(task);
@@ -1915,7 +1938,6 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	bool exact_vma_exists = false;
 	struct mm_struct *mm = NULL;
 	struct task_struct *task;
-	const struct cred *cred;
 	struct inode *inode;
 	int status = 0;
 
@@ -1940,16 +1962,8 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 	mmput(mm);
 
 	if (exact_vma_exists) {
-		if (task_dumpable(task)) {
-			rcu_read_lock();
-			cred = __task_cred(task);
-			inode->i_uid = cred->euid;
-			inode->i_gid = cred->egid;
-			rcu_read_unlock();
-		} else {
-			inode->i_uid = GLOBAL_ROOT_UID;
-			inode->i_gid = GLOBAL_ROOT_GID;
-		}
+		task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
+
 		security_task_to_inode(task, inode);
 		status = 1;
 	}
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 4274f83..00ce153 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -84,7 +84,6 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
 {
 	struct files_struct *files;
 	struct task_struct *task;
-	const struct cred *cred;
 	struct inode *inode;
 	unsigned int fd;
 
@@ -108,16 +107,7 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
 				rcu_read_unlock();
 				put_files_struct(files);
 
-				if (task_dumpable(task)) {
-					rcu_read_lock();
-					cred = __task_cred(task);
-					inode->i_uid = cred->euid;
-					inode->i_gid = cred->egid;
-					rcu_read_unlock();
-				} else {
-					inode->i_uid = GLOBAL_ROOT_UID;
-					inode->i_gid = GLOBAL_ROOT_GID;
-				}
+				task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
 
 				if (S_ISLNK(inode->i_mode)) {
 					unsigned i_mode = S_IFLNK;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 5bc057b..103435f 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -98,20 +98,8 @@ static inline struct task_struct *get_proc_task(struct inode *inode)
 	return get_pid_task(proc_pid(inode), PIDTYPE_PID);
 }
 
-static inline int task_dumpable(struct task_struct *task)
-{
-	int dumpable = 0;
-	struct mm_struct *mm;
-
-	task_lock(task);
-	mm = task->mm;
-	if (mm)
-		dumpable = get_dumpable(mm);
-	task_unlock(task);
-	if (dumpable == SUID_DUMP_USER)
-		return 1;
-	return 0;
-}
+void task_dump_owner(struct task_struct *task, mode_t mode,
+		     kuid_t *ruid, kgid_t *rgid);
 
 static inline unsigned name_to_int(const struct qstr *qstr)
 {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 05/10] proc: use %u for pid printing and slightly less stack
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (3 preceding siblings ...)
  2020-12-03 18:31 ` [PATCH 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces Wen Yang
@ 2020-12-03 18:31 ` Wen Yang
  2020-12-03 20:08   ` Alexey Dobriyan
  2020-12-03 18:32 ` [PATCH 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Wen Yang
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:31 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Alexey Dobriyan, Alexander Viro,
	Andrew Morton, Linus Torvalds, stable, Wen Yang

From: Alexey Dobriyan <adobriyan@gmail.com>

[ Upstream commit e3912ac37e07a13c70675cd75020694de4841c74 ]

PROC_NUMBUF is 13 which is enough for "negative int + \n + \0".

However PIDs and TGIDs are never negative and newline is not a concern,
so use just 10 per integer.

Link: http://lkml.kernel.org/r/20171120203005.GA27743@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexander Viro <viro@ftp.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/base.c        | 16 ++++++++--------
 fs/proc/fd.c          |  2 +-
 fs/proc/self.c        |  6 +++---
 fs/proc/thread_self.c |  5 ++---
 4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5bfdb61..3502a40 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3018,11 +3018,11 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 {
 	struct dentry *dentry, *leader, *dir;
-	char buf[PROC_NUMBUF];
+	char buf[10 + 1];
 	struct qstr name;
 
 	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%d", pid);
+	name.len = snprintf(buf, sizeof(buf), "%u", pid);
 	/* no ->d_hash() rejects on procfs */
 	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
 	if (dentry) {
@@ -3034,7 +3034,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 		return;
 
 	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%d", tgid);
+	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
 	leader = d_hash_and_lookup(mnt->mnt_root, &name);
 	if (!leader)
 		goto out;
@@ -3046,7 +3046,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 		goto out_put_leader;
 
 	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%d", pid);
+	name.len = snprintf(buf, sizeof(buf), "%u", pid);
 	dentry = d_hash_and_lookup(dir, &name);
 	if (dentry) {
 		d_invalidate(dentry);
@@ -3226,14 +3226,14 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
 	for (iter = next_tgid(ns, iter);
 	     iter.task;
 	     iter.tgid += 1, iter = next_tgid(ns, iter)) {
-		char name[PROC_NUMBUF];
+		char name[10 + 1];
 		int len;
 
 		cond_resched();
 		if (!has_pid_permissions(ns, iter.task, 2))
 			continue;
 
-		len = snprintf(name, sizeof(name), "%d", iter.tgid);
+		len = snprintf(name, sizeof(name), "%u", iter.tgid);
 		ctx->pos = iter.tgid + TGID_OFFSET;
 		if (!proc_fill_cache(file, ctx, name, len,
 				     proc_pid_instantiate, iter.task, NULL)) {
@@ -3557,10 +3557,10 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
 	for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
 	     task;
 	     task = next_tid(task), ctx->pos++) {
-		char name[PROC_NUMBUF];
+		char name[10 + 1];
 		int len;
 		tid = task_pid_nr_ns(task, ns);
-		len = snprintf(name, sizeof(name), "%d", tid);
+		len = snprintf(name, sizeof(name), "%u", tid);
 		if (!proc_fill_cache(file, ctx, name, len,
 				proc_task_instantiate, task, NULL)) {
 			/* returning this tgid failed, save it as the first
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 00ce153..390c2fe 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -235,7 +235,7 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 	for (fd = ctx->pos - 2;
 	     fd < files_fdtable(files)->max_fds;
 	     fd++, ctx->pos++) {
-		char name[PROC_NUMBUF];
+		char name[10 + 1];
 		int len;
 
 		if (!fcheck_files(files, fd))
diff --git a/fs/proc/self.c b/fs/proc/self.c
index f6e2e3f..dd06755 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -35,11 +35,11 @@ static const char *proc_self_get_link(struct dentry *dentry,
 
 	if (!tgid)
 		return ERR_PTR(-ENOENT);
-	/* 11 for max length of signed int in decimal + NULL term */
-	name = kmalloc(12, dentry ? GFP_KERNEL : GFP_ATOMIC);
+	/* max length of unsigned int in decimal + NULL term */
+	name = kmalloc(10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
 	if (unlikely(!name))
 		return dentry ? ERR_PTR(-ENOMEM) : ERR_PTR(-ECHILD);
-	sprintf(name, "%d", tgid);
+	sprintf(name, "%u", tgid);
 	set_delayed_call(done, kfree_link, name);
 	return name;
 }
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 02d1db8..44e0921 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -30,11 +30,10 @@ static const char *proc_thread_self_get_link(struct dentry *dentry,
 
 	if (!pid)
 		return ERR_PTR(-ENOENT);
-	name = kmalloc(PROC_NUMBUF + 6 + PROC_NUMBUF,
-				dentry ? GFP_KERNEL : GFP_ATOMIC);
+	name = kmalloc(10 + 6 + 10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
 	if (unlikely(!name))
 		return dentry ? ERR_PTR(-ENOMEM) : ERR_PTR(-ECHILD);
-	sprintf(name, "%d/task/%d", tgid, pid);
+	sprintf(name, "%u/task/%u", tgid, pid);
 	set_delayed_call(done, kfree_link, name);
 	return name;
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (4 preceding siblings ...)
  2020-12-03 18:31 ` [PATCH 05/10] proc: use %u for pid printing and slightly less stack Wen Yang
@ 2020-12-03 18:32 ` Wen Yang
  2020-12-03 18:32 ` [PATCH 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Wen Yang
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Eric W. Biederman, stable, Wen Yang

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit 0afa5ca82212247456f9de1468b595a111fee633 ]

I about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purporse.

Just give the field a generic name so it is clear that
it is no longer sysctl specific.

Also for good measure initialize sibling_inodes when
proc_inode is initialized.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/inode.c       | 1 +
 fs/proc/internal.h    | 2 +-
 fs/proc/proc_sysctl.c | 8 ++++----
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a289349..14d9c1d 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -67,6 +67,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
 	ei->pde = NULL;
 	ei->sysctl = NULL;
 	ei->sysctl_entry = NULL;
+	INIT_HLIST_NODE(&ei->sibling_inodes);
 	ei->ns_ops = NULL;
 	inode = &ei->vfs_inode;
 	return inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 103435f..409b5c5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -65,7 +65,7 @@ struct proc_inode {
 	struct proc_dir_entry *pde;
 	struct ctl_table_header *sysctl;
 	struct ctl_table *sysctl_entry;
-	struct hlist_node sysctl_inodes;
+	struct hlist_node sibling_inodes;
 	const struct proc_ns_operations *ns_ops;
 	struct inode vfs_inode;
 };
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 191573a..671490e 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -272,9 +272,9 @@ static void proc_sys_prune_dcache(struct ctl_table_header *head)
 		node = hlist_first_rcu(&head->inodes);
 		if (!node)
 			break;
-		ei = hlist_entry(node, struct proc_inode, sysctl_inodes);
+		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
 		spin_lock(&sysctl_lock);
-		hlist_del_init_rcu(&ei->sysctl_inodes);
+		hlist_del_init_rcu(&ei->sibling_inodes);
 		spin_unlock(&sysctl_lock);
 
 		inode = &ei->vfs_inode;
@@ -480,7 +480,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 	}
 	ei->sysctl = head;
 	ei->sysctl_entry = table;
-	hlist_add_head_rcu(&ei->sysctl_inodes, &head->inodes);
+	hlist_add_head_rcu(&ei->sibling_inodes, &head->inodes);
 	head->count++;
 	spin_unlock(&sysctl_lock);
 
@@ -511,7 +511,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
 {
 	spin_lock(&sysctl_lock);
-	hlist_del_init_rcu(&PROC_I(inode)->sysctl_inodes);
+	hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes);
 	if (!--head->count)
 		kfree_rcu(head, rcu);
 	spin_unlock(&sysctl_lock);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (5 preceding siblings ...)
  2020-12-03 18:32 ` [PATCH 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Wen Yang
@ 2020-12-03 18:32 ` Wen Yang
  2020-12-03 18:32 ` [PATCH 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about Wen Yang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Eric W. Biederman, stable, Wen Yang

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit 26dbc60f385ff9cff475ea2a3bad02e80fd6fa43 ]

This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for 26dbc60f385f)
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/inode.c       | 38 ++++++++++++++++++++++++++++++++++++++
 fs/proc/internal.h    |  1 +
 fs/proc/proc_sysctl.c | 35 +----------------------------------
 3 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 14d9c1d..920c761 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -101,6 +101,44 @@ void __init proc_init_inodecache(void)
 					     init_once);
 }
 
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+{
+	struct inode *inode;
+	struct proc_inode *ei;
+	struct hlist_node *node;
+	struct super_block *sb;
+
+	rcu_read_lock();
+	for (;;) {
+		node = hlist_first_rcu(inodes);
+		if (!node)
+			break;
+		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
+		spin_lock(lock);
+		hlist_del_init_rcu(&ei->sibling_inodes);
+		spin_unlock(lock);
+
+		inode = &ei->vfs_inode;
+		sb = inode->i_sb;
+		if (!atomic_inc_not_zero(&sb->s_active))
+			continue;
+		inode = igrab(inode);
+		rcu_read_unlock();
+		if (unlikely(!inode)) {
+			deactivate_super(sb);
+			rcu_read_lock();
+			continue;
+		}
+
+		d_prune_aliases(inode);
+		iput(inode);
+		deactivate_super(sb);
+
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+}
+
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct super_block *sb = root->d_sb;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 409b5c5..9bc44a1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,6 +200,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 671490e..f19063b 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -262,40 +262,7 @@ static void unuse_table(struct ctl_table_header *p)
 
 static void proc_sys_prune_dcache(struct ctl_table_header *head)
 {
-	struct inode *inode;
-	struct proc_inode *ei;
-	struct hlist_node *node;
-	struct super_block *sb;
-
-	rcu_read_lock();
-	for (;;) {
-		node = hlist_first_rcu(&head->inodes);
-		if (!node)
-			break;
-		ei = hlist_entry(node, struct proc_inode, sibling_inodes);
-		spin_lock(&sysctl_lock);
-		hlist_del_init_rcu(&ei->sibling_inodes);
-		spin_unlock(&sysctl_lock);
-
-		inode = &ei->vfs_inode;
-		sb = inode->i_sb;
-		if (!atomic_inc_not_zero(&sb->s_active))
-			continue;
-		inode = igrab(inode);
-		rcu_read_unlock();
-		if (unlikely(!inode)) {
-			deactivate_super(sb);
-			rcu_read_lock();
-			continue;
-		}
-
-		d_prune_aliases(inode);
-		iput(inode);
-		deactivate_super(sb);
-
-		rcu_read_lock();
-	}
-	rcu_read_unlock();
+	proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (6 preceding siblings ...)
  2020-12-03 18:32 ` [PATCH 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Wen Yang
@ 2020-12-03 18:32 ` Wen Yang
  2020-12-03 18:32 ` [PATCH 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache Wen Yang
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Eric W. Biederman, stable, Wen Yang

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]

This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
indoes when a process dies that caused me to tidy this up.

The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org> # 4.9.x
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/inode.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 920c761..739fb9c 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -32,21 +32,27 @@ static void proc_evict_inode(struct inode *inode)
 {
 	struct proc_dir_entry *de;
 	struct ctl_table_header *head;
+	struct proc_inode *ei = PROC_I(inode);
 
 	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
 
 	/* Stop tracking associated processes */
-	put_pid(PROC_I(inode)->pid);
+	if (ei->pid) {
+		put_pid(ei->pid);
+		ei->pid = NULL;
+	}
 
 	/* Let go of any associated proc directory entry */
-	de = PDE(inode);
-	if (de)
+	de = ei->pde;
+	if (de) {
 		pde_put(de);
+		ei->pde = NULL;
+	}
 
-	head = PROC_I(inode)->sysctl;
+	head = ei->sysctl;
 	if (head) {
-		RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+		RCU_INIT_POINTER(ei->sysctl, NULL);
 		proc_sys_evict_inode(inode, head);
 	}
 }
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (7 preceding siblings ...)
  2020-12-03 18:32 ` [PATCH 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about Wen Yang
@ 2020-12-03 18:32 ` Wen Yang
  2020-12-03 18:32 ` [PATCH 10/10] proc: Use a list of inodes to flush from proc Wen Yang
  2020-12-17  2:26 ` [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Eric W. Biederman, stable, Wen Yang

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]

The function d_prune_aliases has the problem that it will only prune
aliases thare are completely unused.  It will not remove aliases for
the dcache or even think of removing mounts from the dcache.  For that
behavior d_invalidate is needed.

To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.

For completeness the directory and the non-directory cases are
separated because in theory (although not in currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.

Plus the differences between d_find_any_alias and d_find_alias makes
it clear why the directory and non-directory code and not share code.

To make it clear these routines now invalidate dentries rename
proc_prune_siblings_dache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache proc_sys_invalidate_dcache.

V2: Split the directory and non-directory cases.  To make this
    code robust to future changes in proc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for f90f3cafe8d5)
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/inode.c       | 16 ++++++++++++++--
 fs/proc/internal.h    |  2 +-
 fs/proc/proc_sysctl.c |  8 ++++----
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 739fb9c..2af9f4f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -107,7 +107,7 @@ void __init proc_init_inodecache(void)
 					     init_once);
 }
 
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 {
 	struct inode *inode;
 	struct proc_inode *ei;
@@ -136,7 +136,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 			continue;
 		}
 
-		d_prune_aliases(inode);
+		if (S_ISDIR(inode->i_mode)) {
+			struct dentry *dir = d_find_any_alias(inode);
+			if (dir) {
+				d_invalidate(dir);
+				dput(dir);
+			}
+		} else {
+			struct dentry *dentry;
+			while ((dentry = d_find_alias(inode))) {
+				d_invalidate(dentry);
+				dput(dentry);
+			}
+		}
 		iput(inode);
 		deactivate_super(sb);
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9bc44a1..6a1d679 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,7 +200,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f19063b..b6668a5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -260,9 +260,9 @@ static void unuse_table(struct ctl_table_header *p)
 			complete(p->unregistering);
 }
 
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
 {
-	proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+	proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
@@ -284,10 +284,10 @@ static void start_unregistering(struct ctl_table_header *p)
 		spin_unlock(&sysctl_lock);
 	}
 	/*
-	 * Prune dentries for unregistered sysctls: namespaced sysctls
+	 * Invalidate dentries for unregistered sysctls: namespaced sysctls
 	 * can have duplicate names and contaminate dcache very badly.
 	 */
-	proc_sys_prune_dcache(p);
+	proc_sys_invalidate_dcache(p);
 	/*
 	 * do not remove from the list until nobody holds it; walking the
 	 * list in do_sysctl() relies on that.
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/10] proc: Use a list of inodes to flush from proc
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (8 preceding siblings ...)
  2020-12-03 18:32 ` [PATCH 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache Wen Yang
@ 2020-12-03 18:32 ` Wen Yang
  2020-12-17  2:26 ` [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
  10 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2020-12-03 18:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Eric W. Biederman, stable, Wen Yang

From: "Eric W. Biederman" <ebiederm@xmission.com>

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
    initialization into the initial version of the patch.  A boot
    failure was reported by "kernel test robot <lkp@intel.com>", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees")
Cc: <stable@vger.kernel.org> # 4.9.x: b3e5838: clone: add CLONE_PIDFD
Cc: <stable@vger.kernel.org> # 4.9.x: b53b0b9: pidfd: add polling support
Cc: <stable@vger.kernel.org> # 4.9.x: db978da: proc: Pass file mode to proc_pid_make_inode
Cc: <stable@vger.kernel.org> # 4.9.x: 68eb94f: proc: Better ownership of files for non-dumpable tasks in user namespaces
Cc: <stable@vger.kernel.org> # 4.9.x: e3912ac: proc: use %u for pid printing and slightly less stack
Cc: <stable@vger.kernel.org> # 4.9.x: 0afa5ca: proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
Cc: <stable@vger.kernel.org> # 4.9.x: 26dbc60: proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc: <stable@vger.kernel.org> # 4.9.x: 7144801: proc: Clear the pieces of proc_inode that proc_evict_inode cares about
Cc: <stable@vger.kernel.org> # 4.9.x: f90f3ca: Use d_invalidate in proc_prune_siblings_dcache
Cc: <stable@vger.kernel.org> # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
---
 fs/proc/base.c          | 111 ++++++++++++++++--------------------------------
 fs/proc/inode.c         |   2 +-
 fs/proc/internal.h      |   1 +
 include/linux/pid.h     |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c           |   5 ++-
 kernel/pid.c            |   1 +
 7 files changed, 45 insertions(+), 80 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t mode,
 	*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+	struct pid *pid = ei->pid;
+
+	if (S_ISDIR(ei->vfs_inode.i_mode)) {
+		spin_lock(&pid->wait_pidfd.lock);
+		hlist_del_init_rcu(&ei->sibling_inodes);
+		spin_unlock(&pid->wait_pidfd.lock);
+	}
+
+	put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
 				  struct task_struct *task, umode_t mode)
 {
 	struct inode * inode;
 	struct proc_inode *ei;
+	struct pid *pid;
 
 	/* We need a new inode */
 
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
 	/*
 	 * grab the reference to task.
 	 */
-	ei->pid = get_task_pid(task, PIDTYPE_PID);
-	if (!ei->pid)
+	pid = get_task_pid(task, PIDTYPE_PID);
+	if (!pid)
 		goto out_unlock;
 
+	/* Let the pid remember us for quick removal */
+	ei->pid = pid;
+	if (S_ISDIR(mode)) {
+		spin_lock(&pid->wait_pidfd.lock);
+		hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+		spin_unlock(&pid->wait_pidfd.lock);
+	}
+
 	task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
 	security_task_to_inode(task, inode);
 
@@ -3015,90 +3037,29 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 	.permission	= proc_pid_permission,
 };
 
-static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
-{
-	struct dentry *dentry, *leader, *dir;
-	char buf[10 + 1];
-	struct qstr name;
-
-	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%u", pid);
-	/* no ->d_hash() rejects on procfs */
-	dentry = d_hash_and_lookup(mnt->mnt_root, &name);
-	if (dentry) {
-		d_invalidate(dentry);
-		dput(dentry);
-	}
-
-	if (pid == tgid)
-		return;
-
-	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%u", tgid);
-	leader = d_hash_and_lookup(mnt->mnt_root, &name);
-	if (!leader)
-		goto out;
-
-	name.name = "task";
-	name.len = strlen(name.name);
-	dir = d_hash_and_lookup(leader, &name);
-	if (!dir)
-		goto out_put_leader;
-
-	name.name = buf;
-	name.len = snprintf(buf, sizeof(buf), "%u", pid);
-	dentry = d_hash_and_lookup(dir, &name);
-	if (dentry) {
-		d_invalidate(dentry);
-		dput(dentry);
-	}
-
-	dput(dir);
-out_put_leader:
-	dput(leader);
-out:
-	return;
-}
-
 /**
- * proc_flush_task -  Remove dcache entries for @task from the /proc dcache.
- * @task: task that should be flushed.
+ * proc_flush_pid -  Remove dcache entries for @pid from the /proc dcache.
+ * @pid: pid that should be flushed.
  *
- * When flushing dentries from proc, one needs to flush them from global
- * proc (proc_mnt) and from all the namespaces' procs this task was seen
- * in. This call is supposed to do all of this job.
- *
- * Looks in the dcache for
- * /proc/@pid
- * /proc/@tgid/task/@pid
- * if either directory is present flushes it and all of it'ts children
- * from the dcache.
+ * This function walks a list of inodes (that belong to any proc
+ * filesystem) that are attached to the pid and flushes them from
+ * the dentry cache.
  *
  * It is safe and reasonable to cache /proc entries for a task until
  * that task exits.  After that they just clog up the dcache with
  * useless entries, possibly causing useful dcache entries to be
- * flushed instead.  This routine is proved to flush those useless
- * dcache entries at process exit time.
+ * flushed instead.  This routine is provided to flush those useless
+ * dcache entries when a process is reaped.
  *
  * NOTE: This routine is just an optimization so it does not guarantee
- *       that no dcache entries will exist at process exit time it
- *       just makes it very unlikely that any will persist.
+ *       that no dcache entries will exist after a process is reaped
+ *       it just makes it very unlikely that any will persist.
  */
 
-void proc_flush_task(struct task_struct *task)
+void proc_flush_pid(struct pid *pid)
 {
-	int i;
-	struct pid *pid, *tgid;
-	struct upid *upid;
-
-	pid = task_pid(task);
-	tgid = task_tgid(task);
-
-	for (i = 0; i <= pid->level; i++) {
-		upid = &pid->numbers[i];
-		proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
-					tgid->numbers[i].nr);
-	}
+	proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock);
+	put_pid(pid);
 }
 
 static int proc_pid_instantiate(struct inode *dir,
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2af9f4f..8503444 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -39,7 +39,7 @@ static void proc_evict_inode(struct inode *inode)
 
 	/* Stop tracking associated processes */
 	if (ei->pid) {
-		put_pid(ei->pid);
+		proc_pid_evict_inode(ei);
 		ei->pid = NULL;
 	}
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6a1d679..0c6ca639 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -151,6 +151,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *,
 extern const struct dentry_operations pid_dentry_operations;
 extern int pid_getattr(struct vfsmount *, struct dentry *, struct kstat *);
 extern int proc_setattr(struct dentry *, struct iattr *);
+extern void proc_pid_evict_inode(struct proc_inode *);
 extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t);
 extern int pid_revalidate(struct dentry *, unsigned int);
 extern int pid_delete_dentry(const struct dentry *);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index f5552ba..04b4aaa 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -63,6 +63,7 @@ struct pid
 	unsigned int level;
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
+	struct hlist_head inodes;
 	/* wait queue for pidfd notifications */
 	wait_queue_head_t wait_pidfd;
 	struct rcu_head rcu;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index b97bf2e..d3580f5 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -12,7 +12,7 @@
 #ifdef CONFIG_PROC_FS
 
 extern void proc_root_init(void);
-extern void proc_flush_task(struct task_struct *);
+extern void proc_flush_pid(struct pid *);
 
 extern struct proc_dir_entry *proc_symlink(const char *,
 		struct proc_dir_entry *, const char *);
@@ -48,7 +48,7 @@ static inline void proc_root_init(void)
 {
 }
 
-static inline void proc_flush_task(struct task_struct *task)
+static inline void proc_flush_pid(struct pid *pid)
 {
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index f9943ef..5e66030 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -168,6 +168,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
+	struct pid *thread_pid;
 	int zap_leader;
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
@@ -176,10 +177,9 @@ void release_task(struct task_struct *p)
 	atomic_dec(&__task_cred(p)->user->processes);
 	rcu_read_unlock();
 
-	proc_flush_task(p);
-
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
+	thread_pid = get_pid(p->pids[PIDTYPE_PID].pid);
 	__exit_signal(p);
 
 	/*
@@ -202,6 +202,7 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
+	proc_flush_pid(thread_pid);
 	release_thread(p);
 	call_rcu(&p->rcu, delayed_put_task_struct);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index e605398..fb32a81 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -334,6 +334,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 		INIT_HLIST_HEAD(&pid->tasks[type]);
 
 	init_waitqueue_head(&pid->wait_pidfd);
+	INIT_HLIST_HEAD(&pid->inodes);
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 05/10] proc: use %u for pid printing and slightly less stack
  2020-12-03 18:31 ` [PATCH 05/10] proc: use %u for pid printing and slightly less stack Wen Yang
@ 2020-12-03 20:08   ` Alexey Dobriyan
  0 siblings, 0 replies; 19+ messages in thread
From: Alexey Dobriyan @ 2020-12-03 20:08 UTC (permalink / raw)
  To: Wen Yang
  Cc: Greg Kroah-Hartman, Sasha Levin, Xunlei Pang, linux-kernel,
	Alexander Viro, Andrew Morton, Linus Torvalds, stable

On Fri, Dec 04, 2020 at 02:31:59AM +0800, Wen Yang wrote:
> From: Alexey Dobriyan <adobriyan@gmail.com>
> 
> [ Upstream commit e3912ac37e07a13c70675cd75020694de4841c74 ]
> 
> PROC_NUMBUF is 13 which is enough for "negative int + \n + \0".
> 
> However PIDs and TGIDs are never negative and newline is not a concern,
> so use just 10 per integer.
> 
> Link: http://lkml.kernel.org/r/20171120203005.GA27743@avx2
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Alexander Viro <viro@ftp.linux.org.uk>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: <stable@vger.kernel.org> # 4.9.x

A what? How does this belong to stable?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
  2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
                   ` (9 preceding siblings ...)
  2020-12-03 18:32 ` [PATCH 10/10] proc: Use a list of inodes to flush from proc Wen Yang
@ 2020-12-17  2:26 ` Wen Yang
  2020-12-31  9:22   ` Greg Kroah-Hartman
  10 siblings, 1 reply; 19+ messages in thread
From: Wen Yang @ 2020-12-17  2:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Sasha Levin
  Cc: Xunlei Pang, linux-kernel, Pavel Emelyanov, Oleg Nesterov,
	Sukadev Bhattiprolu, Paul Menage, Eric W. Biederman, stable



在 2020/12/4 上午2:31, Wen Yang 写道:
> The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they
> should be deleted when the process exits.
> 
> Suppose the following race appears:
> 
> release_task                 dput
> -> proc_flush_task
>                               -> dentry->d_op->d_delete(dentry)
> -> __exit_signal
>                               -> dentry->d_lockref.count--  and return.
> 
> In the proc_flush_task(), if another process is using this dentry, it will
> not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
> before __exit_signal(pid has not been hashed), d_delete returns false, so
> this dentry still cannot be deleted.
> 
> This dentry will always be cached (although its count is 0 and the
> DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
> these dentries can only be deleted when drop_caches is manually triggered.
> 
> This will result in wasted memory. What's more troublesome is that these
> dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
> limit on the number of pid namespaces"), if the pid cannot be released, it
> may result in the inability to create a new pid_ns.
> 
> This problem occurred in our cluster environment (Linux 4.9 LTS).
> We could reproduce it by manually constructing a test program + adding some
> debugging switches in the kernel:
> * A test program to open the directory (/proc/<pid>/ns) [1]
> * Adding some debugging switches to the kernel, adding a delay between
>     proc_flush_task and __exit_signal in release_task() [2]
> 
> The test process is as follows:
> 
> A, terminal #1
> 
> Turn on the debug switch:
> echo 1> /proc/sys/vm/dentry_debug_trace
> 
> Execute the following unshare command:
> sudo unshare --pid --fork --mount-proc bash
> 
> 
> B, terminal #2
> 
> Find the pid of the unshare process:
> 
> # pstree -p | grep unshare
>             | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
> 
> 
> Find the corresponding dentry:
> # dmesg | grep pid=818
> [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8
> 
> 
> C, terminal #3
> 
> Execute the opendir program, it will always open the /proc/818/ns/ directory:
> 
> # ./a.out /proc/818/ns/
> pid: 876
> .
> ..
> net
> uts
> ipc
> pid
> user
> mnt
> cgroup
> 
> D, go back to terminal #2
> 
> Turn on the debugging switches to construct the race:
> # echo 818> /proc/sys/vm/dentry_debug_pid
> # echo 1> /proc/sys/vm/dentry_debug_delay
> 
> Kill the unshare process (pid 818). Since the debugging switches have been
> turned on, it will get stuck in release_task():
> # kill -9 818
> 
> Then kill the process that opened the /proc/818/ns/ directory:
> # kill -9 876
> 
> Then turn off these debugging switches to allow the 818 process to exit:
> # echo 0> /proc/sys/vm/dentry_debug_delay
> # echo 0> /proc/sys/vm/dentry_debug_pid
> 
> Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
> and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still
> cached:
> # dmesg | grep ffff8802a3999548
> …
> [565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached
> 
> 
> It could also be verified via the crash tool:
> 
> crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  ffff8802bea7b528
>    d_flags = 0x2800cc
>    d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk"
>    d_inode = 0xffff8802b38c2010
>    d_lockref = {
>      {
>        lock_count = 0x0,
>        {
>          lock = {
>            {
>              rlock = {
>                raw_lock = {
>                  {
>                    val = {
>                      counter = 0x0
>                    },
>                    {
>                      locked = 0x0,
>                      pending = 0x0
>                    },
>                    {
>                      locked_pending = 0x0,
>                      tail = 0x0
>                    }
>                  }
>                }
>              }
>            }
>          },
>          count = 0x0
>        }
>      }
>    }
> crash> kmem  ffff8802bea7b528
> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> ffff8802dd5f5900      192      23663     26130    871    16k  dentry
>    SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>    ffffea000afa9e00  ffff8802bea78000     0     30         25     5
>    FREE / [ALLOCATED]
>    [ffff8802bea7b520]
> 
>        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
> ffffea000afa9ec0 2bea7b000 dead000000000400        0  0 2fffff80000000
> crash>
> 
> This series of patches is to fix this issue.
> 
> Regards,
> Wen
> 
> Alexey Dobriyan (1):
>    proc: use %u for pid printing and slightly less stack
> 
> Andreas Gruenbacher (1):
>    proc: Pass file mode to proc_pid_make_inode
> 
> Christian Brauner (1):
>    clone: add CLONE_PIDFD
> 
> Eric W. Biederman (6):
>    proc: Better ownership of files for non-dumpable tasks in user
>      namespaces
>    proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
>    proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
>    proc: Clear the pieces of proc_inode that proc_evict_inode cares about
>    proc: Use d_invalidate in proc_prune_siblings_dcache
>    proc: Use a list of inodes to flush from proc
> 
> Joel Fernandes (Google) (1):
>    pidfd: add polling support
> 
>   fs/proc/base.c             | 242 ++++++++++++++++++++-------------------------
>   fs/proc/fd.c               |  20 +---
>   fs/proc/inode.c            |  67 ++++++++++++-
>   fs/proc/internal.h         |  22 ++---
>   fs/proc/namespaces.c       |   3 +-
>   fs/proc/proc_sysctl.c      |  45 ++-------
>   fs/proc/self.c             |   6 +-
>   fs/proc/thread_self.c      |   5 +-
>   include/linux/pid.h        |   5 +
>   include/linux/proc_fs.h    |   4 +-
>   include/uapi/linux/sched.h |   1 +
>   kernel/exit.c              |   5 +-
>   kernel/fork.c              | 145 ++++++++++++++++++++++++++-
>   kernel/pid.c               |   3 +
>   kernel/signal.c            |  11 +++
>   security/selinux/hooks.c   |   1 +
>   16 files changed, 357 insertions(+), 228 deletions(-)
> 
> [1] A test program to open the directory (/proc/<pid>/ns)
> #include <stdio.h>
> #include <sys/types.h>
> #include <dirent.h>
> #include <errno.h>
> 
> int main(int argc, char *argv[])
> {
> 	DIR *dip;
> 	struct dirent *dit;
> 
> 	if (argc < 2) {
> 		printf("Usage :%s <directory>\n", argv[0]);
> 		return -1;
> 	}
> 
> 	if ((dip = opendir(argv[1])) == NULL) {
> 		perror("opendir");
> 		return -1;
> 	}
> 
> 	printf("pid: %d\n", getpid());
> 	while((dit = readdir (dip)) != NULL) {
> 		printf("%s\n", dit->d_name);
> 	}
> 
> 	while (1)
> 		sleep (1);
> 
> 	return 0;
> }
> 
> [2] Adding some debugging switches to the kernel, also adding a delay between
>      proc_flush_task and __exit_signal in release_task():
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 05bad55..fafad37 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -84,6 +84,9 @@
>   int sysctl_vfs_cache_pressure __read_mostly = 100;
>   EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
> 
> +int sysctl_dentry_debug_trace __read_mostly = 0;
> +EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
> +
>   __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
> 
>   EXPORT_SYMBOL(rename_lock);
> @@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry)
>   	return 0;
>   }
> 
> +#define DENTRY_DEBUG_TRACE(dentry, keywords)                            \
> +do {                                                                    \
> +	if (sysctl_dentry_debug_trace)                                   \
> +		printk("XXX %s:%d "                                      \
> +                	"dentry=%s/%p, flag=%x, cnt=%d, inode=%p, "      \
> +                	"pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, "  \
> +			"keywords: %s\n",                                \
> +			__func__, __LINE__,                              \
> +			dentry->d_name.name,                             \
> +			dentry,                                          \
> +			dentry->d_flags,                                 \
> +			dentry->d_lockref.count,                         \
> +			dentry->d_inode,                                 \
> +			dentry->d_parent->d_name.name,                   \
> +			dentry->d_parent,                                \
> +			dentry->d_parent->d_flags,                       \
> +			dentry->d_parent->d_lockref.count,               \
> +			dentry->d_parent->d_inode,                       \
> +			keywords);                                       \
> +} while (0)
> 
>   /*
>    * This is dput
> @@ -804,6 +827,8 @@ void dput(struct dentry *dentry)
> 
>   	WARN_ON(d_in_lookup(dentry));
> 
> +	DENTRY_DEBUG_TRACE(dentry, "be checked");
> +
>   	/* Unreachable? Get rid of it */
>   	if (unlikely(d_unhashed(dentry)))
>   		goto kill_it;
> @@ -812,8 +837,10 @@ void dput(struct dentry *dentry)
>   		goto kill_it;
> 
>   	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
> -		if (dentry->d_op->d_delete(dentry))
> +		if (dentry->d_op->d_delete(dentry)) {
> +			DENTRY_DEBUG_TRACE(dentry, "be killed");
>   			goto kill_it;
> +		}
>   	}
> 
>   	if (!(dentry->d_flags & DCACHE_REFERENCED))
> @@ -822,6 +849,9 @@ void dput(struct dentry *dentry)
> 
>   	dentry->d_lockref.count--;
>   	spin_unlock(&dentry->d_lock);
> +
> +	DENTRY_DEBUG_TRACE(dentry, "be cached");
> +
>   	return;
> 
>   kill_it:
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index b9e4183..419a409 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task)
>   	}
>   }
> 
> +extern int sysctl_dentry_debug_trace;
> +
>   static int proc_pid_instantiate(struct inode *dir,
>   				   struct dentry * dentry,
>   				   struct task_struct *task, const void *ptr)
> @@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
>   	d_set_d_op(dentry, &pid_dentry_operations);
> 
>   	d_add(dentry, inode);
> +
> +	if (sysctl_dentry_debug_trace)
> +		printk("XXX %s:%d pid=%d tid=%d  entry=%s/%p\n",
> +			__func__, __LINE__, task->pid, task->tgid,
> +			dentry->d_name.name, dentry);
> +
>   	/* Close the race of the process dying before we return the dentry */
>   	if (pid_revalidate(dentry, 0))
>   		return 0;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 27f4168..2b3e1b6 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -55,6 +55,8 @@
>   #include <linux/shm.h>
>   #include <linux/kcov.h>
> 
> +#include <linux/delay.h>
> +
>   #include <asm/uaccess.h>
>   #include <asm/unistd.h>
>   #include <asm/pgtable.h>
> @@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>   	put_task_struct(tsk);
>   }
> 
> +int sysctl_dentry_debug_delay __read_mostly = 0;
> +int sysctl_dentry_debug_pid __read_mostly = 0;
> 
>   void release_task(struct task_struct *p)
>   {
> @@ -178,6 +182,11 @@ void release_task(struct task_struct *p)
> 
>   	proc_flush_task(p);
> 
> +	if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
> +		while (sysctl_dentry_debug_delay)
> +			mdelay(1);
> +	}
> +
>   	write_lock_irq(&tasklist_lock);
>   	ptrace_release_task(p);
>   	__exit_signal(p);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 513e6da..27f1395 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>   static int max_extfrag_threshold = 1000;
>   #endif
> 
> +extern int sysctl_dentry_debug_trace;
> +extern int sysctl_dentry_debug_delay;
> +extern int sysctl_dentry_debug_pid;
> +
>   static struct ctl_table kern_table[] = {
>   	{
>   		.procname	= "sched_child_runs_first",
> @@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>   		.proc_handler	= proc_dointvec,
>   		.extra1		= &zero,
>   	},
> +	{
> +		.procname	= "dentry_debug_trace",
> +		.data		= &sysctl_dentry_debug_trace,
> +		.maxlen		= sizeof(sysctl_dentry_debug_trace),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
> +	{
> +		.procname	= "dentry_debug_delay",
> +		.data		= &sysctl_dentry_debug_delay,
> +		.maxlen		= sizeof(sysctl_dentry_debug_delay),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
> +	{
> +		.procname	= "dentry_debug_pid",
> +		.data		= &sysctl_dentry_debug_pid,
> +		.maxlen		= sizeof(sysctl_dentry_debug_pid),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +		.extra1		= &zero,
> +	},
>   #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
>   	{
>   		.procname	= "legacy_va_layout",
> 
> 
> Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> Cc: Pavel Emelyanov <xemul@openvz.org>
> Cc: Oleg Nesterov <oleg@tv-sign.ru>
> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
> Cc: Paul Menage <menage@google.com>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: <stable@vger.kernel.org>
> 

Hi Greg,

Could you kindly give some suggestions?

Thanks,



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
  2020-12-17  2:26 ` [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
@ 2020-12-31  9:22   ` Greg Kroah-Hartman
  2021-01-04  4:15     ` Wen Yang
  0 siblings, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2020-12-31  9:22 UTC (permalink / raw)
  To: Wen Yang
  Cc: Sasha Levin, Xunlei Pang, linux-kernel, Pavel Emelyanov,
	Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage,
	Eric W. Biederman, stable

On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote:
> 
> 
> 在 2020/12/4 上午2:31, Wen Yang 写道:
> > The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they
> > should be deleted when the process exits.
> > 
> > Suppose the following race appears:
> > 
> > release_task                 dput
> > -> proc_flush_task
> >                               -> dentry->d_op->d_delete(dentry)
> > -> __exit_signal
> >                               -> dentry->d_lockref.count--  and return.
> > 
> > In the proc_flush_task(), if another process is using this dentry, it will
> > not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
> > before __exit_signal(pid has not been hashed), d_delete returns false, so
> > this dentry still cannot be deleted.
> > 
> > This dentry will always be cached (although its count is 0 and the
> > DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
> > these dentries can only be deleted when drop_caches is manually triggered.
> > 
> > This will result in wasted memory. What's more troublesome is that these
> > dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
> > limit on the number of pid namespaces"), if the pid cannot be released, it
> > may result in the inability to create a new pid_ns.
> > 
> > This problem occurred in our cluster environment (Linux 4.9 LTS).
> > We could reproduce it by manually constructing a test program + adding some
> > debugging switches in the kernel:
> > * A test program to open the directory (/proc/<pid>/ns) [1]
> > * Adding some debugging switches to the kernel, adding a delay between
> >     proc_flush_task and __exit_signal in release_task() [2]
> > 
> > The test process is as follows:
> > 
> > A, terminal #1
> > 
> > Turn on the debug switch:
> > echo 1> /proc/sys/vm/dentry_debug_trace
> > 
> > Execute the following unshare command:
> > sudo unshare --pid --fork --mount-proc bash
> > 
> > 
> > B, terminal #2
> > 
> > Find the pid of the unshare process:
> > 
> > # pstree -p | grep unshare
> >             | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
> > 
> > 
> > Find the corresponding dentry:
> > # dmesg | grep pid=818
> > [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8
> > 
> > 
> > C, terminal #3
> > 
> > Execute the opendir program, it will always open the /proc/818/ns/ directory:
> > 
> > # ./a.out /proc/818/ns/
> > pid: 876
> > .
> > ..
> > net
> > uts
> > ipc
> > pid
> > user
> > mnt
> > cgroup
> > 
> > D, go back to terminal #2
> > 
> > Turn on the debugging switches to construct the race:
> > # echo 818> /proc/sys/vm/dentry_debug_pid
> > # echo 1> /proc/sys/vm/dentry_debug_delay
> > 
> > Kill the unshare process (pid 818). Since the debugging switches have been
> > turned on, it will get stuck in release_task():
> > # kill -9 818
> > 
> > Then kill the process that opened the /proc/818/ns/ directory:
> > # kill -9 876
> > 
> > Then turn off these debugging switches to allow the 818 process to exit:
> > # echo 0> /proc/sys/vm/dentry_debug_delay
> > # echo 0> /proc/sys/vm/dentry_debug_pid
> > 
> > Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
> > and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still
> > cached:
> > # dmesg | grep ffff8802a3999548
> > …
> > [565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached
> > 
> > 
> > It could also be verified via the crash tool:
> > 
> > crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  ffff8802bea7b528
> >    d_flags = 0x2800cc
> >    d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk"
> >    d_inode = 0xffff8802b38c2010
> >    d_lockref = {
> >      {
> >        lock_count = 0x0,
> >        {
> >          lock = {
> >            {
> >              rlock = {
> >                raw_lock = {
> >                  {
> >                    val = {
> >                      counter = 0x0
> >                    },
> >                    {
> >                      locked = 0x0,
> >                      pending = 0x0
> >                    },
> >                    {
> >                      locked_pending = 0x0,
> >                      tail = 0x0
> >                    }
> >                  }
> >                }
> >              }
> >            }
> >          },
> >          count = 0x0
> >        }
> >      }
> >    }
> > crash> kmem  ffff8802bea7b528
> > CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
> > ffff8802dd5f5900      192      23663     26130    871    16k  dentry
> >    SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
> >    ffffea000afa9e00  ffff8802bea78000     0     30         25     5
> >    FREE / [ALLOCATED]
> >    [ffff8802bea7b520]
> > 
> >        PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
> > ffffea000afa9ec0 2bea7b000 dead000000000400        0  0 2fffff80000000
> > crash>
> > 
> > This series of patches is to fix this issue.
> > 
> > Regards,
> > Wen
> > 
> > Alexey Dobriyan (1):
> >    proc: use %u for pid printing and slightly less stack
> > 
> > Andreas Gruenbacher (1):
> >    proc: Pass file mode to proc_pid_make_inode
> > 
> > Christian Brauner (1):
> >    clone: add CLONE_PIDFD
> > 
> > Eric W. Biederman (6):
> >    proc: Better ownership of files for non-dumpable tasks in user
> >      namespaces
> >    proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
> >    proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
> >    proc: Clear the pieces of proc_inode that proc_evict_inode cares about
> >    proc: Use d_invalidate in proc_prune_siblings_dcache
> >    proc: Use a list of inodes to flush from proc
> > 
> > Joel Fernandes (Google) (1):
> >    pidfd: add polling support
> > 
> >   fs/proc/base.c             | 242 ++++++++++++++++++++-------------------------
> >   fs/proc/fd.c               |  20 +---
> >   fs/proc/inode.c            |  67 ++++++++++++-
> >   fs/proc/internal.h         |  22 ++---
> >   fs/proc/namespaces.c       |   3 +-
> >   fs/proc/proc_sysctl.c      |  45 ++-------
> >   fs/proc/self.c             |   6 +-
> >   fs/proc/thread_self.c      |   5 +-
> >   include/linux/pid.h        |   5 +
> >   include/linux/proc_fs.h    |   4 +-
> >   include/uapi/linux/sched.h |   1 +
> >   kernel/exit.c              |   5 +-
> >   kernel/fork.c              | 145 ++++++++++++++++++++++++++-
> >   kernel/pid.c               |   3 +
> >   kernel/signal.c            |  11 +++
> >   security/selinux/hooks.c   |   1 +
> >   16 files changed, 357 insertions(+), 228 deletions(-)
> > 
> > [1] A test program to open the directory (/proc/<pid>/ns)
> > #include <stdio.h>
> > #include <sys/types.h>
> > #include <dirent.h>
> > #include <errno.h>
> > 
> > int main(int argc, char *argv[])
> > {
> > 	DIR *dip;
> > 	struct dirent *dit;
> > 
> > 	if (argc < 2) {
> > 		printf("Usage :%s <directory>\n", argv[0]);
> > 		return -1;
> > 	}
> > 
> > 	if ((dip = opendir(argv[1])) == NULL) {
> > 		perror("opendir");
> > 		return -1;
> > 	}
> > 
> > 	printf("pid: %d\n", getpid());
> > 	while((dit = readdir (dip)) != NULL) {
> > 		printf("%s\n", dit->d_name);
> > 	}
> > 
> > 	while (1)
> > 		sleep (1);
> > 
> > 	return 0;
> > }
> > 
> > [2] Adding some debugging switches to the kernel, also adding a delay between
> >      proc_flush_task and __exit_signal in release_task():
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 05bad55..fafad37 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -84,6 +84,9 @@
> >   int sysctl_vfs_cache_pressure __read_mostly = 100;
> >   EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
> > 
> > +int sysctl_dentry_debug_trace __read_mostly = 0;
> > +EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
> > +
> >   __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
> > 
> >   EXPORT_SYMBOL(rename_lock);
> > @@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry)
> >   	return 0;
> >   }
> > 
> > +#define DENTRY_DEBUG_TRACE(dentry, keywords)                            \
> > +do {                                                                    \
> > +	if (sysctl_dentry_debug_trace)                                   \
> > +		printk("XXX %s:%d "                                      \
> > +                	"dentry=%s/%p, flag=%x, cnt=%d, inode=%p, "      \
> > +                	"pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, "  \
> > +			"keywords: %s\n",                                \
> > +			__func__, __LINE__,                              \
> > +			dentry->d_name.name,                             \
> > +			dentry,                                          \
> > +			dentry->d_flags,                                 \
> > +			dentry->d_lockref.count,                         \
> > +			dentry->d_inode,                                 \
> > +			dentry->d_parent->d_name.name,                   \
> > +			dentry->d_parent,                                \
> > +			dentry->d_parent->d_flags,                       \
> > +			dentry->d_parent->d_lockref.count,               \
> > +			dentry->d_parent->d_inode,                       \
> > +			keywords);                                       \
> > +} while (0)
> > 
> >   /*
> >    * This is dput
> > @@ -804,6 +827,8 @@ void dput(struct dentry *dentry)
> > 
> >   	WARN_ON(d_in_lookup(dentry));
> > 
> > +	DENTRY_DEBUG_TRACE(dentry, "be checked");
> > +
> >   	/* Unreachable? Get rid of it */
> >   	if (unlikely(d_unhashed(dentry)))
> >   		goto kill_it;
> > @@ -812,8 +837,10 @@ void dput(struct dentry *dentry)
> >   		goto kill_it;
> > 
> >   	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
> > -		if (dentry->d_op->d_delete(dentry))
> > +		if (dentry->d_op->d_delete(dentry)) {
> > +			DENTRY_DEBUG_TRACE(dentry, "be killed");
> >   			goto kill_it;
> > +		}
> >   	}
> > 
> >   	if (!(dentry->d_flags & DCACHE_REFERENCED))
> > @@ -822,6 +849,9 @@ void dput(struct dentry *dentry)
> > 
> >   	dentry->d_lockref.count--;
> >   	spin_unlock(&dentry->d_lock);
> > +
> > +	DENTRY_DEBUG_TRACE(dentry, "be cached");
> > +
> >   	return;
> > 
> >   kill_it:
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index b9e4183..419a409 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task)
> >   	}
> >   }
> > 
> > +extern int sysctl_dentry_debug_trace;
> > +
> >   static int proc_pid_instantiate(struct inode *dir,
> >   				   struct dentry * dentry,
> >   				   struct task_struct *task, const void *ptr)
> > @@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
> >   	d_set_d_op(dentry, &pid_dentry_operations);
> > 
> >   	d_add(dentry, inode);
> > +
> > +	if (sysctl_dentry_debug_trace)
> > +		printk("XXX %s:%d pid=%d tid=%d  entry=%s/%p\n",
> > +			__func__, __LINE__, task->pid, task->tgid,
> > +			dentry->d_name.name, dentry);
> > +
> >   	/* Close the race of the process dying before we return the dentry */
> >   	if (pid_revalidate(dentry, 0))
> >   		return 0;
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 27f4168..2b3e1b6 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -55,6 +55,8 @@
> >   #include <linux/shm.h>
> >   #include <linux/kcov.h>
> > 
> > +#include <linux/delay.h>
> > +
> >   #include <asm/uaccess.h>
> >   #include <asm/unistd.h>
> >   #include <asm/pgtable.h>
> > @@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
> >   	put_task_struct(tsk);
> >   }
> > 
> > +int sysctl_dentry_debug_delay __read_mostly = 0;
> > +int sysctl_dentry_debug_pid __read_mostly = 0;
> > 
> >   void release_task(struct task_struct *p)
> >   {
> > @@ -178,6 +182,11 @@ void release_task(struct task_struct *p)
> > 
> >   	proc_flush_task(p);
> > 
> > +	if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
> > +		while (sysctl_dentry_debug_delay)
> > +			mdelay(1);
> > +	}
> > +
> >   	write_lock_irq(&tasklist_lock);
> >   	ptrace_release_task(p);
> >   	__exit_signal(p);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 513e6da..27f1395 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
> >   static int max_extfrag_threshold = 1000;
> >   #endif
> > 
> > +extern int sysctl_dentry_debug_trace;
> > +extern int sysctl_dentry_debug_delay;
> > +extern int sysctl_dentry_debug_pid;
> > +
> >   static struct ctl_table kern_table[] = {
> >   	{
> >   		.procname	= "sched_child_runs_first",
> > @@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
> >   		.proc_handler	= proc_dointvec,
> >   		.extra1		= &zero,
> >   	},
> > +	{
> > +		.procname	= "dentry_debug_trace",
> > +		.data		= &sysctl_dentry_debug_trace,
> > +		.maxlen		= sizeof(sysctl_dentry_debug_trace),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec,
> > +		.extra1		= &zero,
> > +	},
> > +	{
> > +		.procname	= "dentry_debug_delay",
> > +		.data		= &sysctl_dentry_debug_delay,
> > +		.maxlen		= sizeof(sysctl_dentry_debug_delay),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec,
> > +		.extra1		= &zero,
> > +	},
> > +	{
> > +		.procname	= "dentry_debug_pid",
> > +		.data		= &sysctl_dentry_debug_pid,
> > +		.maxlen		= sizeof(sysctl_dentry_debug_pid),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec,
> > +		.extra1		= &zero,
> > +	},
> >   #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
> >   	{
> >   		.procname	= "legacy_va_layout",
> > 
> > 
> > Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> > Cc: Pavel Emelyanov <xemul@openvz.org>
> > Cc: Oleg Nesterov <oleg@tv-sign.ru>
> > Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
> > Cc: Paul Menage <menage@google.com>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: <stable@vger.kernel.org>
> > 
> 
> Hi Greg,
> 
> Could you kindly give some suggestions?

I'm sorry, but I don't really understand what this patch series is for.

Why is there a patch in the 00/10 email?  What stable kernel(s) should
this be backported to?  And why can't you just use a newer kernel (like
4.19) if you are hitting this issue with much older ones?

confused,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
  2020-12-31  9:22   ` Greg Kroah-Hartman
@ 2021-01-04  4:15     ` Wen Yang
  0 siblings, 0 replies; 19+ messages in thread
From: Wen Yang @ 2021-01-04  4:15 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Sasha Levin, Xunlei Pang, linux-kernel, Pavel Emelyanov,
	Oleg Nesterov, Sukadev Bhattiprolu, Paul Menage,
	Eric W. Biederman, stable



在 2020/12/31 下午5:22, Greg Kroah-Hartman 写道:
> On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote:
>>
>>
>> 在 2020/12/4 上午2:31, Wen Yang 写道:
>>> The dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag, they
>>> should be deleted when the process exits.
>>>
>>> Suppose the following race appears:
>>>
>>> release_task                 dput
>>> -> proc_flush_task
>>>                                -> dentry->d_op->d_delete(dentry)
>>> -> __exit_signal
>>>                                -> dentry->d_lockref.count--  and return.
>>>
>>> In the proc_flush_task(), if another process is using this dentry, it will
>>> not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
>>> before __exit_signal(pid has not been hashed), d_delete returns false, so
>>> this dentry still cannot be deleted.
>>>
>>> This dentry will always be cached (although its count is 0 and the
>>> DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
>>> these dentries can only be deleted when drop_caches is manually triggered.
>>>
>>> This will result in wasted memory. What's more troublesome is that these
>>> dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
>>> limit on the number of pid namespaces"), if the pid cannot be released, it
>>> may result in the inability to create a new pid_ns.
>>>
>>> This problem occurred in our cluster environment (Linux 4.9 LTS).
>>> We could reproduce it by manually constructing a test program + adding some
>>> debugging switches in the kernel:
>>> * A test program to open the directory (/proc/<pid>/ns) [1]
>>> * Adding some debugging switches to the kernel, adding a delay between
>>>      proc_flush_task and __exit_signal in release_task() [2]
>>>
>>> The test process is as follows:
>>>
>>> A, terminal #1
>>>
>>> Turn on the debug switch:
>>> echo 1> /proc/sys/vm/dentry_debug_trace
>>>
>>> Execute the following unshare command:
>>> sudo unshare --pid --fork --mount-proc bash
>>>
>>>
>>> B, terminal #2
>>>
>>> Find the pid of the unshare process:
>>>
>>> # pstree -p | grep unshare
>>>              | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
>>>
>>>
>>> Find the corresponding dentry:
>>> # dmesg | grep pid=818
>>> [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8
>>>
>>>
>>> C, terminal #3
>>>
>>> Execute the opendir program, it will always open the /proc/818/ns/ directory:
>>>
>>> # ./a.out /proc/818/ns/
>>> pid: 876
>>> .
>>> ..
>>> net
>>> uts
>>> ipc
>>> pid
>>> user
>>> mnt
>>> cgroup
>>>
>>> D, go back to terminal #2
>>>
>>> Turn on the debugging switches to construct the race:
>>> # echo 818> /proc/sys/vm/dentry_debug_pid
>>> # echo 1> /proc/sys/vm/dentry_debug_delay
>>>
>>> Kill the unshare process (pid 818). Since the debugging switches have been
>>> turned on, it will get stuck in release_task():
>>> # kill -9 818
>>>
>>> Then kill the process that opened the /proc/818/ns/ directory:
>>> # kill -9 876
>>>
>>> Then turn off these debugging switches to allow the 818 process to exit:
>>> # echo 0> /proc/sys/vm/dentry_debug_delay
>>> # echo 0> /proc/sys/vm/dentry_debug_pid
>>>
>>> Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
>>> and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still
>>> cached:
>>> # dmesg | grep ffff8802a3999548
>>> …
>>> [565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached
>>>
>>>
>>> It could also be verified via the crash tool:
>>>
>>> crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  ffff8802bea7b528
>>>     d_flags = 0x2800cc
>>>     d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk"
>>>     d_inode = 0xffff8802b38c2010
>>>     d_lockref = {
>>>       {
>>>         lock_count = 0x0,
>>>         {
>>>           lock = {
>>>             {
>>>               rlock = {
>>>                 raw_lock = {
>>>                   {
>>>                     val = {
>>>                       counter = 0x0
>>>                     },
>>>                     {
>>>                       locked = 0x0,
>>>                       pending = 0x0
>>>                     },
>>>                     {
>>>                       locked_pending = 0x0,
>>>                       tail = 0x0
>>>                     }
>>>                   }
>>>                 }
>>>               }
>>>             }
>>>           },
>>>           count = 0x0
>>>         }
>>>       }
>>>     }
>>> crash> kmem  ffff8802bea7b528
>>> CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
>>> ffff8802dd5f5900      192      23663     26130    871    16k  dentry
>>>     SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
>>>     ffffea000afa9e00  ffff8802bea78000     0     30         25     5
>>>     FREE / [ALLOCATED]
>>>     [ffff8802bea7b520]
>>>
>>>         PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
>>> ffffea000afa9ec0 2bea7b000 dead000000000400        0  0 2fffff80000000
>>> crash>
>>>
>>> This series of patches is to fix this issue.
>>>
>>> Regards,
>>> Wen
>>>
>>> Alexey Dobriyan (1):
>>>     proc: use %u for pid printing and slightly less stack
>>>
>>> Andreas Gruenbacher (1):
>>>     proc: Pass file mode to proc_pid_make_inode
>>>
>>> Christian Brauner (1):
>>>     clone: add CLONE_PIDFD
>>>
>>> Eric W. Biederman (6):
>>>     proc: Better ownership of files for non-dumpable tasks in user
>>>       namespaces
>>>     proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
>>>     proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
>>>     proc: Clear the pieces of proc_inode that proc_evict_inode cares about
>>>     proc: Use d_invalidate in proc_prune_siblings_dcache
>>>     proc: Use a list of inodes to flush from proc
>>>
>>> Joel Fernandes (Google) (1):
>>>     pidfd: add polling support
>>>
>>>    fs/proc/base.c             | 242 ++++++++++++++++++++-------------------------
>>>    fs/proc/fd.c               |  20 +---
>>>    fs/proc/inode.c            |  67 ++++++++++++-
>>>    fs/proc/internal.h         |  22 ++---
>>>    fs/proc/namespaces.c       |   3 +-
>>>    fs/proc/proc_sysctl.c      |  45 ++-------
>>>    fs/proc/self.c             |   6 +-
>>>    fs/proc/thread_self.c      |   5 +-
>>>    include/linux/pid.h        |   5 +
>>>    include/linux/proc_fs.h    |   4 +-
>>>    include/uapi/linux/sched.h |   1 +
>>>    kernel/exit.c              |   5 +-
>>>    kernel/fork.c              | 145 ++++++++++++++++++++++++++-
>>>    kernel/pid.c               |   3 +
>>>    kernel/signal.c            |  11 +++
>>>    security/selinux/hooks.c   |   1 +
>>>    16 files changed, 357 insertions(+), 228 deletions(-)
>>>
>>> [1] A test program to open the directory (/proc/<pid>/ns)
>>> #include <stdio.h>
>>> #include <sys/types.h>
>>> #include <dirent.h>
>>> #include <errno.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> 	DIR *dip;
>>> 	struct dirent *dit;
>>>
>>> 	if (argc < 2) {
>>> 		printf("Usage :%s <directory>\n", argv[0]);
>>> 		return -1;
>>> 	}
>>>
>>> 	if ((dip = opendir(argv[1])) == NULL) {
>>> 		perror("opendir");
>>> 		return -1;
>>> 	}
>>>
>>> 	printf("pid: %d\n", getpid());
>>> 	while((dit = readdir (dip)) != NULL) {
>>> 		printf("%s\n", dit->d_name);
>>> 	}
>>>
>>> 	while (1)
>>> 		sleep (1);
>>>
>>> 	return 0;
>>> }
>>>
>>> [2] Adding some debugging switches to the kernel, also adding a delay between
>>>       proc_flush_task and __exit_signal in release_task():
>>>
>>> diff --git a/fs/dcache.c b/fs/dcache.c
>>> index 05bad55..fafad37 100644
>>> --- a/fs/dcache.c
>>> +++ b/fs/dcache.c
>>> @@ -84,6 +84,9 @@
>>>    int sysctl_vfs_cache_pressure __read_mostly = 100;
>>>    EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
>>>
>>> +int sysctl_dentry_debug_trace __read_mostly = 0;
>>> +EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace);
>>> +
>>>    __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
>>>
>>>    EXPORT_SYMBOL(rename_lock);
>>> @@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry)
>>>    	return 0;
>>>    }
>>>
>>> +#define DENTRY_DEBUG_TRACE(dentry, keywords)                            \
>>> +do {                                                                    \
>>> +	if (sysctl_dentry_debug_trace)                                   \
>>> +		printk("XXX %s:%d "                                      \
>>> +                	"dentry=%s/%p, flag=%x, cnt=%d, inode=%p, "      \
>>> +                	"pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, "  \
>>> +			"keywords: %s\n",                                \
>>> +			__func__, __LINE__,                              \
>>> +			dentry->d_name.name,                             \
>>> +			dentry,                                          \
>>> +			dentry->d_flags,                                 \
>>> +			dentry->d_lockref.count,                         \
>>> +			dentry->d_inode,                                 \
>>> +			dentry->d_parent->d_name.name,                   \
>>> +			dentry->d_parent,                                \
>>> +			dentry->d_parent->d_flags,                       \
>>> +			dentry->d_parent->d_lockref.count,               \
>>> +			dentry->d_parent->d_inode,                       \
>>> +			keywords);                                       \
>>> +} while (0)
>>>
>>>    /*
>>>     * This is dput
>>> @@ -804,6 +827,8 @@ void dput(struct dentry *dentry)
>>>
>>>    	WARN_ON(d_in_lookup(dentry));
>>>
>>> +	DENTRY_DEBUG_TRACE(dentry, "be checked");
>>> +
>>>    	/* Unreachable? Get rid of it */
>>>    	if (unlikely(d_unhashed(dentry)))
>>>    		goto kill_it;
>>> @@ -812,8 +837,10 @@ void dput(struct dentry *dentry)
>>>    		goto kill_it;
>>>
>>>    	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
>>> -		if (dentry->d_op->d_delete(dentry))
>>> +		if (dentry->d_op->d_delete(dentry)) {
>>> +			DENTRY_DEBUG_TRACE(dentry, "be killed");
>>>    			goto kill_it;
>>> +		}
>>>    	}
>>>
>>>    	if (!(dentry->d_flags & DCACHE_REFERENCED))
>>> @@ -822,6 +849,9 @@ void dput(struct dentry *dentry)
>>>
>>>    	dentry->d_lockref.count--;
>>>    	spin_unlock(&dentry->d_lock);
>>> +
>>> +	DENTRY_DEBUG_TRACE(dentry, "be cached");
>>> +
>>>    	return;
>>>
>>>    kill_it:
>>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>>> index b9e4183..419a409 100644
>>> --- a/fs/proc/base.c
>>> +++ b/fs/proc/base.c
>>> @@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task)
>>>    	}
>>>    }
>>>
>>> +extern int sysctl_dentry_debug_trace;
>>> +
>>>    static int proc_pid_instantiate(struct inode *dir,
>>>    				   struct dentry * dentry,
>>>    				   struct task_struct *task, const void *ptr)
>>> @@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
>>>    	d_set_d_op(dentry, &pid_dentry_operations);
>>>
>>>    	d_add(dentry, inode);
>>> +
>>> +	if (sysctl_dentry_debug_trace)
>>> +		printk("XXX %s:%d pid=%d tid=%d  entry=%s/%p\n",
>>> +			__func__, __LINE__, task->pid, task->tgid,
>>> +			dentry->d_name.name, dentry);
>>> +
>>>    	/* Close the race of the process dying before we return the dentry */
>>>    	if (pid_revalidate(dentry, 0))
>>>    		return 0;
>>> diff --git a/kernel/exit.c b/kernel/exit.c
>>> index 27f4168..2b3e1b6 100644
>>> --- a/kernel/exit.c
>>> +++ b/kernel/exit.c
>>> @@ -55,6 +55,8 @@
>>>    #include <linux/shm.h>
>>>    #include <linux/kcov.h>
>>>
>>> +#include <linux/delay.h>
>>> +
>>>    #include <asm/uaccess.h>
>>>    #include <asm/unistd.h>
>>>    #include <asm/pgtable.h>
>>> @@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>>>    	put_task_struct(tsk);
>>>    }
>>>
>>> +int sysctl_dentry_debug_delay __read_mostly = 0;
>>> +int sysctl_dentry_debug_pid __read_mostly = 0;
>>>
>>>    void release_task(struct task_struct *p)
>>>    {
>>> @@ -178,6 +182,11 @@ void release_task(struct task_struct *p)
>>>
>>>    	proc_flush_task(p);
>>>
>>> +	if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
>>> +		while (sysctl_dentry_debug_delay)
>>> +			mdelay(1);
>>> +	}
>>> +
>>>    	write_lock_irq(&tasklist_lock);
>>>    	ptrace_release_task(p);
>>>    	__exit_signal(p);
>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>> index 513e6da..27f1395 100644
>>> --- a/kernel/sysctl.c
>>> +++ b/kernel/sysctl.c
>>> @@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>>>    static int max_extfrag_threshold = 1000;
>>>    #endif
>>>
>>> +extern int sysctl_dentry_debug_trace;
>>> +extern int sysctl_dentry_debug_delay;
>>> +extern int sysctl_dentry_debug_pid;
>>> +
>>>    static struct ctl_table kern_table[] = {
>>>    	{
>>>    		.procname	= "sched_child_runs_first",
>>> @@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>>>    		.proc_handler	= proc_dointvec,
>>>    		.extra1		= &zero,
>>>    	},
>>> +	{
>>> +		.procname	= "dentry_debug_trace",
>>> +		.data		= &sysctl_dentry_debug_trace,
>>> +		.maxlen		= sizeof(sysctl_dentry_debug_trace),
>>> +		.mode		= 0644,
>>> +		.proc_handler	= proc_dointvec,
>>> +		.extra1		= &zero,
>>> +	},
>>> +	{
>>> +		.procname	= "dentry_debug_delay",
>>> +		.data		= &sysctl_dentry_debug_delay,
>>> +		.maxlen		= sizeof(sysctl_dentry_debug_delay),
>>> +		.mode		= 0644,
>>> +		.proc_handler	= proc_dointvec,
>>> +		.extra1		= &zero,
>>> +	},
>>> +	{
>>> +		.procname	= "dentry_debug_pid",
>>> +		.data		= &sysctl_dentry_debug_pid,
>>> +		.maxlen		= sizeof(sysctl_dentry_debug_pid),
>>> +		.mode		= 0644,
>>> +		.proc_handler	= proc_dointvec,
>>> +		.extra1		= &zero,
>>> +	},
>>>    #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
>>>    	{
>>>    		.procname	= "legacy_va_layout",
>>>
>>>
>>> Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
>>> Cc: Pavel Emelyanov <xemul@openvz.org>
>>> Cc: Oleg Nesterov <oleg@tv-sign.ru>
>>> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
>>> Cc: Paul Menage <menage@google.com>
>>> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
>>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>> Cc: <stable@vger.kernel.org>
>>>
>>
>> Hi Greg,
>>
>> Could you kindly give some suggestions?
> 
> I'm sorry, but I don't really understand what this patch series is for.
> 
> Why is there a patch in the 00/10 email?  What stable kernel(s) should
> this be backported to?  And why can't you just use a newer kernel (like
> 4.19) if you are hitting this issue with much older ones?
> 

This bug was introduced by 60347f6716aa ("pid namespaces: prepare 
proc_flust_task() to flush entries from multiple proc trees"), exposed 
by f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), 
and then fixed by 7bc3e6e55acf (“proc: Use a list of inodes to flush 
from proc”)

4.19 LTS did not solve it either.


The one that fixes the bug is this patch (10/10):
https://lkml.org/lkml/2020/12/3/1046
The previous 9 patches (from 01/10 to 09/10) are its pre-dependencies.

The 00/10 is some of our test programs, including a user mode program 
(open the /proc/<pid>/ns directory), and some hacks added to the kernel 
(just add log printing and some delays for easy construction this race).

Thanks.

-- 
Best wishes,
Wen


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/10] clone: add CLONE_PIDFD
  2020-12-03 18:31 ` [PATCH 01/10] clone: add CLONE_PIDFD Wen Yang
@ 2021-01-04 13:03   ` Greg Kroah-Hartman
  2021-01-04 13:13     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2021-01-04 13:03 UTC (permalink / raw)
  To: Wen Yang
  Cc: Sasha Levin, Xunlei Pang, linux-kernel, Christian Brauner,
	Linus Torvalds, Jann Horn, Oleg Nesterov, Arnd Bergmann,
	Eric W. Biederman, Kees Cook, Thomas Gleixner, David Howells,
	Michael Kerrisk (man-pages),
	Andy Lutomirsky, Andrew Morton, Aleksa Sarai, Al Viro, stable

On Fri, Dec 04, 2020 at 02:31:55AM +0800, Wen Yang wrote:
> From: Christian Brauner <christian@brauner.io>
> 
> [ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
> 
> This patchset makes it possible to retrieve pid file descriptors at
> process creation time by introducing the new flag CLONE_PIDFD to the
> clone() system call.  Linus originally suggested to implement this as a
> new flag to clone() instead of making it a separate system call.  As
> spotted by Linus, there is exactly one bit for clone() left.
> 
> CLONE_PIDFD creates file descriptors based on the anonymous inode
> implementation in the kernel that will also be used to implement the new
> mount api.  They serve as a simple opaque handle on pids.  Logically,
> this makes it possible to interpret a pidfd differently, narrowing or
> widening the scope of various operations (e.g. signal sending).  Thus, a
> pidfd cannot just refer to a tgid, but also a tid, or in theory - given
> appropriate flag arguments in relevant syscalls - a process group or
> session. A pidfd does not represent a privilege.  This does not imply it
> cannot ever be that way but for now this is not the case.
> 
> A pidfd comes with additional information in fdinfo if the kernel supports
> procfs.  The fdinfo file contains the pid of the process in the callers
> pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
> 
> As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
> parent_tidptr argument of clone.  This has the advantage that we can
> give back the associated pid and the pidfd at the same time.
> 
> To remove worries about missing metadata access this patchset comes with
> a sample program that illustrates how a combination of CLONE_PIDFD, and
> pidfd_send_signal() can be used to gain race-free access to process
> metadata through /proc/<pid>.  The sample program can easily be
> translated into a helper that would be suitable for inclusion in libc so
> that users don't have to worry about writing it themselves.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Co-developed-by: Jann Horn <jannh@google.com>
> Signed-off-by: Jann Horn <jannh@google.com>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: David Howells <dhowells@redhat.com>
> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> Cc: Andy Lutomirsky <luto@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: <stable@vger.kernel.org> # 4.9.x
> (clone: fix up cherry-pick conflicts for b3e583825266)
> Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> ---
>  include/linux/pid.h        |   1 +
>  include/uapi/linux/sched.h |   1 +
>  kernel/fork.c              | 119 +++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 117 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 97b745d..7599a78 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -73,6 +73,7 @@ struct pid_link
>  	struct hlist_node node;
>  	struct pid *pid;
>  };
> +extern const struct file_operations pidfd_fops;
>  
>  static inline struct pid *get_pid(struct pid *pid)
>  {
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 5f0fe01..ed6e31d 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -9,6 +9,7 @@
>  #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
>  #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
>  #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
> +#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
>  #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
>  #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
>  #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b64efec..076297a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -11,7 +11,22 @@
>   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
>   */
>  
> +#include <linux/anon_inodes.h>
>  #include <linux/slab.h>
> +#if 0
> +#include <linux/sched/autogroup.h>
> +#include <linux/sched/mm.h>
> +#include <linux/sched/coredump.h>
> +#include <linux/sched/user.h>
> +#include <linux/sched/numa_balancing.h>
> +#include <linux/sched/stat.h>
> +#include <linux/sched/task.h>
> +#include <linux/sched/task_stack.h>
> +#include <linux/sched/cputime.h>
> +#include <linux/seq_file.h>
> +#include <linux/rtmutex.h>
> +>>>>>>> b3e58382... clone: add CLONE_PIDFD
> +#endif

That looks odd :(

Can you please refresh this patch series, and make sure it is correct
and resend it?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/10] clone: add CLONE_PIDFD
  2021-01-04 13:03   ` Greg Kroah-Hartman
@ 2021-01-04 13:13     ` Christian Brauner
  2021-01-04 13:17       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2021-01-04 13:13 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Wen Yang, Sasha Levin, Xunlei Pang, linux-kernel,
	Christian Brauner, Linus Torvalds, Jann Horn, Oleg Nesterov,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Thomas Gleixner,
	David Howells, Michael Kerrisk (man-pages),
	Andy Lutomirsky, Andrew Morton, Aleksa Sarai, Al Viro, stable

On Mon, Jan 04, 2021 at 02:03:14PM +0100, Greg Kroah-Hartman wrote:
> On Fri, Dec 04, 2020 at 02:31:55AM +0800, Wen Yang wrote:
> > From: Christian Brauner <christian@brauner.io>
> > 
> > [ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
> > 
> > This patchset makes it possible to retrieve pid file descriptors at
> > process creation time by introducing the new flag CLONE_PIDFD to the
> > clone() system call.  Linus originally suggested to implement this as a
> > new flag to clone() instead of making it a separate system call.  As
> > spotted by Linus, there is exactly one bit for clone() left.
> > 
> > CLONE_PIDFD creates file descriptors based on the anonymous inode
> > implementation in the kernel that will also be used to implement the new
> > mount api.  They serve as a simple opaque handle on pids.  Logically,
> > this makes it possible to interpret a pidfd differently, narrowing or
> > widening the scope of various operations (e.g. signal sending).  Thus, a
> > pidfd cannot just refer to a tgid, but also a tid, or in theory - given
> > appropriate flag arguments in relevant syscalls - a process group or
> > session. A pidfd does not represent a privilege.  This does not imply it
> > cannot ever be that way but for now this is not the case.
> > 
> > A pidfd comes with additional information in fdinfo if the kernel supports
> > procfs.  The fdinfo file contains the pid of the process in the callers
> > pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
> > 
> > As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
> > parent_tidptr argument of clone.  This has the advantage that we can
> > give back the associated pid and the pidfd at the same time.
> > 
> > To remove worries about missing metadata access this patchset comes with
> > a sample program that illustrates how a combination of CLONE_PIDFD, and
> > pidfd_send_signal() can be used to gain race-free access to process
> > metadata through /proc/<pid>.  The sample program can easily be
> > translated into a helper that would be suitable for inclusion in libc so
> > that users don't have to worry about writing it themselves.
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> > Co-developed-by: Jann Horn <jannh@google.com>
> > Signed-off-by: Jann Horn <jannh@google.com>
> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> > Cc: Andy Lutomirsky <luto@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: <stable@vger.kernel.org> # 4.9.x
> > (clone: fix up cherry-pick conflicts for b3e583825266)
> > Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> > ---
> >  include/linux/pid.h        |   1 +
> >  include/uapi/linux/sched.h |   1 +
> >  kernel/fork.c              | 119 +++++++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 117 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > index 97b745d..7599a78 100644
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -73,6 +73,7 @@ struct pid_link
> >  	struct hlist_node node;
> >  	struct pid *pid;
> >  };
> > +extern const struct file_operations pidfd_fops;
> >  
> >  static inline struct pid *get_pid(struct pid *pid)
> >  {
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index 5f0fe01..ed6e31d 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -9,6 +9,7 @@
> >  #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
> >  #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
> >  #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
> > +#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
> >  #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
> >  #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
> >  #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index b64efec..076297a 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -11,7 +11,22 @@
> >   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
> >   */
> >  
> > +#include <linux/anon_inodes.h>
> >  #include <linux/slab.h>
> > +#if 0
> > +#include <linux/sched/autogroup.h>
> > +#include <linux/sched/mm.h>
> > +#include <linux/sched/coredump.h>
> > +#include <linux/sched/user.h>
> > +#include <linux/sched/numa_balancing.h>
> > +#include <linux/sched/stat.h>
> > +#include <linux/sched/task.h>
> > +#include <linux/sched/task_stack.h>
> > +#include <linux/sched/cputime.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/rtmutex.h>
> > +>>>>>>> b3e58382... clone: add CLONE_PIDFD
> > +#endif
> 
> That looks odd :(
> 
> Can you please refresh this patch series, and make sure it is correct
> and resend it?

Uhm, this patch series has been merged at least a year ago so this looks
like an accidental send.
This probably isn't meant for upstream but for some alibaba specific
kernel I'd reckon.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/10] clone: add CLONE_PIDFD
  2021-01-04 13:13     ` Christian Brauner
@ 2021-01-04 13:17       ` Greg Kroah-Hartman
  2021-01-04 13:19         ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Greg Kroah-Hartman @ 2021-01-04 13:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Wen Yang, Sasha Levin, Xunlei Pang, linux-kernel,
	Christian Brauner, Linus Torvalds, Jann Horn, Oleg Nesterov,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Thomas Gleixner,
	David Howells, Michael Kerrisk (man-pages),
	Andy Lutomirsky, Andrew Morton, Aleksa Sarai, Al Viro, stable

On Mon, Jan 04, 2021 at 02:13:42PM +0100, Christian Brauner wrote:
> On Mon, Jan 04, 2021 at 02:03:14PM +0100, Greg Kroah-Hartman wrote:
> > On Fri, Dec 04, 2020 at 02:31:55AM +0800, Wen Yang wrote:
> > > From: Christian Brauner <christian@brauner.io>
> > > 
> > > [ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
> > > 
> > > This patchset makes it possible to retrieve pid file descriptors at
> > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > clone() system call.  Linus originally suggested to implement this as a
> > > new flag to clone() instead of making it a separate system call.  As
> > > spotted by Linus, there is exactly one bit for clone() left.
> > > 
> > > CLONE_PIDFD creates file descriptors based on the anonymous inode
> > > implementation in the kernel that will also be used to implement the new
> > > mount api.  They serve as a simple opaque handle on pids.  Logically,
> > > this makes it possible to interpret a pidfd differently, narrowing or
> > > widening the scope of various operations (e.g. signal sending).  Thus, a
> > > pidfd cannot just refer to a tgid, but also a tid, or in theory - given
> > > appropriate flag arguments in relevant syscalls - a process group or
> > > session. A pidfd does not represent a privilege.  This does not imply it
> > > cannot ever be that way but for now this is not the case.
> > > 
> > > A pidfd comes with additional information in fdinfo if the kernel supports
> > > procfs.  The fdinfo file contains the pid of the process in the callers
> > > pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
> > > 
> > > As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
> > > parent_tidptr argument of clone.  This has the advantage that we can
> > > give back the associated pid and the pidfd at the same time.
> > > 
> > > To remove worries about missing metadata access this patchset comes with
> > > a sample program that illustrates how a combination of CLONE_PIDFD, and
> > > pidfd_send_signal() can be used to gain race-free access to process
> > > metadata through /proc/<pid>.  The sample program can easily be
> > > translated into a helper that would be suitable for inclusion in libc so
> > > that users don't have to worry about writing it themselves.
> > > 
> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > Signed-off-by: Christian Brauner <christian@brauner.io>
> > > Co-developed-by: Jann Horn <jannh@google.com>
> > > Signed-off-by: Jann Horn <jannh@google.com>
> > > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > > Cc: Arnd Bergmann <arnd@arndb.de>
> > > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > > Cc: Kees Cook <keescook@chromium.org>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: David Howells <dhowells@redhat.com>
> > > Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> > > Cc: Andy Lutomirsky <luto@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > Cc: <stable@vger.kernel.org> # 4.9.x
> > > (clone: fix up cherry-pick conflicts for b3e583825266)
> > > Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> > > ---
> > >  include/linux/pid.h        |   1 +
> > >  include/uapi/linux/sched.h |   1 +
> > >  kernel/fork.c              | 119 +++++++++++++++++++++++++++++++++++++++++++--
> > >  3 files changed, 117 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > index 97b745d..7599a78 100644
> > > --- a/include/linux/pid.h
> > > +++ b/include/linux/pid.h
> > > @@ -73,6 +73,7 @@ struct pid_link
> > >  	struct hlist_node node;
> > >  	struct pid *pid;
> > >  };
> > > +extern const struct file_operations pidfd_fops;
> > >  
> > >  static inline struct pid *get_pid(struct pid *pid)
> > >  {
> > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > index 5f0fe01..ed6e31d 100644
> > > --- a/include/uapi/linux/sched.h
> > > +++ b/include/uapi/linux/sched.h
> > > @@ -9,6 +9,7 @@
> > >  #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
> > >  #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
> > >  #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
> > > +#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
> > >  #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
> > >  #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
> > >  #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index b64efec..076297a 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -11,7 +11,22 @@
> > >   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
> > >   */
> > >  
> > > +#include <linux/anon_inodes.h>
> > >  #include <linux/slab.h>
> > > +#if 0
> > > +#include <linux/sched/autogroup.h>
> > > +#include <linux/sched/mm.h>
> > > +#include <linux/sched/coredump.h>
> > > +#include <linux/sched/user.h>
> > > +#include <linux/sched/numa_balancing.h>
> > > +#include <linux/sched/stat.h>
> > > +#include <linux/sched/task.h>
> > > +#include <linux/sched/task_stack.h>
> > > +#include <linux/sched/cputime.h>
> > > +#include <linux/seq_file.h>
> > > +#include <linux/rtmutex.h>
> > > +>>>>>>> b3e58382... clone: add CLONE_PIDFD
> > > +#endif
> > 
> > That looks odd :(
> > 
> > Can you please refresh this patch series, and make sure it is correct
> > and resend it?
> 
> Uhm, this patch series has been merged at least a year ago so this looks
> like an accidental send.
> This probably isn't meant for upstream but for some alibaba specific
> kernel I'd reckon.

This was ment for the 4.19.y kernel to solve a problem reported in patch
00/XX of the series.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/10] clone: add CLONE_PIDFD
  2021-01-04 13:17       ` Greg Kroah-Hartman
@ 2021-01-04 13:19         ` Christian Brauner
  0 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2021-01-04 13:19 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Wen Yang, Sasha Levin, Xunlei Pang, linux-kernel,
	Christian Brauner, Linus Torvalds, Jann Horn, Oleg Nesterov,
	Arnd Bergmann, Eric W. Biederman, Kees Cook, Thomas Gleixner,
	David Howells, Michael Kerrisk (man-pages),
	Andy Lutomirsky, Andrew Morton, Aleksa Sarai, Al Viro, stable

On Mon, Jan 04, 2021 at 02:17:40PM +0100, Greg Kroah-Hartman wrote:
> On Mon, Jan 04, 2021 at 02:13:42PM +0100, Christian Brauner wrote:
> > On Mon, Jan 04, 2021 at 02:03:14PM +0100, Greg Kroah-Hartman wrote:
> > > On Fri, Dec 04, 2020 at 02:31:55AM +0800, Wen Yang wrote:
> > > > From: Christian Brauner <christian@brauner.io>
> > > > 
> > > > [ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]
> > > > 
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call.  Linus originally suggested to implement this as a
> > > > new flag to clone() instead of making it a separate system call.  As
> > > > spotted by Linus, there is exactly one bit for clone() left.
> > > > 
> > > > CLONE_PIDFD creates file descriptors based on the anonymous inode
> > > > implementation in the kernel that will also be used to implement the new
> > > > mount api.  They serve as a simple opaque handle on pids.  Logically,
> > > > this makes it possible to interpret a pidfd differently, narrowing or
> > > > widening the scope of various operations (e.g. signal sending).  Thus, a
> > > > pidfd cannot just refer to a tgid, but also a tid, or in theory - given
> > > > appropriate flag arguments in relevant syscalls - a process group or
> > > > session. A pidfd does not represent a privilege.  This does not imply it
> > > > cannot ever be that way but for now this is not the case.
> > > > 
> > > > A pidfd comes with additional information in fdinfo if the kernel supports
> > > > procfs.  The fdinfo file contains the pid of the process in the callers
> > > > pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
> > > > 
> > > > As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
> > > > parent_tidptr argument of clone.  This has the advantage that we can
> > > > give back the associated pid and the pidfd at the same time.
> > > > 
> > > > To remove worries about missing metadata access this patchset comes with
> > > > a sample program that illustrates how a combination of CLONE_PIDFD, and
> > > > pidfd_send_signal() can be used to gain race-free access to process
> > > > metadata through /proc/<pid>.  The sample program can easily be
> > > > translated into a helper that would be suitable for inclusion in libc so
> > > > that users don't have to worry about writing it themselves.
> > > > 
> > > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Signed-off-by: Christian Brauner <christian@brauner.io>
> > > > Co-developed-by: Jann Horn <jannh@google.com>
> > > > Signed-off-by: Jann Horn <jannh@google.com>
> > > > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> > > > Cc: Arnd Bergmann <arnd@arndb.de>
> > > > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > Cc: Kees Cook <keescook@chromium.org>
> > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > Cc: David Howells <dhowells@redhat.com>
> > > > Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
> > > > Cc: Andy Lutomirsky <luto@kernel.org>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > > Cc: <stable@vger.kernel.org> # 4.9.x
> > > > (clone: fix up cherry-pick conflicts for b3e583825266)
> > > > Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
> > > > ---
> > > >  include/linux/pid.h        |   1 +
> > > >  include/uapi/linux/sched.h |   1 +
> > > >  kernel/fork.c              | 119 +++++++++++++++++++++++++++++++++++++++++++--
> > > >  3 files changed, 117 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/include/linux/pid.h b/include/linux/pid.h
> > > > index 97b745d..7599a78 100644
> > > > --- a/include/linux/pid.h
> > > > +++ b/include/linux/pid.h
> > > > @@ -73,6 +73,7 @@ struct pid_link
> > > >  	struct hlist_node node;
> > > >  	struct pid *pid;
> > > >  };
> > > > +extern const struct file_operations pidfd_fops;
> > > >  
> > > >  static inline struct pid *get_pid(struct pid *pid)
> > > >  {
> > > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > > > index 5f0fe01..ed6e31d 100644
> > > > --- a/include/uapi/linux/sched.h
> > > > +++ b/include/uapi/linux/sched.h
> > > > @@ -9,6 +9,7 @@
> > > >  #define CLONE_FS	0x00000200	/* set if fs info shared between processes */
> > > >  #define CLONE_FILES	0x00000400	/* set if open files shared between processes */
> > > >  #define CLONE_SIGHAND	0x00000800	/* set if signal handlers and blocked signals shared */
> > > > +#define CLONE_PIDFD	0x00001000	/* set if a pidfd should be placed in parent */
> > > >  #define CLONE_PTRACE	0x00002000	/* set if we want to let tracing continue on the child too */
> > > >  #define CLONE_VFORK	0x00004000	/* set if the parent wants the child to wake it up on mm_release */
> > > >  #define CLONE_PARENT	0x00008000	/* set if we want to have the same parent as the cloner */
> > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > index b64efec..076297a 100644
> > > > --- a/kernel/fork.c
> > > > +++ b/kernel/fork.c
> > > > @@ -11,7 +11,22 @@
> > > >   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
> > > >   */
> > > >  
> > > > +#include <linux/anon_inodes.h>
> > > >  #include <linux/slab.h>
> > > > +#if 0
> > > > +#include <linux/sched/autogroup.h>
> > > > +#include <linux/sched/mm.h>
> > > > +#include <linux/sched/coredump.h>
> > > > +#include <linux/sched/user.h>
> > > > +#include <linux/sched/numa_balancing.h>
> > > > +#include <linux/sched/stat.h>
> > > > +#include <linux/sched/task.h>
> > > > +#include <linux/sched/task_stack.h>
> > > > +#include <linux/sched/cputime.h>
> > > > +#include <linux/seq_file.h>
> > > > +#include <linux/rtmutex.h>
> > > > +>>>>>>> b3e58382... clone: add CLONE_PIDFD
> > > > +#endif
> > > 
> > > That looks odd :(
> > > 
> > > Can you please refresh this patch series, and make sure it is correct
> > > and resend it?
> > 
> > Uhm, this patch series has been merged at least a year ago so this looks
> > like an accidental send.
> > This probably isn't meant for upstream but for some alibaba specific
> > kernel I'd reckon.
> 
> This was ment for the 4.19.y kernel to solve a problem reported in patch
> 00/XX of the series.

Ah, ok. Then just as an fyi: the Cc line seems to indicate 4.9 and not
4.19.

Christian

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-01-04 13:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-03 18:31 [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
2020-12-03 18:31 ` [PATCH 01/10] clone: add CLONE_PIDFD Wen Yang
2021-01-04 13:03   ` Greg Kroah-Hartman
2021-01-04 13:13     ` Christian Brauner
2021-01-04 13:17       ` Greg Kroah-Hartman
2021-01-04 13:19         ` Christian Brauner
2020-12-03 18:31 ` [PATCH 02/10] pidfd: add polling support Wen Yang
2020-12-03 18:31 ` [PATCH 03/10] proc: Pass file mode to proc_pid_make_inode Wen Yang
2020-12-03 18:31 ` [PATCH 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces Wen Yang
2020-12-03 18:31 ` [PATCH 05/10] proc: use %u for pid printing and slightly less stack Wen Yang
2020-12-03 20:08   ` Alexey Dobriyan
2020-12-03 18:32 ` [PATCH 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Wen Yang
2020-12-03 18:32 ` [PATCH 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Wen Yang
2020-12-03 18:32 ` [PATCH 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about Wen Yang
2020-12-03 18:32 ` [PATCH 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache Wen Yang
2020-12-03 18:32 ` [PATCH 10/10] proc: Use a list of inodes to flush from proc Wen Yang
2020-12-17  2:26 ` [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Wen Yang
2020-12-31  9:22   ` Greg Kroah-Hartman
2021-01-04  4:15     ` Wen Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).