linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:58   ` Linus Torvalds
  2020-12-07 22:22   ` Al Viro
  2020-11-20 23:14 ` [PATCH v2 02/24] exec: Simplify unshare_files Eric W. Biederman
                   ` (23 subsequent siblings)
  24 siblings, 2 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Many moons ago the binfmts were doing some very questionable things
with file descriptors and an unsharing of the file descriptor table
was added to make things better[1][2].  The helper steal_lockss was
added to avoid breaking the userspace programs[3][4][6].

Unfortunately it turned out that steal_locks did not work for network
file systems[5], so it was removed to see if anyone would
complain[7][8].  It was thought at the time that NPTL would not be
affected as the unshare_files happened after the other threads were
killed[8].  Unfortunately because there was an unshare_files in
binfmt_elf.c before the threads were killed this analysis was
incorrect.

This unshare_files in binfmt_elf.c resulted in the unshares_files
happening whenever threads were present.  Which led to unshare_files
being moved to the start of do_execve[9].

Later the problems were rediscovered and the suggested approach was to
readd steal_locks under a different name[10].  I happened to be
reviewing patches and I noticed that this approach was a step
backwards[11].

I proposed simply moving unshare_files[12] and it was pointed
out that moving unshare_files without auditing the code was
also unsafe[13].

There were then several attempts to solve this[14][15][16] and I even
posted this set of changes[17].  Unfortunately because auditing all of
execve is time consuming this change did not make it in at the time.

Well now that I am cleaning up exec I have made the time to read
through all of the binfmts and the only playing with file descriptors
is either the security modules closing them in
security_bprm_committing_creds or is in the generic code in fs/exec.c.
None of it happens before begin_new_exec is called.

So move unshare_files into begin_new_exec, after the point of no
return.  If memory is very very very low and the application calling
exec is sharing file descriptor tables between processes we might fail
past the point of no return.  Which is unfortunate but no different
than any of the other places where we allocate memory after the point
of no return.

This movement allows another process that shares the file table, or
another thread of the same process and that closes files or changes
their close on exec behavior and races with execve to cause some
unexpected things to happen.  There is only one time of check to time
of use race and it is just there so that execve fails instead of
an interpreter failing when it tries to open the file it is supposed
to be interpreting.   Failing later if userspace is being silly is
not a problem.

With this change it the following discription from the removal
of steal_locks[8] finally becomes true.

    Apps using NPTL are not affected, since all other threads are killed before
    execve.

    Apps using LinuxThreads are only affected if they

      - have multiple threads during exec (LinuxThreads doesn't kill other
        threads, the app may do it with pthread_kill_other_threads_np())
      - rely on POSIX locks being inherited across exec

    Both conditions are documented, but not their interaction.

    Apps using clone() natively are affected if they

      - use clone(CLONE_FILES)
      - rely on POSIX locks being inherited across exec

I have investigated some paths to make it possible to solve this
without moving unshare_files but they all look more complicated[18].

Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Reported-by: Jeff Layton <jlayton@redhat.com>
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
[1] 02cda956de0b ("[PATCH] unshare_files"
[2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
[3] 088f5d7244de ("[PATCH] add steal_locks helper")
[4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
[5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu
[6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org
[7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu
[8] c89681ed7d0e ("[PATCH] remove steal_locks()")
[9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()")
[10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org
[11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com
[12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com
[13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk
[14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org
[15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org
[16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org
[17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com
[18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..fa788988efe9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1238,6 +1238,7 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
 int begin_new_exec(struct linux_binprm * bprm)
 {
 	struct task_struct *me = current;
+	struct files_struct *displaced;
 	int retval;
 
 	/* Once we are committed compute the creds */
@@ -1257,6 +1258,13 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	/* Ensure the files table is not shared. */
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out;
+	if (displaced)
+		put_files_struct(displaced);
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1776,7 +1784,6 @@ static int bprm_execve(struct linux_binprm *bprm,
 		       int fd, struct filename *filename, int flags)
 {
 	struct file *file;
-	struct files_struct *displaced;
 	int retval;
 
 	/*
@@ -1784,13 +1791,9 @@ static int bprm_execve(struct linux_binprm *bprm,
 	 */
 	io_uring_task_cancel();
 
-	retval = unshare_files(&displaced);
-	if (retval)
-		return retval;
-
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
-		goto out_files;
+		return retval;
 
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
@@ -1805,8 +1808,12 @@ static int bprm_execve(struct linux_binprm *bprm,
 	bprm->file = file;
 	/*
 	 * Record that a name derived from an O_CLOEXEC fd will be
-	 * inaccessible after exec. Relies on having exclusive access to
-	 * current->files (due to unshare_files above).
+	 * inaccessible after exec.  This allows the code in exec to
+	 * choose to fail when the executable is not mmaped into the
+	 * interpreter and an open file descriptor is not passed to
+	 * the interpreter.  This makes for a better user experience
+	 * than having the interpreter start and then immediately fail
+	 * when it finds the executable is inaccessible.
 	 */
 	if (bprm->fdpath &&
 	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
@@ -1827,8 +1834,6 @@ static int bprm_execve(struct linux_binprm *bprm,
 	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
-	if (displaced)
-		put_files_struct(displaced);
 	return retval;
 
 out:
@@ -1845,10 +1850,6 @@ static int bprm_execve(struct linux_binprm *bprm,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
-out_files:
-	if (displaced)
-		reset_files_struct(displaced);
-
 	return retval;
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 02/24] exec: Simplify unshare_files
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
  2020-11-20 23:14 ` [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-23 17:50   ` Oleg Nesterov
  2020-11-20 23:14 ` [PATCH v2 03/24] exec: Remove reset_files_struct Eric W. Biederman
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Now that exec no longer needs to return the unshared files to their
previous value there is no reason to return displaced.

Instead when unshare_fd creates a copy of the file table, call
put_files_struct before returning from unshare_files.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c           |  5 +----
 fs/exec.c               |  5 +----
 include/linux/fdtable.h |  2 +-
 kernel/fork.c           | 12 ++++++------
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 0cd9056d79cc..abf807235262 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -585,7 +585,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	int ispipe;
 	size_t *argv = NULL;
 	int argc = 0;
-	struct files_struct *displaced;
 	/* require nonrelative corefile path and be extra careful */
 	bool need_suid_safe = false;
 	bool core_dumped = false;
@@ -791,11 +790,9 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */
-	retval = unshare_files(&displaced);
+	retval = unshare_files();
 	if (retval)
 		goto close_fail;
-	if (displaced)
-		put_files_struct(displaced);
 	if (!dump_interrupted()) {
 		/*
 		 * umh disabled with CONFIG_STATIC_USERMODEHELPER_PATH="" would
diff --git a/fs/exec.c b/fs/exec.c
index fa788988efe9..14fae2ec1c9d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1238,7 +1238,6 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
 int begin_new_exec(struct linux_binprm * bprm)
 {
 	struct task_struct *me = current;
-	struct files_struct *displaced;
 	int retval;
 
 	/* Once we are committed compute the creds */
@@ -1259,11 +1258,9 @@ int begin_new_exec(struct linux_binprm * bprm)
 		goto out;
 
 	/* Ensure the files table is not shared. */
-	retval = unshare_files(&displaced);
+	retval = unshare_files();
 	if (retval)
 		goto out;
-	if (displaced)
-		put_files_struct(displaced);
 
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index a32bf47c593e..f46a084b60b2 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -109,7 +109,7 @@ struct task_struct;
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
 void reset_files_struct(struct files_struct *);
-int unshare_files(struct files_struct **);
+int unshare_files(void);
 struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
diff --git a/kernel/fork.c b/kernel/fork.c
index 32083db7a2a2..837b546528c8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3023,21 +3023,21 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
  *	the exec layer of the kernel.
  */
 
-int unshare_files(struct files_struct **displaced)
+int unshare_files(void)
 {
 	struct task_struct *task = current;
-	struct files_struct *copy = NULL;
+	struct files_struct *old, *copy = NULL;
 	int error;
 
 	error = unshare_fd(CLONE_FILES, NR_OPEN_MAX, &copy);
-	if (error || !copy) {
-		*displaced = NULL;
+	if (error || !copy)
 		return error;
-	}
-	*displaced = task->files;
+
+	old = task->files;
 	task_lock(task);
 	task->files = copy;
 	task_unlock(task);
+	put_files_struct(old);
 	return 0;
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 03/24] exec: Remove reset_files_struct
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
  2020-11-20 23:14 ` [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 02/24] exec: Simplify unshare_files Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 04/24] kcmp: In kcmp_epoll_target use fget_task Eric W. Biederman
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Now that exec no longer needs to restore the previous value of current->files
on error there are no more callers of reset_files_struct so remove it.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-3-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 12 ------------
 include/linux/fdtable.h |  1 -
 2 files changed, 13 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 4559b5fec3bd..beae7c55c84c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -436,18 +436,6 @@ void put_files_struct(struct files_struct *files)
 	}
 }
 
-void reset_files_struct(struct files_struct *files)
-{
-	struct task_struct *tsk = current;
-	struct files_struct *old;
-
-	old = tsk->files;
-	task_lock(tsk);
-	tsk->files = files;
-	task_unlock(tsk);
-	put_files_struct(old);
-}
-
 void exit_files(struct task_struct *tsk)
 {
 	struct files_struct * files = tsk->files;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f46a084b60b2..7cc9885044d9 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -108,7 +108,6 @@ struct task_struct;
 
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
-void reset_files_struct(struct files_struct *);
 int unshare_files(void);
 struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 04/24] kcmp: In kcmp_epoll_target use fget_task
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (2 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 03/24] exec: Remove reset_files_struct Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 05/24] bpf: In bpf_task_fd_query " Eric W. Biederman
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Use the helper fget_task and simplify the code.

As well as simplifying the code this removes one unnecessary increment of
struct files_struct.  This unnecessary increment of files_struct.count can
result in exec unnecessarily unsharing files_struct and breaking posix
locks, and it can result in fget_light having to fallback to fget reducing
performance.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-4-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/kcmp.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index b3ff9288c6cc..87c48c0104ad 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -107,7 +107,6 @@ static int kcmp_epoll_target(struct task_struct *task1,
 {
 	struct file *filp, *filp_epoll, *filp_tgt;
 	struct kcmp_epoll_slot slot;
-	struct files_struct *files;
 
 	if (copy_from_user(&slot, uslot, sizeof(slot)))
 		return -EFAULT;
@@ -116,23 +115,12 @@ static int kcmp_epoll_target(struct task_struct *task1,
 	if (!filp)
 		return -EBADF;
 
-	files = get_files_struct(task2);
-	if (!files)
+	filp_epoll = fget_task(task2, slot.efd);
+	if (!filp_epoll)
 		return -EBADF;
 
-	spin_lock(&files->file_lock);
-	filp_epoll = fcheck_files(files, slot.efd);
-	if (filp_epoll)
-		get_file(filp_epoll);
-	else
-		filp_tgt = ERR_PTR(-EBADF);
-	spin_unlock(&files->file_lock);
-	put_files_struct(files);
-
-	if (filp_epoll) {
-		filp_tgt = get_epoll_tfile_raw_ptr(filp_epoll, slot.tfd, slot.toff);
-		fput(filp_epoll);
-	}
+	filp_tgt = get_epoll_tfile_raw_ptr(filp_epoll, slot.tfd, slot.toff);
+	fput(filp_epoll);
 
 	if (IS_ERR(filp_tgt))
 		return PTR_ERR(filp_tgt);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 05/24] bpf: In bpf_task_fd_query use fget_task
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (3 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 04/24] kcmp: In kcmp_epoll_target use fget_task Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 06/24] proc/fd: In proc_fd_link " Eric W. Biederman
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Use the helper fget_task to simplify bpf_task_fd_query.

As well as simplifying the code this removes one unnecessary increment of
struct files_struct.  This unnecessary increment of files_struct.count can
result in exec unnecessarily unsharing files_struct and breaking posix
locks, and it can result in fget_light having to fallback to fget reducing
performance.

This simplification comes from the observation that none of the
callers of get_files_struct actually need to call get_files_struct
that was made when discussing[1] exec and posix file locks.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/bpf/syscall.c | 20 +++-----------------
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8f50c9c19f1b..6d49c2e1634c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3878,7 +3878,6 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	pid_t pid = attr->task_fd_query.pid;
 	u32 fd = attr->task_fd_query.fd;
 	const struct perf_event *event;
-	struct files_struct *files;
 	struct task_struct *task;
 	struct file *file;
 	int err;
@@ -3896,23 +3895,11 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	if (!task)
 		return -ENOENT;
 
-	files = get_files_struct(task);
-	put_task_struct(task);
-	if (!files)
-		return -ENOENT;
-
 	err = 0;
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
+	file = fget_task(task, fd);
+	put_task_struct(task);
 	if (!file)
-		err = -EBADF;
-	else
-		get_file(file);
-	spin_unlock(&files->file_lock);
-	put_files_struct(files);
-
-	if (err)
-		goto out;
+		return -EBADF;
 
 	if (file->f_op == &bpf_link_fops) {
 		struct bpf_link *link = file->private_data;
@@ -3952,7 +3939,6 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	err = -ENOTSUPP;
 put_file:
 	fput(file);
-out:
 	return err;
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 06/24] proc/fd: In proc_fd_link use fget_task
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (4 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 05/24] bpf: In bpf_task_fd_query " Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 07/24] file: Rename __fcheck_files to files_lookup_fd_raw Eric W. Biederman
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Simplifying proc_fd_link is a little bit tricky.  It is necessary to
know that there is a reference to fd_f	 ile while path_get is running.
This reference can either be guaranteed to exist either by locking the
fdtable as the code currently does or by taking a reference on the
file in question.

Use fget_task to remove the need for get_files_struct and
to take a reference to file in question.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/fd.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a13212d..d58960f6ee52 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -146,29 +146,22 @@ static const struct dentry_operations tid_fd_dentry_operations = {
 
 static int proc_fd_link(struct dentry *dentry, struct path *path)
 {
-	struct files_struct *files = NULL;
 	struct task_struct *task;
 	int ret = -ENOENT;
 
 	task = get_proc_task(d_inode(dentry));
 	if (task) {
-		files = get_files_struct(task);
-		put_task_struct(task);
-	}
-
-	if (files) {
 		unsigned int fd = proc_fd(d_inode(dentry));
 		struct file *fd_file;
 
-		spin_lock(&files->file_lock);
-		fd_file = fcheck_files(files, fd);
+		fd_file = fget_task(task, fd);
 		if (fd_file) {
 			*path = fd_file->f_path;
 			path_get(&fd_file->f_path);
 			ret = 0;
+			fput(fd_file);
 		}
-		spin_unlock(&files->file_lock);
-		put_files_struct(files);
+		put_task_struct(task);
 	}
 
 	return ret;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 07/24] file: Rename __fcheck_files to files_lookup_fd_raw
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (5 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 06/24] proc/fd: In proc_fd_link " Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 08/24] file: Factor files_lookup_fd_locked out of fcheck_files Eric W. Biederman
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

The function fcheck despite it's comment is poorly named
as it has no callers that only check it's return value.
All of fcheck's callers use the returned file descriptor.
The same is true for fcheck_files and __fcheck_files.

A new less confusing name is needed.  In addition the names
of these functions are confusing as they do not report
the kind of locks that are needed to be held when these
functions are called making error prone to use them.

To remedy this I am making the base functio name lookup_fd
and will and prefixes and sufficies to indicate the rest
of the context.

Name the function (previously called __fcheck_files) that proceeds
from a struct files_struct, looks up the struct file of a file
descriptor, and requires it's callers to verify all of the appropriate
locks are held files_lookup_fd_raw.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 2 +-
 include/linux/fdtable.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index beae7c55c84c..b5591efb87f5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -887,7 +887,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 	struct file *file;
 
 	if (atomic_read(&files->count) == 1) {
-		file = __fcheck_files(files, fd);
+		file = files_lookup_fd_raw(files, fd);
 		if (!file || unlikely(file->f_mode & mask))
 			return 0;
 		return (unsigned long)file;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 7cc9885044d9..639933f37da9 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -80,7 +80,7 @@ struct dentry;
 /*
  * The caller must ensure that fd table isn't shared or hold rcu or file lock
  */
-static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
+static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 
@@ -96,7 +96,7 @@ static inline struct file *fcheck_files(struct files_struct *files, unsigned int
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
 			   !lockdep_is_held(&files->file_lock),
 			   "suspicious rcu_dereference_check() usage");
-	return __fcheck_files(files, fd);
+	return files_lookup_fd_raw(files, fd);
 }
 
 /*
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 08/24] file: Factor files_lookup_fd_locked out of fcheck_files
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (6 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 07/24] file: Rename __fcheck_files to files_lookup_fd_raw Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu Eric W. Biederman
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

To make it easy to tell where files->file_lock protection is being
used when looking up a file create files_lookup_fd_locked.  Only allow
this function to be called with the file_lock held.

Update the callers of fcheck and fcheck_files that are called with the
files->file_lock held to call files_lookup_fd_locked instead.

Hopefully this makes it easier to quickly understand what is going on.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               |  2 +-
 fs/locks.c              | 14 ++++++++------
 fs/proc/fd.c            |  2 +-
 include/linux/fdtable.h |  7 +++++++
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index b5591efb87f5..9d0e91168be1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1098,7 +1098,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 
 	spin_lock(&files->file_lock);
 	err = expand_files(files, newfd);
-	file = fcheck(oldfd);
+	file = files_lookup_fd_locked(files, oldfd);
 	if (unlikely(!file))
 		goto Ebadf;
 	if (unlikely(err < 0)) {
diff --git a/fs/locks.c b/fs/locks.c
index 1f84a03601fe..148197c1b547 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2539,14 +2539,15 @@ int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
 	 */
 	if (!error && file_lock->fl_type != F_UNLCK &&
 	    !(file_lock->fl_flags & FL_OFDLCK)) {
+		struct files_struct *files = current->files;
 		/*
 		 * We need that spin_lock here - it prevents reordering between
 		 * update of i_flctx->flc_posix and check for it done in
 		 * close(). rcu_read_lock() wouldn't do.
 		 */
-		spin_lock(&current->files->file_lock);
-		f = fcheck(fd);
-		spin_unlock(&current->files->file_lock);
+		spin_lock(&files->file_lock);
+		f = files_lookup_fd_locked(files, fd);
+		spin_unlock(&files->file_lock);
 		if (f != filp) {
 			file_lock->fl_type = F_UNLCK;
 			error = do_lock_file_wait(filp, cmd, file_lock);
@@ -2670,14 +2671,15 @@ int fcntl_setlk64(unsigned int fd, struct file *filp, unsigned int cmd,
 	 */
 	if (!error && file_lock->fl_type != F_UNLCK &&
 	    !(file_lock->fl_flags & FL_OFDLCK)) {
+		struct files_struct *files = current->files;
 		/*
 		 * We need that spin_lock here - it prevents reordering between
 		 * update of i_flctx->flc_posix and check for it done in
 		 * close(). rcu_read_lock() wouldn't do.
 		 */
-		spin_lock(&current->files->file_lock);
-		f = fcheck(fd);
-		spin_unlock(&current->files->file_lock);
+		spin_lock(&files->file_lock);
+		f = files_lookup_fd_locked(files, fd);
+		spin_unlock(&files->file_lock);
 		if (f != filp) {
 			file_lock->fl_type = F_UNLCK;
 			error = do_lock_file_wait(filp, cmd, file_lock);
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index d58960f6ee52..2cca9bca3b3a 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -35,7 +35,7 @@ static int seq_show(struct seq_file *m, void *v)
 		unsigned int fd = proc_fd(m->private);
 
 		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		file = files_lookup_fd_locked(files, fd);
 		if (file) {
 			struct fdtable *fdt = files_fdtable(files);
 
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 639933f37da9..fda4b81dd735 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -91,6 +91,13 @@ static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsig
 	return NULL;
 }
 
+static inline struct file *files_lookup_fd_locked(struct files_struct *files, unsigned int fd)
+{
+	RCU_LOCKDEP_WARN(!lockdep_is_held(&files->file_lock),
+			   "suspicious rcu_dereference_check() usage");
+	return files_lookup_fd_raw(files, fd);
+}
+
 static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
 {
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (7 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 08/24] file: Factor files_lookup_fd_locked out of fcheck_files Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-12-07 22:46   ` Al Viro
  2020-11-20 23:14 ` [PATCH v2 10/24] file: Rename fcheck lookup_fd_rcu Eric W. Biederman
                   ` (15 subsequent siblings)
  24 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

This change renames fcheck_files to files_lookup_fd_rcu.  All of the
remaining callers take the rcu_read_lock before calling this function
so the _rcu suffix is appropriate.  This change also tightens up the
debug check to verify that all callers hold the rcu_read_lock.

All callers that used to call files_check with the files->file_lock
held have now been changed to call files_lookup_fd_locked.

This change of name has helped remind me of which locks and which
guarantees are in place helping me to catch bugs later in the
patchset.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/filesystems/files.rst | 6 +++---
 fs/file.c                           | 4 ++--
 fs/proc/fd.c                        | 4 ++--
 include/linux/fdtable.h             | 7 +++----
 kernel/bpf/task_iter.c              | 2 +-
 kernel/kcmp.c                       | 2 +-
 6 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index cbf8e57376bf..ea75acdb632c 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -62,7 +62,7 @@ the fdtable structure -
    be held.
 
 4. To look up the file structure given an fd, a reader
-   must use either fcheck() or fcheck_files() APIs. These
+   must use either fcheck() or files_lookup_fd_rcu() APIs. These
    take care of barrier requirements due to lock-free lookup.
 
    An example::
@@ -84,7 +84,7 @@ the fdtable structure -
    on ->f_count::
 
 	rcu_read_lock();
-	file = fcheck_files(files, fd);
+	file = files_lookup_fd_rcu(files, fd);
 	if (file) {
 		if (atomic_long_inc_not_zero(&file->f_count))
 			*fput_needed = 1;
@@ -104,7 +104,7 @@ the fdtable structure -
    lock-free, they must be installed using rcu_assign_pointer()
    API. If they are looked up lock-free, rcu_dereference()
    must be used. However it is advisable to use files_fdtable()
-   and fcheck()/fcheck_files() which take care of these issues.
+   and fcheck()/files_lookup_fd_rcu() which take care of these issues.
 
 7. While updating, the fdtable pointer must be looked up while
    holding files->file_lock. If ->file_lock is dropped, then
diff --git a/fs/file.c b/fs/file.c
index 9d0e91168be1..5861c4f89419 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -814,7 +814,7 @@ static struct file *__fget_files(struct files_struct *files, unsigned int fd,
 
 	rcu_read_lock();
 loop:
-	file = fcheck_files(files, fd);
+	file = files_lookup_fd_rcu(files, fd);
 	if (file) {
 		/* File object ref couldn't be taken.
 		 * dup2() atomicity guarantee is the reason
@@ -1127,7 +1127,7 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
 		int retval = oldfd;
 
 		rcu_read_lock();
-		if (!fcheck_files(files, oldfd))
+		if (!files_lookup_fd_rcu(files, oldfd))
 			retval = -EBADF;
 		rcu_read_unlock();
 		return retval;
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 2cca9bca3b3a..3dec44d7c5c5 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -90,7 +90,7 @@ static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
 		return false;
 
 	rcu_read_lock();
-	file = fcheck_files(files, fd);
+	file = files_lookup_fd_rcu(files, fd);
 	if (file)
 		*mode = file->f_mode;
 	rcu_read_unlock();
@@ -243,7 +243,7 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 		char name[10 + 1];
 		unsigned int len;
 
-		f = fcheck_files(files, fd);
+		f = files_lookup_fd_rcu(files, fd);
 		if (!f)
 			continue;
 		data.mode = f->f_mode;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index fda4b81dd735..fa8c402a7790 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -98,10 +98,9 @@ static inline struct file *files_lookup_fd_locked(struct files_struct *files, un
 	return files_lookup_fd_raw(files, fd);
 }
 
-static inline struct file *fcheck_files(struct files_struct *files, unsigned int fd)
+static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsigned int fd)
 {
-	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
-			   !lockdep_is_held(&files->file_lock),
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
 			   "suspicious rcu_dereference_check() usage");
 	return files_lookup_fd_raw(files, fd);
 }
@@ -109,7 +108,7 @@ static inline struct file *fcheck_files(struct files_struct *files, unsigned int
 /*
  * Check whether the specified fd has an open file.
  */
-#define fcheck(fd)	fcheck_files(current->files, fd)
+#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
 
 struct task_struct;
 
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 5b6af30bfbcd..5ab2ccfb96cb 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -183,7 +183,7 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 	for (; curr_fd < max_fds; curr_fd++) {
 		struct file *f;
 
-		f = fcheck_files(curr_files, curr_fd);
+		f = files_lookup_fd_rcu(curr_files, curr_fd);
 		if (!f)
 			continue;
 		if (!get_file_rcu(f))
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 87c48c0104ad..990717c1aed3 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -67,7 +67,7 @@ get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 	rcu_read_lock();
 
 	if (task->files)
-		file = fcheck_files(task->files, idx);
+		file = files_lookup_fd_rcu(task->files, idx);
 
 	rcu_read_unlock();
 	task_unlock(task);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 10/24] file: Rename fcheck lookup_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (8 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 11/24] file: Implement task_lookup_fd_rcu Eric W. Biederman
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Also remove the confusing comment about checking if a fd exists.  I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 Documentation/filesystems/files.rst          | 6 +++---
 arch/powerpc/platforms/cell/spufs/coredump.c | 2 +-
 fs/notify/dnotify/dnotify.c                  | 2 +-
 include/linux/fdtable.h                      | 8 ++++----
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index ea75acdb632c..bcf84459917f 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -62,7 +62,7 @@ the fdtable structure -
    be held.
 
 4. To look up the file structure given an fd, a reader
-   must use either fcheck() or files_lookup_fd_rcu() APIs. These
+   must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
    take care of barrier requirements due to lock-free lookup.
 
    An example::
@@ -70,7 +70,7 @@ the fdtable structure -
 	struct file *file;
 
 	rcu_read_lock();
-	file = fcheck(fd);
+	file = lookup_fd_rcu(fd);
 	if (file) {
 		...
 	}
@@ -104,7 +104,7 @@ the fdtable structure -
    lock-free, they must be installed using rcu_assign_pointer()
    API. If they are looked up lock-free, rcu_dereference()
    must be used. However it is advisable to use files_fdtable()
-   and fcheck()/files_lookup_fd_rcu() which take care of these issues.
+   and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.
 
 7. While updating, the fdtable pointer must be looked up while
    holding files->file_lock. If ->file_lock is dropped, then
diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
index 026c181a98c5..60b5583e9eaf 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -74,7 +74,7 @@ static struct spu_context *coredump_next_context(int *fd)
 	*fd = n - 1;
 
 	rcu_read_lock();
-	file = fcheck(*fd);
+	file = lookup_fd_rcu(*fd);
 	ctx = SPUFS_I(file_inode(file))->i_ctx;
 	get_spu_context(ctx);
 	rcu_read_unlock();
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 5dcda8f20c04..5486aaca60b0 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -327,7 +327,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 	}
 
 	rcu_read_lock();
-	f = fcheck(fd);
+	f = lookup_fd_rcu(fd);
 	rcu_read_unlock();
 
 	/* if (f != filp) means that we lost a race and another task/thread
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index fa8c402a7790..2a4a8fed536e 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -105,10 +105,10 @@ static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsig
 	return files_lookup_fd_raw(files, fd);
 }
 
-/*
- * Check whether the specified fd has an open file.
- */
-#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
+static inline struct file *lookup_fd_rcu(unsigned int fd)
+{
+	return files_lookup_fd_rcu(current->files, fd);
+}
 
 struct task_struct;
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 11/24] file: Implement task_lookup_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (9 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 10/24] file: Rename fcheck lookup_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-21 18:19   ` Cyrill Gorcunov
  2020-11-20 23:14 ` [PATCH v2 12/24] proc/fd: In tid_fd_mode use task_lookup_fd_rcu Eric W. Biederman
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for
querying an arbitrary process about a specific file.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 15 +++++++++++++++
 include/linux/fdtable.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 5861c4f89419..6448523ca29e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -865,6 +865,21 @@ struct file *fget_task(struct task_struct *task, unsigned int fd)
 	return file;
 }
 
+struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
+{
+	/* Must be called with rcu_read_lock held */
+	struct files_struct *files;
+	struct file *file = NULL;
+
+	task_lock(task);
+	files = task->files;
+	if (files)
+		file = files_lookup_fd_rcu(files, fd);
+	task_unlock(task);
+
+	return file;
+}
+
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
  *
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2a4a8fed536e..a0558ae9b40c 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -110,6 +110,8 @@ static inline struct file *lookup_fd_rcu(unsigned int fd)
 	return files_lookup_fd_rcu(current->files, fd);
 }
 
+struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd);
+
 struct task_struct;
 
 struct files_struct *get_files_struct(struct task_struct *);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 12/24] proc/fd: In tid_fd_mode use task_lookup_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (10 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 11/24] file: Implement task_lookup_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 13/24] kcmp: In get_file_raw_ptr " Eric W. Biederman
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Instead of manually coding finding the files struct for a task and
then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu
that combines those to steps.   Making the code simpler and removing
the need to get a reference on a files_struct.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/fd.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 3dec44d7c5c5..c1a984f3c4df 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -83,18 +83,13 @@ static const struct file_operations proc_fdinfo_file_operations = {
 
 static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
 {
-	struct files_struct *files = get_files_struct(task);
 	struct file *file;
 
-	if (!files)
-		return false;
-
 	rcu_read_lock();
-	file = files_lookup_fd_rcu(files, fd);
+	file = task_lookup_fd_rcu(task, fd);
 	if (file)
 		*mode = file->f_mode;
 	rcu_read_unlock();
-	put_files_struct(files);
 	return !!file;
 }
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 13/24] kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (11 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 12/24] proc/fd: In tid_fd_mode use task_lookup_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-23 21:05   ` Cyrill Gorcunov
  2020-11-20 23:14 ` [PATCH v2 14/24] file: Implement task_lookup_next_fd_rcu Eric W. Biederman
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Modify get_file_raw_ptr to use task_lookup_fd_rcu.  The helper
task_lookup_fd_rcu does the work of taking the task lock and verifying
that task->files != NULL and then calls files_lookup_fd_rcu.  So let
use the helper to make a simpler implementation of get_file_raw_ptr.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/kcmp.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 990717c1aed3..36e58eb5a11d 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -61,16 +61,11 @@ static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
 static struct file *
 get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 {
-	struct file *file = NULL;
+	struct file *file;
 
-	task_lock(task);
 	rcu_read_lock();
-
-	if (task->files)
-		file = files_lookup_fd_rcu(task->files, idx);
-
+	file = task_lookup_fd_rcu(task, idx);
 	rcu_read_unlock();
-	task_unlock(task);
 
 	return file;
 }
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 14/24] file: Implement task_lookup_next_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (12 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 13/24] kcmp: In get_file_raw_ptr " Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu Eric W. Biederman
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

As a companion to fget_task and task_lookup_fd_rcu implement
task_lookup_next_fd_rcu that will return the struct file for the first
file descriptor number that is equal or greater than the fd argument
value, or NULL if there is no such struct file.

This allows file descriptors of foreign processes to be iterated
through safely, without needed to increment the count on files_struct.

Some concern[1] has been expressed that this function takes the task_lock
for each iteration and thus for each file descriptor.  This place
where this function will be called in a commonly used code path is for
listing /proc/<pid>/fd.  I did some small benchmarks and did not see
any measurable performance differences.  For ordinary users ls is
likely to stat each of the directory entries and tid_fd_mode called
from tid_fd_revalidae has always taken the task lock for each file
descriptor.  So this does not look like it will be a big change in
practice.

At some point is will probably be worth changing put_files_struct to
free files_struct after an rcu grace period so that task_lock won't be
needed at all.

[1] https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com
v1: https://lkml.kernel.org/r/20200817220425.9389-9-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 21 +++++++++++++++++++++
 include/linux/fdtable.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 6448523ca29e..23b888a4acbe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -880,6 +880,27 @@ struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
 	return file;
 }
 
+struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret_fd)
+{
+	/* Must be called with rcu_read_lock held */
+	struct files_struct *files;
+	unsigned int fd = *ret_fd;
+	struct file *file = NULL;
+
+	task_lock(task);
+	files = task->files;
+	if (files) {
+		for (; fd < files_fdtable(files)->max_fds; fd++) {
+			file = files_lookup_fd_rcu(files, fd);
+			if (file)
+				break;
+		}
+	}
+	task_unlock(task);
+	*ret_fd = fd;
+	return file;
+}
+
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
  *
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index a0558ae9b40c..cf6c52dae3a1 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -111,6 +111,7 @@ static inline struct file *lookup_fd_rcu(unsigned int fd)
 }
 
 struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd);
+struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *fd);
 
 struct task_struct;
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (13 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 14/24] file: Implement task_lookup_next_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-12-07 23:29   ` Al Viro
  2020-11-20 23:14 ` [PATCH v2 16/24] bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu Eric W. Biederman
                   ` (9 subsequent siblings)
  24 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman,
	Andy Lavr

When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving
the checking for the maximum file descritor into the generic code, and
by remvoing the need for capturing and releasing a reference on
files_struct.

As task_lookup_fd_rcu may update the fd ctx->pos has been changed
to be the fd +2 after task_lookup_fd_rcu returns.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Andy Lavr <andy.lavr@gmail.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/proc/fd.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index c1a984f3c4df..72c1525b4b3e 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -217,7 +217,6 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 			      instantiate_t instantiate)
 {
 	struct task_struct *p = get_proc_task(file_inode(file));
-	struct files_struct *files;
 	unsigned int fd;
 
 	if (!p)
@@ -225,22 +224,18 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 
 	if (!dir_emit_dots(file, ctx))
 		goto out;
-	files = get_files_struct(p);
-	if (!files)
-		goto out;
 
 	rcu_read_lock();
-	for (fd = ctx->pos - 2;
-	     fd < files_fdtable(files)->max_fds;
-	     fd++, ctx->pos++) {
+	for (fd = ctx->pos - 2;; fd++) {
 		struct file *f;
 		struct fd_data data;
 		char name[10 + 1];
 		unsigned int len;
 
-		f = files_lookup_fd_rcu(files, fd);
+		f = task_lookup_next_fd_rcu(p, &fd);
+		ctx->pos = fd + 2LL;
 		if (!f)
-			continue;
+			break;
 		data.mode = f->f_mode;
 		rcu_read_unlock();
 		data.fd = fd;
@@ -249,13 +244,11 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 		if (!proc_fill_cache(file, ctx,
 				     name, len, instantiate, p,
 				     &data))
-			goto out_fd_loop;
+			goto out;
 		cond_resched();
 		rcu_read_lock();
 	}
 	rcu_read_unlock();
-out_fd_loop:
-	put_files_struct(files);
 out:
 	put_task_struct(p);
 	return 0;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 16/24] bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (14 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-23 17:06   ` Yonghong Song
  2020-11-20 23:14 ` [PATCH v2 17/24] proc/fd: In fdinfo seq_show don't use get_files_struct Eric W. Biederman
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next, by
moving the checking for the maximum file descritor into the generic
code, and by remvoing the need for capturing and releasing a reference
on files_struct.  As the reference count of files_struct no longer
needs to be maintained bpf_iter_seq_task_file_info can have it's files
member removed and task_file_seq_get_next no longer needs it's fstruct
argument.

The curr_fd local variable does need to become unsigned to be used
with fnext_task.  As curr_fd is assigned from and assigned a u32
making curr_fd an unsigned int won't cause problems and might prevent
them.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-11-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/bpf/task_iter.c | 44 ++++++++++--------------------------------
 1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 5ab2ccfb96cb..4ec63170c741 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -130,45 +130,33 @@ struct bpf_iter_seq_task_file_info {
 	 */
 	struct bpf_iter_seq_task_common common;
 	struct task_struct *task;
-	struct files_struct *files;
 	u32 tid;
 	u32 fd;
 };
 
 static struct file *
 task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
-		       struct task_struct **task, struct files_struct **fstruct)
+		       struct task_struct **task)
 {
 	struct pid_namespace *ns = info->common.ns;
-	u32 curr_tid = info->tid, max_fds;
-	struct files_struct *curr_files;
+	u32 curr_tid = info->tid;
 	struct task_struct *curr_task;
-	int curr_fd = info->fd;
+	unsigned int curr_fd = info->fd;
 
 	/* If this function returns a non-NULL file object,
-	 * it held a reference to the task/files_struct/file.
+	 * it held a reference to the task/file.
 	 * Otherwise, it does not hold any reference.
 	 */
 again:
 	if (*task) {
 		curr_task = *task;
-		curr_files = *fstruct;
 		curr_fd = info->fd;
 	} else {
 		curr_task = task_seq_get_next(ns, &curr_tid, true);
 		if (!curr_task)
 			return NULL;
 
-		curr_files = get_files_struct(curr_task);
-		if (!curr_files) {
-			put_task_struct(curr_task);
-			curr_tid = ++(info->tid);
-			info->fd = 0;
-			goto again;
-		}
-
-		/* set *fstruct, *task and info->tid */
-		*fstruct = curr_files;
+		/* set *task and info->tid */
 		*task = curr_task;
 		if (curr_tid == info->tid) {
 			curr_fd = info->fd;
@@ -179,13 +167,11 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 	}
 
 	rcu_read_lock();
-	max_fds = files_fdtable(curr_files)->max_fds;
-	for (; curr_fd < max_fds; curr_fd++) {
+	for (;; curr_fd++) {
 		struct file *f;
-
-		f = files_lookup_fd_rcu(curr_files, curr_fd);
+		f = task_lookup_next_fd_rcu(curr_task, &curr_fd);
 		if (!f)
-			continue;
+			break;
 		if (!get_file_rcu(f))
 			continue;
 
@@ -197,10 +183,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 
 	/* the current task is done, go to the next task */
 	rcu_read_unlock();
-	put_files_struct(curr_files);
 	put_task_struct(curr_task);
 	*task = NULL;
-	*fstruct = NULL;
 	info->fd = 0;
 	curr_tid = ++(info->tid);
 	goto again;
@@ -209,13 +193,11 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
 static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
 {
 	struct bpf_iter_seq_task_file_info *info = seq->private;
-	struct files_struct *files = NULL;
 	struct task_struct *task = NULL;
 	struct file *file;
 
-	file = task_file_seq_get_next(info, &task, &files);
+	file = task_file_seq_get_next(info, &task);
 	if (!file) {
-		info->files = NULL;
 		info->task = NULL;
 		return NULL;
 	}
@@ -223,7 +205,6 @@ static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
 	if (*pos == 0)
 		++*pos;
 	info->task = task;
-	info->files = files;
 
 	return file;
 }
@@ -231,22 +212,19 @@ static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
 static void *task_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct bpf_iter_seq_task_file_info *info = seq->private;
-	struct files_struct *files = info->files;
 	struct task_struct *task = info->task;
 	struct file *file;
 
 	++*pos;
 	++info->fd;
 	fput((struct file *)v);
-	file = task_file_seq_get_next(info, &task, &files);
+	file = task_file_seq_get_next(info, &task);
 	if (!file) {
-		info->files = NULL;
 		info->task = NULL;
 		return NULL;
 	}
 
 	info->task = task;
-	info->files = files;
 
 	return file;
 }
@@ -295,9 +273,7 @@ static void task_file_seq_stop(struct seq_file *seq, void *v)
 		(void)__task_file_seq_show(seq, v, true);
 	} else {
 		fput((struct file *)v);
-		put_files_struct(info->files);
 		put_task_struct(info->task);
-		info->files = NULL;
 		info->task = NULL;
 	}
 }
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 17/24] proc/fd: In fdinfo seq_show don't use get_files_struct
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (15 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 16/24] bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 18/24] file: Merge __fd_install into fd_install Eric W. Biederman
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Instead hold task_lock for the duration that task->files needs to be
stable in seq_show.  The task_lock was already taken in
get_files_struct, and so skipping get_files_struct performs less work
overall, and avoids the problems with the files_struct reference
count.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-12-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/fd.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 72c1525b4b3e..cb51763ed554 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -28,9 +28,8 @@ static int seq_show(struct seq_file *m, void *v)
 	if (!task)
 		return -ENOENT;
 
-	files = get_files_struct(task);
-	put_task_struct(task);
-
+	task_lock(task);
+	files = task->files;
 	if (files) {
 		unsigned int fd = proc_fd(m->private);
 
@@ -47,8 +46,9 @@ static int seq_show(struct seq_file *m, void *v)
 			ret = 0;
 		}
 		spin_unlock(&files->file_lock);
-		put_files_struct(files);
 	}
+	task_unlock(task);
+	put_task_struct(task);
 
 	if (ret)
 		return ret;
@@ -57,6 +57,7 @@ static int seq_show(struct seq_file *m, void *v)
 		   (long long)file->f_pos, f_flags,
 		   real_mount(file->f_path.mnt)->mnt_id);
 
+	/* show_fd_locks() never deferences files so a stale value is safe */
 	show_fd_locks(m, file, files);
 	if (seq_has_overflowed(m))
 		goto out;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 18/24] file: Merge __fd_install into fd_install
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (16 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 17/24] proc/fd: In fdinfo seq_show don't use get_files_struct Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 19/24] file: In f_dupfd read RLIMIT_NOFILE once Eric W. Biederman
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

The function __fd_install was added to support binder[1].  With binder
fixed[2] there are no more users.

As fd_install just calls __fd_install with "files=current->files",
merge them together by transforming the files parameter into a
local variable initialized to current->files.

[1] f869e8a7f753 ("expose a low-level variant of fd_install() for binder")
[2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1:https://lkml.kernel.org/r/20200817220425.9389-14-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 25 ++++++-------------------
 include/linux/fdtable.h |  2 --
 2 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 23b888a4acbe..0d4ec0fa23b3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -158,7 +158,7 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	spin_unlock(&files->file_lock);
 	new_fdt = alloc_fdtable(nr);
 
-	/* make sure all __fd_install() have seen resize_in_progress
+	/* make sure all fd_install() have seen resize_in_progress
 	 * or have finished their rcu_read_lock_sched() section.
 	 */
 	if (atomic_read(&files->count) > 1)
@@ -181,7 +181,7 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	rcu_assign_pointer(files->fdt, new_fdt);
 	if (cur_fdt != &files->fdtab)
 		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
-	/* coupled with smp_rmb() in __fd_install() */
+	/* coupled with smp_rmb() in fd_install() */
 	smp_wmb();
 	return 1;
 }
@@ -584,17 +584,13 @@ EXPORT_SYMBOL(put_unused_fd);
  * It should never happen - if we allow dup2() do it, _really_ bad things
  * will follow.
  *
- * NOTE: __fd_install() variant is really, really low-level; don't
- * use it unless you are forced to by truly lousy API shoved down
- * your throat.  'files' *MUST* be either current->files or obtained
- * by get_files_struct(current) done by whoever had given it to you,
- * or really bad things will happen.  Normally you want to use
- * fd_install() instead.
+ * This consumes the "file" refcount, so callers should treat it
+ * as if they had called fput(file).
  */
 
-void __fd_install(struct files_struct *files, unsigned int fd,
-		struct file *file)
+void fd_install(unsigned int fd, struct file *file)
 {
+	struct files_struct *files = current->files;
 	struct fdtable *fdt;
 
 	rcu_read_lock_sched();
@@ -616,15 +612,6 @@ void __fd_install(struct files_struct *files, unsigned int fd,
 	rcu_read_unlock_sched();
 }
 
-/*
- * This consumes the "file" refcount, so callers should treat it
- * as if they had called fput(file).
- */
-void fd_install(unsigned int fd, struct file *file)
-{
-	__fd_install(current->files, fd, file);
-}
-
 EXPORT_SYMBOL(fd_install);
 
 static struct file *pick_file(struct files_struct *files, unsigned fd)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index cf6c52dae3a1..a5ec736d74a5 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -126,8 +126,6 @@ int iterate_fd(struct files_struct *, unsigned,
 
 extern int __alloc_fd(struct files_struct *files,
 		      unsigned start, unsigned end, unsigned flags);
-extern void __fd_install(struct files_struct *files,
-		      unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
 extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 19/24] file: In f_dupfd read RLIMIT_NOFILE once.
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (17 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 18/24] file: Merge __fd_install into fd_install Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 20/24] file: Merge __alloc_fd into alloc_fd Eric W. Biederman
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

Simplify the code, and remove the chance of races by reading
RLIMIT_NOFILE only once in f_dupfd.

Pass the read value of RLIMIT_NOFILE into alloc_fd which is the other
location the rlimit was read in f_dupfd.  As f_dupfd is the only
caller of alloc_fd this changing alloc_fd is trivially safe.

Further this causes alloc_fd to take all of the same arguments as
__alloc_fd except for the files_struct argument.

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-15-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 0d4ec0fa23b3..07e25f1b9dfd 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -538,9 +538,9 @@ int __alloc_fd(struct files_struct *files,
 	return error;
 }
 
-static int alloc_fd(unsigned start, unsigned flags)
+static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 {
-	return __alloc_fd(current->files, start, rlimit(RLIMIT_NOFILE), flags);
+	return __alloc_fd(current->files, start, end, flags);
 }
 
 int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
@@ -1175,10 +1175,11 @@ SYSCALL_DEFINE1(dup, unsigned int, fildes)
 
 int f_dupfd(unsigned int from, struct file *file, unsigned flags)
 {
+	unsigned long nofile = rlimit(RLIMIT_NOFILE);
 	int err;
-	if (from >= rlimit(RLIMIT_NOFILE))
+	if (from >= nofile)
 		return -EINVAL;
-	err = alloc_fd(from, flags);
+	err = alloc_fd(from, nofile, flags);
 	if (err >= 0) {
 		get_file(file);
 		fd_install(err, file);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 20/24] file: Merge __alloc_fd into alloc_fd
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (18 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 19/24] file: In f_dupfd read RLIMIT_NOFILE once Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 21/24] file: Rename __close_fd to close_fd and remove the files parameter Eric W. Biederman
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

The function __alloc_fd was added to support binder[1].  With binder
fixed[2] there are no more users.

As alloc_fd just calls __alloc_fd with "files=current->files",
merge them together by transforming the files parameter into a
local variable initialized to current->files.

[1] dcfadfa4ec5a ("new helper: __alloc_fd()")
[2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-16-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 11 +++--------
 include/linux/fdtable.h |  2 --
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 07e25f1b9dfd..621563701bd9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -480,9 +480,9 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 /*
  * allocate a file descriptor, mark it busy.
  */
-int __alloc_fd(struct files_struct *files,
-	       unsigned start, unsigned end, unsigned flags)
+static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 {
+	struct files_struct *files = current->files;
 	unsigned int fd;
 	int error;
 	struct fdtable *fdt;
@@ -538,14 +538,9 @@ int __alloc_fd(struct files_struct *files,
 	return error;
 }
 
-static int alloc_fd(unsigned start, unsigned end, unsigned flags)
-{
-	return __alloc_fd(current->files, start, end, flags);
-}
-
 int __get_unused_fd_flags(unsigned flags, unsigned long nofile)
 {
-	return __alloc_fd(current->files, 0, nofile, flags);
+	return alloc_fd(0, nofile, flags);
 }
 
 int get_unused_fd_flags(unsigned flags)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index a5ec736d74a5..dc476ae92f56 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -124,8 +124,6 @@ int iterate_fd(struct files_struct *, unsigned,
 		int (*)(const void *, struct file *, unsigned),
 		const void *);
 
-extern int __alloc_fd(struct files_struct *files,
-		      unsigned start, unsigned end, unsigned flags);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
 extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 21/24] file: Rename __close_fd to close_fd and remove the files parameter
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (19 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 20/24] file: Merge __alloc_fd into alloc_fd Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 22/24] file: Replace ksys_close with close_fd Eric W. Biederman
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

The function __close_fd was added to support binder[1].  Now that
binder has been fixed to no longer need __close_fd[2] all calls
to __close_fd pass current->files.

Therefore transform the files parameter into a local variable
initialized to current->files, and rename __close_fd to close_fd to
reflect this change, and keep it in sync with the similar changes to
__alloc_fd, and __fd_install.

This removes the need for callers to care about the extra care that
needs to be take if anything except current->files is passed, by
limiting the callers to only operation on current->files.

[1] 483ce1d4b8c3 ("take descriptor-related part of close() to file.c")
[2] 44d8047f1d87 ("binder: use standard functions to allocate fds")
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c                | 10 ++++------
 fs/open.c                |  2 +-
 include/linux/fdtable.h  |  3 +--
 include/linux/syscalls.h |  6 +++---
 4 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 621563701bd9..987ea51630b4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -629,11 +629,9 @@ static struct file *pick_file(struct files_struct *files, unsigned fd)
 	return file;
 }
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+int close_fd(unsigned fd)
 {
+	struct files_struct *files = current->files;
 	struct file *file;
 
 	file = pick_file(files, fd);
@@ -642,7 +640,7 @@ int __close_fd(struct files_struct *files, unsigned fd)
 
 	return filp_close(file, files);
 }
-EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
+EXPORT_SYMBOL(close_fd); /* for ksys_close() */
 
 /**
  * __close_range() - Close all file descriptors in a given range.
@@ -1027,7 +1025,7 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
 	struct files_struct *files = current->files;
 
 	if (!file)
-		return __close_fd(files, fd);
+		return close_fd(fd);
 
 	if (fd >= rlimit(RLIMIT_NOFILE))
 		return -EBADF;
diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..581a674d7eee 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1292,7 +1292,7 @@ EXPORT_SYMBOL(filp_close);
  */
 SYSCALL_DEFINE1(close, unsigned int, fd)
 {
-	int retval = __close_fd(current->files, fd);
+	int retval = close_fd(fd);
 
 	/* can't restart close syscall because file table entry was cleared */
 	if (unlikely(retval == -ERESTARTSYS ||
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index dc476ae92f56..dad4a488ce60 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -124,8 +124,7 @@ int iterate_fd(struct files_struct *, unsigned,
 		int (*)(const void *, struct file *, unsigned),
 		const void *);
 
-extern int __close_fd(struct files_struct *files,
-		      unsigned int fd);
+extern int close_fd(unsigned int fd);
 extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 extern int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 37bea07c12f2..9e055a0579f8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1295,16 +1295,16 @@ static inline long ksys_ftruncate(unsigned int fd, loff_t length)
 	return do_sys_ftruncate(fd, length, 1);
 }
 
-extern int __close_fd(struct files_struct *files, unsigned int fd);
+extern int close_fd(unsigned int fd);
 
 /*
  * In contrast to sys_close(), this stub does not check whether the syscall
  * should or should not be restarted, but returns the raw error codes from
- * __close_fd().
+ * close_fd().
  */
 static inline int ksys_close(unsigned int fd)
 {
-	return __close_fd(current->files, fd);
+	return close_fd(fd);
 }
 
 extern long do_sys_truncate(const char __user *pathname, loff_t length);
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 22/24] file: Replace ksys_close with close_fd
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (20 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 21/24] file: Rename __close_fd to close_fd and remove the files parameter Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 23/24] file: Rename __close_fd_get_file close_fd_get_file Eric W. Biederman
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman,
	Christoph Hellwig

Now that ksys_close is exactly identical to close_fd replace
the one caller of ksys_close with close_fd.

[1] https://lkml.kernel.org/r/20200818112020.GA17080@infradead.org
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/autofs/dev-ioctl.c    |  5 +++--
 include/linux/syscalls.h | 12 ------------
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/autofs/dev-ioctl.c b/fs/autofs/dev-ioctl.c
index 322b7dfb4ea0..5bf781ea6d67 100644
--- a/fs/autofs/dev-ioctl.c
+++ b/fs/autofs/dev-ioctl.c
@@ -4,9 +4,10 @@
  * Copyright 2008 Ian Kent <raven@themaw.net>
  */
 
+#include <linux/module.h>
 #include <linux/miscdevice.h>
 #include <linux/compat.h>
-#include <linux/syscalls.h>
+#include <linux/fdtable.h>
 #include <linux/magic.h>
 #include <linux/nospec.h>
 
@@ -289,7 +290,7 @@ static int autofs_dev_ioctl_closemount(struct file *fp,
 				       struct autofs_sb_info *sbi,
 				       struct autofs_dev_ioctl *param)
 {
-	return ksys_close(param->ioctlfd);
+	return close_fd(param->ioctlfd);
 }
 
 /*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 9e055a0579f8..0f72f380db72 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1295,18 +1295,6 @@ static inline long ksys_ftruncate(unsigned int fd, loff_t length)
 	return do_sys_ftruncate(fd, length, 1);
 }
 
-extern int close_fd(unsigned int fd);
-
-/*
- * In contrast to sys_close(), this stub does not check whether the syscall
- * should or should not be restarted, but returns the raw error codes from
- * close_fd().
- */
-static inline int ksys_close(unsigned int fd)
-{
-	return close_fd(fd);
-}
-
 extern long do_sys_truncate(const char __user *pathname, loff_t length);
 
 static inline long ksys_truncate(const char __user *pathname, loff_t length)
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 23/24] file: Rename __close_fd_get_file close_fd_get_file
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (21 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 22/24] file: Replace ksys_close with close_fd Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-20 23:14 ` [PATCH v2 24/24] file: Remove get_files_struct Eric W. Biederman
  2020-11-21  0:05 ` [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct Linus Torvalds
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

The function close_fd_get_file is explicitly a variant of
__close_fd[1].  Now that __close_fd has been renamed close_fd, rename
close_fd_get_file to be consistent with close_fd.

When __alloc_fd, __close_fd and __fd_install were introduced the
double underscore indicated that the function took a struct
files_struct parameter.  The function __close_fd_get_file never has so
the naming has always been inconsistent.  This just cleans things up
so there are not any lingering mentions or references __close_fd left
in the code.

[1] 80cd795630d6 ("binder: fix use-after-free due to ksys_close() during fdget()")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/android/binder.c | 2 +-
 fs/file.c                | 4 ++--
 fs/io_uring.c            | 2 +-
 include/linux/fdtable.h  | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index b5117576792b..5b2e3721ac27 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -2226,7 +2226,7 @@ static void binder_deferred_fd_close(int fd)
 	if (!twcb)
 		return;
 	init_task_work(&twcb->twork, binder_do_fd_close);
-	__close_fd_get_file(fd, &twcb->file);
+	close_fd_get_file(fd, &twcb->file);
 	if (twcb->file) {
 		filp_close(twcb->file, current->files);
 		task_work_add(current, &twcb->twork, TWA_RESUME);
diff --git a/fs/file.c b/fs/file.c
index 987ea51630b4..947ac6d5602f 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -721,11 +721,11 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
 }
 
 /*
- * variant of __close_fd that gets a ref on the file for later fput.
+ * variant of close_fd that gets a ref on the file for later fput.
  * The caller must ensure that filp_close() called on the file, and then
  * an fput().
  */
-int __close_fd_get_file(unsigned int fd, struct file **res)
+int close_fd_get_file(unsigned int fd, struct file **res)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index b42dfa0243bf..2e6e51292eff 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4206,7 +4206,7 @@ static int io_close(struct io_kiocb *req, bool force_nonblock,
 
 	/* might be already done during nonblock submission */
 	if (!close->put_file) {
-		ret = __close_fd_get_file(close->fd, &close->put_file);
+		ret = close_fd_get_file(close->fd, &close->put_file);
 		if (ret < 0)
 			return (ret == -ENOENT) ? -EBADF : ret;
 	}
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index dad4a488ce60..4ed3589f9294 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -126,7 +126,7 @@ int iterate_fd(struct files_struct *, unsigned,
 
 extern int close_fd(unsigned int fd);
 extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
-extern int __close_fd_get_file(unsigned int fd, struct file **res);
+extern int close_fd_get_file(unsigned int fd, struct file **res);
 extern int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
 		      struct files_struct **new_fdp);
 
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH v2 24/24] file: Remove get_files_struct
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (22 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 23/24] file: Rename __close_fd_get_file close_fd_get_file Eric W. Biederman
@ 2020-11-20 23:14 ` Eric W. Biederman
  2020-11-21  0:05 ` [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct Linus Torvalds
  24 siblings, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-20 23:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Eric W. Biederman

When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.

Now that get_files_struct has no more users and can not cause the
problems for posix file locking and fget_light remove get_files_struct
so that it does not gain any new users.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 13 -------------
 include/linux/fdtable.h |  1 -
 2 files changed, 14 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 947ac6d5602f..412033d8cfdf 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -411,19 +411,6 @@ static struct fdtable *close_files(struct files_struct * files)
 	return fdt;
 }
 
-struct files_struct *get_files_struct(struct task_struct *task)
-{
-	struct files_struct *files;
-
-	task_lock(task);
-	files = task->files;
-	if (files)
-		atomic_inc(&files->count);
-	task_unlock(task);
-
-	return files;
-}
-
 void put_files_struct(struct files_struct *files)
 {
 	if (atomic_dec_and_test(&files->count)) {
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 4ed3589f9294..d0e78174874a 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -115,7 +115,6 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *fd)
 
 struct task_struct;
 
-struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
 int unshare_files(void);
 struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
-- 
2.25.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec
  2020-11-20 23:14 ` [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec Eric W. Biederman
@ 2020-11-20 23:58   ` Linus Torvalds
  2020-12-07 22:22   ` Al Viro
  1 sibling, 0 replies; 77+ messages in thread
From: Linus Torvalds @ 2020-11-20 23:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, linux-fsdevel, criu, bpf,
	Alexander Viro, Christian Brauner, Oleg Nesterov,
	Cyrill Gorcunov, Jann Horn, Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 3:16 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> @@ -1257,6 +1258,13 @@ int begin_new_exec(struct linux_binprm * bprm)
>         if (retval)
>                 goto out;
>
> +       /* Ensure the files table is not shared. */
> +       retval = unshare_files(&displaced);
> +       if (retval)
> +               goto out;
> +       if (displaced)
> +               put_files_struct(displaced);

It's not obvious from the patch (not enough context), but the new
placement seems to make much more sense - and it's where we do the
de-thread and switch the vm and signals too.

So this does seem to be the much more logical place.

Ack.

      Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct
       [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
                   ` (23 preceding siblings ...)
  2020-11-20 23:14 ` [PATCH v2 24/24] file: Remove get_files_struct Eric W. Biederman
@ 2020-11-21  0:05 ` Linus Torvalds
  2020-11-28  5:12   ` Al Viro
  24 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2020-11-21  0:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, linux-fsdevel, criu, bpf,
	Alexander Viro, Christian Brauner, Oleg Nesterov,
	Cyrill Gorcunov, Jann Horn, Kees Cook, Daniel P. Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 3:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> This set of changes cleanups of the code in exec so hopefully this code
> will not regress again.  Then it adds helpers and fixes the users of
> files_struct so the reference count is only incremented if COPY_FILES is
> passed to clone (or if io_uring takes a reference).  Then it removes
> helpers (get_files_struct, __install_fd, __alloc_fd, __close_fd) that
> are no longer needed and if used would encourage code that increments
> the count of files_struct somewhere besides in clone when COPY_FILES is
> passed.

I'm not seeing anything that triggered me going "that looks dodgy". It
all looks like nice cleanups.

But that's just from reading the patches (and in some cases going and
looking at the context), so I didn't actually _test_ any of it. It all
looks sane to me, though, and the fact that it removes a fair number
of lines of code is always a good sign.

It would be good for people to review and test (Al? Oleg? others?),
but my gut feel is "this is good".

             Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 11/24] file: Implement task_lookup_fd_rcu
  2020-11-20 23:14 ` [PATCH v2 11/24] file: Implement task_lookup_fd_rcu Eric W. Biederman
@ 2020-11-21 18:19   ` Cyrill Gorcunov
       [not found]     ` <87blfp1r8b.fsf@x220.int.ebiederm.org>
  0 siblings, 1 reply; 77+ messages in thread
From: Cyrill Gorcunov @ 2020-11-21 18:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Alexander Viro, Christian Brauner, Oleg Nesterov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 05:14:28PM -0600, Eric W. Biederman wrote:
> 
> diff --git a/fs/file.c b/fs/file.c
> index 5861c4f89419..6448523ca29e 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -865,6 +865,21 @@ struct file *fget_task(struct task_struct *task, unsigned int fd)
>  	return file;
>  }
>  
> +struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
> +{
> +	/* Must be called with rcu_read_lock held */

Eric, maybe worth to have something like

	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
			 "suspicious task_lookup_fd_rcu() usage");

here?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 11/24] file: Implement task_lookup_fd_rcu
       [not found]     ` <87blfp1r8b.fsf@x220.int.ebiederm.org>
@ 2020-11-22 13:52       ` Cyrill Gorcunov
  0 siblings, 0 replies; 77+ messages in thread
From: Cyrill Gorcunov @ 2020-11-22 13:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Alexander Viro, Christian Brauner, Oleg Nesterov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Sun, Nov 22, 2020 at 07:00:20AM -0600, Eric W. Biederman wrote:
> Cyrill Gorcunov <gorcunov@gmail.com> writes:
...
> That is present in files_lookup_fd_rcu, so this code should
> be good from the warning side.

Indeed, thanks!

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 16/24] bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  2020-11-20 23:14 ` [PATCH v2 16/24] bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu Eric W. Biederman
@ 2020-11-23 17:06   ` Yonghong Song
  0 siblings, 0 replies; 77+ messages in thread
From: Yonghong Song @ 2020-11-23 17:06 UTC (permalink / raw)
  To: Eric W. Biederman, linux-kernel
  Cc: linux-fsdevel, criu, bpf, Linus Torvalds, Alexander Viro,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Andrii Nakryiko,
	John Fastabend, KP Singh



On 11/20/20 3:14 PM, Eric W. Biederman wrote:
> When discussing[1] exec and posix file locks it was realized that none
> of the callers of get_files_struct fundamentally needed to call
> get_files_struct, and that by switching them to helper functions
> instead it will both simplify their code and remove unnecessary
> increments of files_struct.count.  Those unnecessary increments can
> result in exec unnecessarily unsharing files_struct which breaking
> posix locks, and it can result in fget_light having to fallback to
> fget reducing system performance.
> 
> Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next, by
> moving the checking for the maximum file descritor into the generic
> code, and by remvoing the need for capturing and releasing a reference
> on files_struct.  As the reference count of files_struct no longer
> needs to be maintained bpf_iter_seq_task_file_info can have it's files
> member removed and task_file_seq_get_next no longer needs it's fstruct
> argument.
> 
> The curr_fd local variable does need to become unsigned to be used
> with fnext_task.  As curr_fd is assigned from and assigned a u32
> making curr_fd an unsigned int won't cause problems and might prevent
> them.
> 
> [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> v1: https://lkml.kernel.org/r/20200817220425.9389-11-ebiederm@xmission.com
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>   kernel/bpf/task_iter.c | 44 ++++++++++--------------------------------
>   1 file changed, 10 insertions(+), 34 deletions(-)


Just a heads-up. The following patch
   bpf-next: 91b2db27d3ff ("bpf: Simplify task_file_seq_get_next()")
 
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=91b2db27d3ff9ad29e8b3108dfbf1e2f49fe9bd3
has been merged in bpf-next tree.

It will have merge conflicts with this patch. The above patch
is a refactoring for simplification with no functionality change.

> 
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 5ab2ccfb96cb..4ec63170c741 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -130,45 +130,33 @@ struct bpf_iter_seq_task_file_info {
>   	 */
>   	struct bpf_iter_seq_task_common common;
>   	struct task_struct *task;
> -	struct files_struct *files;
>   	u32 tid;
>   	u32 fd;
>   };
>   
>   static struct file *
>   task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
> -		       struct task_struct **task, struct files_struct **fstruct)
> +		       struct task_struct **task)
>   {
>   	struct pid_namespace *ns = info->common.ns;
> -	u32 curr_tid = info->tid, max_fds;
> -	struct files_struct *curr_files;
> +	u32 curr_tid = info->tid;
>   	struct task_struct *curr_task;
> -	int curr_fd = info->fd;
> +	unsigned int curr_fd = info->fd;
>   
>   	/* If this function returns a non-NULL file object,
> -	 * it held a reference to the task/files_struct/file.
> +	 * it held a reference to the task/file.
>   	 * Otherwise, it does not hold any reference.
>   	 */
>   again:
>   	if (*task) {
>   		curr_task = *task;
> -		curr_files = *fstruct;
>   		curr_fd = info->fd;
>   	} else {
>   		curr_task = task_seq_get_next(ns, &curr_tid, true);
>   		if (!curr_task)
>   			return NULL;
>   
> -		curr_files = get_files_struct(curr_task);
> -		if (!curr_files) {
> -			put_task_struct(curr_task);
> -			curr_tid = ++(info->tid);
> -			info->fd = 0;
> -			goto again;
> -		}
> -
> -		/* set *fstruct, *task and info->tid */
> -		*fstruct = curr_files;
> +		/* set *task and info->tid */
>   		*task = curr_task;
>   		if (curr_tid == info->tid) {
>   			curr_fd = info->fd;
> @@ -179,13 +167,11 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>   	}
>   
>   	rcu_read_lock();
> -	max_fds = files_fdtable(curr_files)->max_fds;
> -	for (; curr_fd < max_fds; curr_fd++) {
> +	for (;; curr_fd++) {
>   		struct file *f;
> -
> -		f = files_lookup_fd_rcu(curr_files, curr_fd);
> +		f = task_lookup_next_fd_rcu(curr_task, &curr_fd);
>   		if (!f)
> -			continue;
> +			break;
>   		if (!get_file_rcu(f))
>   			continue;
>   
> @@ -197,10 +183,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>   
>   	/* the current task is done, go to the next task */
>   	rcu_read_unlock();
> -	put_files_struct(curr_files);
>   	put_task_struct(curr_task);
>   	*task = NULL;
> -	*fstruct = NULL;
>   	info->fd = 0;
>   	curr_tid = ++(info->tid);
>   	goto again;
> @@ -209,13 +193,11 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info,
>   static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
>   {
>   	struct bpf_iter_seq_task_file_info *info = seq->private;
> -	struct files_struct *files = NULL;
>   	struct task_struct *task = NULL;
>   	struct file *file;
>   
> -	file = task_file_seq_get_next(info, &task, &files);
> +	file = task_file_seq_get_next(info, &task);
>   	if (!file) {
> -		info->files = NULL;
>   		info->task = NULL;
>   		return NULL;
>   	}
> @@ -223,7 +205,6 @@ static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
>   	if (*pos == 0)
>   		++*pos;
>   	info->task = task;
> -	info->files = files;
>   
>   	return file;
>   }
> @@ -231,22 +212,19 @@ static void *task_file_seq_start(struct seq_file *seq, loff_t *pos)
>   static void *task_file_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>   {
>   	struct bpf_iter_seq_task_file_info *info = seq->private;
> -	struct files_struct *files = info->files;
>   	struct task_struct *task = info->task;
>   	struct file *file;
>   
>   	++*pos;
>   	++info->fd;
>   	fput((struct file *)v);
> -	file = task_file_seq_get_next(info, &task, &files);
> +	file = task_file_seq_get_next(info, &task);
>   	if (!file) {
> -		info->files = NULL;
>   		info->task = NULL;
>   		return NULL;
>   	}
>   
>   	info->task = task;
> -	info->files = files;
>   
>   	return file;
>   }
> @@ -295,9 +273,7 @@ static void task_file_seq_stop(struct seq_file *seq, void *v)
>   		(void)__task_file_seq_show(seq, v, true);
>   	} else {
>   		fput((struct file *)v);
> -		put_files_struct(info->files);
>   		put_task_struct(info->task);
> -		info->files = NULL;
>   		info->task = NULL;
>   	}
>   }
> 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-20 23:14 ` [PATCH v2 02/24] exec: Simplify unshare_files Eric W. Biederman
@ 2020-11-23 17:50   ` Oleg Nesterov
  2020-11-23 18:25     ` Linus Torvalds
  0 siblings, 1 reply; 77+ messages in thread
From: Oleg Nesterov @ 2020-11-23 17:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Alexander Viro, Christian Brauner, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

I'll try to actually read this series tomorrow but I see nothing wrong
after the quick glance.

Just one off-topic question,

On 11/20, Eric W. Biederman wrote:
>
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -585,7 +585,6 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	int ispipe;
>  	size_t *argv = NULL;
>  	int argc = 0;
> -	struct files_struct *displaced;
>  	/* require nonrelative corefile path and be extra careful */
>  	bool need_suid_safe = false;
>  	bool core_dumped = false;
> @@ -791,11 +790,9 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	}
>
>  	/* get us an unshared descriptor table; almost always a no-op */
> -	retval = unshare_files(&displaced);
> +	retval = unshare_files();

Can anyone explain why does do_coredump() need unshare_files() at all?

Oleg.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-23 17:50   ` Oleg Nesterov
@ 2020-11-23 18:25     ` Linus Torvalds
       [not found]       ` <87im9vx08i.fsf@x220.int.ebiederm.org>
  2020-12-07 22:32       ` [PATCH v2 02/24] exec: Simplify unshare_files Al Viro
  0 siblings, 2 replies; 77+ messages in thread
From: Linus Torvalds @ 2020-11-23 18:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Linux Kernel Mailing List, linux-fsdevel,
	criu, bpf, Alexander Viro, Christian Brauner, Cyrill Gorcunov,
	Jann Horn, Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Mon, Nov 23, 2020 at 9:52 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Can anyone explain why does do_coredump() need unshare_files() at all?

Hmm. It goes back to 2012, and it's placed just before calling
"->core_dump()", so I assume some core dumping function messed with
the file table back when..

I can't see anything like that currently.

The alternative is that core-dumping just keeps the file table around
for a long while, and thus files don't actually close in a timely
manner. So it might not be a "correctness" issue as much as a latency
issue.

             Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 13/24] kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  2020-11-20 23:14 ` [PATCH v2 13/24] kcmp: In get_file_raw_ptr " Eric W. Biederman
@ 2020-11-23 21:05   ` Cyrill Gorcunov
  0 siblings, 0 replies; 77+ messages in thread
From: Cyrill Gorcunov @ 2020-11-23 21:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Alexander Viro, Christian Brauner, Oleg Nesterov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 05:14:30PM -0600, Eric W. Biederman wrote:
> Modify get_file_raw_ptr to use task_lookup_fd_rcu.  The helper
> task_lookup_fd_rcu does the work of taking the task lock and verifying
> that task->files != NULL and then calls files_lookup_fd_rcu.  So let
> use the helper to make a simpler implementation of get_file_raw_ptr.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>

Since I wrote this kcmp code in first place. Thanks Eric!

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
       [not found]         ` <87pn42r0n7.fsf@x220.int.ebiederm.org>
@ 2020-11-24 19:58           ` Linus Torvalds
  2020-11-24 20:14             ` Arnd Bergmann
  0 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2020-11-24 19:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Linux Kernel Mailing List, linux-fsdevel,
	Alexander Viro, Michael Ellerman, Arnd Bergmann

On Tue, Nov 24, 2020 at 11:55 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> If cell happens to be dead we can remove a fair amount of generic kernel
> code that only exists to support cell.

Even if some people might still use cell (which sounds unlikely), I
think we can remove the spu core dumping code.

       Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-24 19:58           ` Linus Torvalds
@ 2020-11-24 20:14             ` Arnd Bergmann
  2020-11-24 23:44               ` Geoff Levand
  0 siblings, 1 reply; 77+ messages in thread
From: Arnd Bergmann @ 2020-11-24 20:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Oleg Nesterov, Linux Kernel Mailing List,
	linux-fsdevel, Alexander Viro, Michael Ellerman, Arnd Bergmann,
	Geoff Levand

On Tue, Nov 24, 2020 at 8:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Nov 24, 2020 at 11:55 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > If cell happens to be dead we can remove a fair amount of generic kernel
> > code that only exists to support cell.
>
> Even if some people might still use cell (which sounds unlikely), I
> think we can remove the spu core dumping code.

The Cell blade hardware (arch/powerpc/platforms/cell/) that I'm listed
as a maintainer for is very much dead, but there is apparently still some
activity on the Playstation 3 that Geoff Levand maintains.

Eric correctly points out that the PS3 firmware no longer boots
Linux (OtherOS), but AFAIK there are both users with old firmware
and those that use a firmware exploit to run homebrew code including
Linux.

I would assume they still use the SPU and might also use the core
dump code in particular. Let's see what Geoff thinks.

       Arnd

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-24 20:14             ` Arnd Bergmann
@ 2020-11-24 23:44               ` Geoff Levand
  2020-11-25  1:16                 ` Eric W. Biederman
  2020-11-25 21:51                 ` [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs Eric W. Biederman
  0 siblings, 2 replies; 77+ messages in thread
From: Geoff Levand @ 2020-11-24 23:44 UTC (permalink / raw)
  To: Arnd Bergmann, Linus Torvalds
  Cc: Eric W. Biederman, Oleg Nesterov, Linux Kernel Mailing List,
	linux-fsdevel, Alexander Viro, Michael Ellerman, Arnd Bergmann

On 11/24/20 12:14 PM, Arnd Bergmann wrote:
> On Tue, Nov 24, 2020 at 8:58 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Tue, Nov 24, 2020 at 11:55 AM Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>>
>>> If cell happens to be dead we can remove a fair amount of generic kernel
>>> code that only exists to support cell.
>>
>> Even if some people might still use cell (which sounds unlikely), I
>> think we can remove the spu core dumping code.
> 
> The Cell blade hardware (arch/powerpc/platforms/cell/) that I'm listed
> as a maintainer for is very much dead, but there is apparently still some
> activity on the Playstation 3 that Geoff Levand maintains.
> 
> Eric correctly points out that the PS3 firmware no longer boots
> Linux (OtherOS), but AFAIK there are both users with old firmware
> and those that use a firmware exploit to run homebrew code including
> Linux.
> 
> I would assume they still use the SPU and might also use the core
> dump code in particular. Let's see what Geoff thinks.

There are still PS3-Linux users out there.  They use 'Homebrew' firmware
released through 'Hacker' forums that allow them to run Linux on
non-supported systems.  They are generally hobbies who don't post to
Linux kernel mailing lists.  I get direct inquiries regularly asking
about how to update to a recent kernel.  One of the things that attract
them to the PS3 is the Cell processor and either using or programming
the SPUs.

It is difficult to judge how much use the SPU core dump support gets,
but if it is not a cause of major problems I feel we should consider
keeping it.

-Geoff


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-24 23:44               ` Geoff Levand
@ 2020-11-25  1:16                 ` Eric W. Biederman
  2020-11-27 20:29                   ` Arnd Bergmann
  2020-11-25 21:51                 ` [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs Eric W. Biederman
  1 sibling, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-25  1:16 UTC (permalink / raw)
  To: Geoff Levand
  Cc: Arnd Bergmann, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Michael Ellerman, Arnd Bergmann

Geoff Levand <geoff@infradead.org> writes:

> On 11/24/20 12:14 PM, Arnd Bergmann wrote:
>> On Tue, Nov 24, 2020 at 8:58 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> On Tue, Nov 24, 2020 at 11:55 AM Eric W. Biederman
>>> <ebiederm@xmission.com> wrote:
>>>>
>>>> If cell happens to be dead we can remove a fair amount of generic kernel
>>>> code that only exists to support cell.
>>>
>>> Even if some people might still use cell (which sounds unlikely), I
>>> think we can remove the spu core dumping code.
>> 
>> The Cell blade hardware (arch/powerpc/platforms/cell/) that I'm listed
>> as a maintainer for is very much dead, but there is apparently still some
>> activity on the Playstation 3 that Geoff Levand maintains.
>> 
>> Eric correctly points out that the PS3 firmware no longer boots
>> Linux (OtherOS), but AFAIK there are both users with old firmware
>> and those that use a firmware exploit to run homebrew code including
>> Linux.
>> 
>> I would assume they still use the SPU and might also use the core
>> dump code in particular. Let's see what Geoff thinks.
>
> There are still PS3-Linux users out there.  They use 'Homebrew' firmware
> released through 'Hacker' forums that allow them to run Linux on
> non-supported systems.  They are generally hobbies who don't post to
> Linux kernel mailing lists.  I get direct inquiries regularly asking
> about how to update to a recent kernel.  One of the things that attract
> them to the PS3 is the Cell processor and either using or programming
> the SPUs.
>
> It is difficult to judge how much use the SPU core dump support gets,
> but if it is not a cause of major problems I feel we should consider
> keeping it.

I just took a quick look to get a sense how much tool support there is.

In the gdb tree I found this 2019 commit abf516c6931a ("Remove Cell
Broadband Engine debugging support").  Which basically removes the code
in gdb that made sense of the spu coredumps.

I would not say the coredump support is a source major problems, but it
is a challenge to understand.  One of the pieces of code in there that
is necessary to make the coredump support work reliable, a call to
unshare_files, Oleg whole essentially maintains the ptrace and coredump
support did not know why it was there, and it was not at all obvious
when I looked at the code.

So we are certainly in maintainers loosing hours of time figuring out
what is going on and spending time fixing fuzzer bugs related to the
code.

At the minimum I will add a few more comments so people reading the code
can realize why it is there.   Perhaps putting the relevant code behind
a Kconfig so it is only built into the kernel when spufs is present.

I think we are at a point we we can start planning on removing the
coredump support.  The tools to read it are going away.  None of what is
there is bad, but it is definitely a special case, and it definitely has
a maintenance cost.

Eric



^ permalink raw reply	[flat|nested] 77+ messages in thread

* [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs
  2020-11-24 23:44               ` Geoff Levand
  2020-11-25  1:16                 ` Eric W. Biederman
@ 2020-11-25 21:51                 ` Eric W. Biederman
  2020-12-02 15:20                   ` Eric W. Biederman
  1 sibling, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-25 21:51 UTC (permalink / raw)
  To: Geoff Levand
  Cc: Arnd Bergmann, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Michael Ellerman, Arnd Bergmann


Oleg Nesterov recently asked[1] why is there an unshare_files in
do_coredump.  After digging through all of the callers of lookup_fd it
turns out that it is
arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
that needs the unshare_files in do_coredump.

Looking at the history[2] this code was also the only piece of coredump code
that required the unshare_files when the unshare_files was added.

Looking at that code it turns out that cell is also the only
architecture that implements elf_coredump_extra_notes_size and
elf_coredump_extra_notes_write.

I looked at the gdb repo[3] support for cell has been removed[4] in binutils
2.34.  Geoff Levand reports he is still getting questions on how to
run modern kernels on the PS3, from people using 3rd party firmware so
this code is not dead.  According to Wikipedia the last PS3 shipped in
Japan sometime in 2017.  So it will probably be a little while before
everyone's hardware dies.

Add some comments briefly documenting the coredump code that exists
only to support cell spufs to make it easier to understand the
coredump code.  Eventually the hardware will be dead, or their won't
be userspace tools, or the coredump code will be refactored and it
will be too difficult to update a dead architecture and these comments
make it easy to tell where to pull to remove cell spufs support.

[1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
[2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared")
[3] git://sourceware.org/git/binutils-gdb.git
[4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

Does this change look good to people?  I think it captures this state of
things and makes things clearer without breaking anything or removing
functionality for anyone.

 fs/binfmt_elf.c | 2 ++
 fs/coredump.c   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index b6b3d052ca86..c1996f0aeaed 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2198,6 +2198,7 @@ static int elf_core_dump(struct coredump_params *cprm)
 	{
 		size_t sz = get_note_info_size(&info);
 
+		/* For cell spufs */
 		sz += elf_coredump_extra_notes_size();
 
 		phdr4note = kmalloc(sizeof(*phdr4note), GFP_KERNEL);
@@ -2261,6 +2262,7 @@ static int elf_core_dump(struct coredump_params *cprm)
 	if (!write_note_info(&info, cprm))
 		goto end_coredump;
 
+	/* For cell spufs */
 	if (elf_coredump_extra_notes_write(cprm))
 		goto end_coredump;
 
diff --git a/fs/coredump.c b/fs/coredump.c
index abf807235262..3ff17eea812e 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -790,6 +790,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */
+	/* The cell spufs coredump code reads the file descriptor tables */
 	retval = unshare_files();
 	if (retval)
 		goto close_fail;
-- 
2.25.0

Eric

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-25  1:16                 ` Eric W. Biederman
@ 2020-11-27 20:29                   ` Arnd Bergmann
  2020-11-30 21:37                     ` Eric W. Biederman
  2020-12-01  9:46                     ` Michael Ellerman
  0 siblings, 2 replies; 77+ messages in thread
From: Arnd Bergmann @ 2020-11-27 20:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Geoff Levand, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Michael Ellerman, Arnd Bergmann

On Wed, Nov 25, 2020 at 2:16 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > On 11/24/20 12:14 PM, Arnd Bergmann wrote:
> >
> > There are still PS3-Linux users out there.  They use 'Homebrew' firmware
> > released through 'Hacker' forums that allow them to run Linux on
> > non-supported systems.  They are generally hobbies who don't post to
> > Linux kernel mailing lists.  I get direct inquiries regularly asking
> > about how to update to a recent kernel.  One of the things that attract
> > them to the PS3 is the Cell processor and either using or programming
> > the SPUs.
> >
> > It is difficult to judge how much use the SPU core dump support gets,
> > but if it is not a cause of major problems I feel we should consider
> > keeping it.
>
> I just took a quick look to get a sense how much tool support there is.
>
> In the gdb tree I found this 2019 commit abf516c6931a ("Remove Cell
> Broadband Engine debugging support").  Which basically removes the code
> in gdb that made sense of the spu coredumps.

Ah, I had not realized this was gone already. The code in gdb for
seamlessly debugging programs across CPU and SPU was clearly
more complex than the kernel portion for the coredump, so it makes
sense this was removed eventually.

> I would not say the coredump support is a source major problems, but it
> is a challenge to understand.  One of the pieces of code in there that
> is necessary to make the coredump support work reliable, a call to
> unshare_files, Oleg whole essentially maintains the ptrace and coredump
> support did not know why it was there, and it was not at all obvious
> when I looked at the code.
>
> So we are certainly in maintainers loosing hours of time figuring out
> what is going on and spending time fixing fuzzer bugs related to the
> code.

I also spent some amount of time on this code earlier this year Christoph
did some refactoring, and we could both have used that time better.

> At the minimum I will add a few more comments so people reading the code
> can realize why it is there.   Perhaps putting the relevant code behind
> a Kconfig so it is only built into the kernel when spufs is present.
>
> I think we are at a point we we can start planning on removing the
> coredump support.  The tools to read it are going away.  None of what is
> there is bad, but it is definitely a special case, and it definitely has
> a maintenance cost.

How about adding a comment in the coredump code so it can get
removed the next time someone comes across it during refactoring,
or when they find a bug that can't easily be worked around?

That way there is still a chance of using it where needed, but
hopefully it won't waste anyone's time when it gets in the way.

If there are no objections, I can also send a patch to remove
CONFIG_PPC_CELL_NATIVE, PPC_IBM_CELL_BLADE and
everything that depends on those symbols, leaving only the
bits needed by ps3 in the arch/powerpc/platforms/cell directory.

      Arnd

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct
  2020-11-21  0:05 ` [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct Linus Torvalds
@ 2020-11-28  5:12   ` Al Viro
  2020-12-07 23:56     ` Al Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-11-28  5:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, linux-fsdevel,
	criu, bpf, Christian Brauner, Oleg Nesterov, Cyrill Gorcunov,
	Jann Horn, Kees Cook, Daniel P. Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 04:05:47PM -0800, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 3:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > This set of changes cleanups of the code in exec so hopefully this code
> > will not regress again.  Then it adds helpers and fixes the users of
> > files_struct so the reference count is only incremented if COPY_FILES is
> > passed to clone (or if io_uring takes a reference).  Then it removes
> > helpers (get_files_struct, __install_fd, __alloc_fd, __close_fd) that
> > are no longer needed and if used would encourage code that increments
> > the count of files_struct somewhere besides in clone when COPY_FILES is
> > passed.
> 
> I'm not seeing anything that triggered me going "that looks dodgy". It
> all looks like nice cleanups.
> 
> But that's just from reading the patches (and in some cases going and
> looking at the context), so I didn't actually _test_ any of it. It all
> looks sane to me, though, and the fact that it removes a fair number
> of lines of code is always a good sign.
> 
> It would be good for people to review and test (Al? Oleg? others?),
> but my gut feel is "this is good".

Will check (sorry, the last couple of weeks had been bloody awful -
off-net and very short on sleep); I'm digging through the piles of
email right now.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-27 20:29                   ` Arnd Bergmann
@ 2020-11-30 21:37                     ` Eric W. Biederman
  2020-12-01  9:46                     ` Michael Ellerman
  1 sibling, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-11-30 21:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Geoff Levand, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Michael Ellerman, Arnd Bergmann

Arnd Bergmann <arnd@kernel.org> writes:

> On Wed, Nov 25, 2020 at 2:16 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> > On 11/24/20 12:14 PM, Arnd Bergmann wrote:
>> >
>> > There are still PS3-Linux users out there.  They use 'Homebrew' firmware
>> > released through 'Hacker' forums that allow them to run Linux on
>> > non-supported systems.  They are generally hobbies who don't post to
>> > Linux kernel mailing lists.  I get direct inquiries regularly asking
>> > about how to update to a recent kernel.  One of the things that attract
>> > them to the PS3 is the Cell processor and either using or programming
>> > the SPUs.
>> >
>> > It is difficult to judge how much use the SPU core dump support gets,
>> > but if it is not a cause of major problems I feel we should consider
>> > keeping it.
>>
>> I just took a quick look to get a sense how much tool support there is.
>>
>> In the gdb tree I found this 2019 commit abf516c6931a ("Remove Cell
>> Broadband Engine debugging support").  Which basically removes the code
>> in gdb that made sense of the spu coredumps.
>
> Ah, I had not realized this was gone already. The code in gdb for
> seamlessly debugging programs across CPU and SPU was clearly
> more complex than the kernel portion for the coredump, so it makes
> sense this was removed eventually.
>
>> I would not say the coredump support is a source major problems, but it
>> is a challenge to understand.  One of the pieces of code in there that
>> is necessary to make the coredump support work reliable, a call to
>> unshare_files, Oleg whole essentially maintains the ptrace and coredump
>> support did not know why it was there, and it was not at all obvious
>> when I looked at the code.
>>
>> So we are certainly in maintainers loosing hours of time figuring out
>> what is going on and spending time fixing fuzzer bugs related to the
>> code.
>
> I also spent some amount of time on this code earlier this year Christoph
> did some refactoring, and we could both have used that time better.
>
>> At the minimum I will add a few more comments so people reading the code
>> can realize why it is there.   Perhaps putting the relevant code behind
>> a Kconfig so it is only built into the kernel when spufs is present.
>>
>> I think we are at a point we we can start planning on removing the
>> coredump support.  The tools to read it are going away.  None of what is
>> there is bad, but it is definitely a special case, and it definitely has
>> a maintenance cost.
>
> How about adding a comment in the coredump code so it can get
> removed the next time someone comes across it during refactoring,
> or when they find a bug that can't easily be worked around?

Did my proposed patch look ok?

> That way there is still a chance of using it where needed, but
> hopefully it won't waste anyone's time when it gets in the way.

Sounds good to me.

> If there are no objections, I can also send a patch to remove
> CONFIG_PPC_CELL_NATIVE, PPC_IBM_CELL_BLADE and
> everything that depends on those symbols, leaving only the
> bits needed by ps3 in the arch/powerpc/platforms/cell directory.

That also seems reasonable.  My read of the history suggests that
code has been out of commission for a decade or so, and not having it to
trip over (just present in the history) seems very reasonable.

Eric

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-27 20:29                   ` Arnd Bergmann
  2020-11-30 21:37                     ` Eric W. Biederman
@ 2020-12-01  9:46                     ` Michael Ellerman
  1 sibling, 0 replies; 77+ messages in thread
From: Michael Ellerman @ 2020-12-01  9:46 UTC (permalink / raw)
  To: Arnd Bergmann, Eric W. Biederman
  Cc: Geoff Levand, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Arnd Bergmann

Arnd Bergmann <arnd@kernel.org> writes:
...
>
> If there are no objections, I can also send a patch to remove
> CONFIG_PPC_CELL_NATIVE, PPC_IBM_CELL_BLADE and
> everything that depends on those symbols, leaving only the
> bits needed by ps3 in the arch/powerpc/platforms/cell directory.

I'm not sure I'd merge it.

The only way I am able to (easily) test Cell code is by using one of
those blades, a QS22 to be precise.

So if the blade support is removed then the rest of the Cell code is
likely to bitrot quickly. Which may be the goal.

I'd be more inclined to support removal of the core dump code. That
seems highly unlikely to be in active use, I don't have the impression
there are many folks doing active development on Cell.

cheers

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs
  2020-11-25 21:51                 ` [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs Eric W. Biederman
@ 2020-12-02 15:20                   ` Eric W. Biederman
  2020-12-02 15:58                     ` Arnd Bergmann
  0 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-02 15:20 UTC (permalink / raw)
  To: Geoff Levand
  Cc: Arnd Bergmann, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Michael Ellerman, Arnd Bergmann

ebiederm@xmission.com (Eric W. Biederman) writes:

> Oleg Nesterov recently asked[1] why is there an unshare_files in
> do_coredump.  After digging through all of the callers of lookup_fd it
> turns out that it is
> arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
> that needs the unshare_files in do_coredump.
>
> Looking at the history[2] this code was also the only piece of coredump code
> that required the unshare_files when the unshare_files was added.
>
> Looking at that code it turns out that cell is also the only
> architecture that implements elf_coredump_extra_notes_size and
> elf_coredump_extra_notes_write.
>
> I looked at the gdb repo[3] support for cell has been removed[4] in binutils
> 2.34.  Geoff Levand reports he is still getting questions on how to
> run modern kernels on the PS3, from people using 3rd party firmware so
> this code is not dead.  According to Wikipedia the last PS3 shipped in
> Japan sometime in 2017.  So it will probably be a little while before
> everyone's hardware dies.
>
> Add some comments briefly documenting the coredump code that exists
> only to support cell spufs to make it easier to understand the
> coredump code.  Eventually the hardware will be dead, or their won't
> be userspace tools, or the coredump code will be refactored and it
> will be too difficult to update a dead architecture and these comments
> make it easy to tell where to pull to remove cell spufs support.
>
> [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
> [2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared")
> [3] git://sourceware.org/git/binutils-gdb.git
> [4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> ---
>
> Does this change look good to people?  I think it captures this state of
> things and makes things clearer without breaking anything or removing
> functionality for anyone.

I haven't heard anything except a general ack to the concept of
comments.  So I am applying this.

Eric

>
>  fs/binfmt_elf.c | 2 ++
>  fs/coredump.c   | 1 +
>  2 files changed, 3 insertions(+)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index b6b3d052ca86..c1996f0aeaed 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -2198,6 +2198,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>  	{
>  		size_t sz = get_note_info_size(&info);
>  
> +		/* For cell spufs */
>  		sz += elf_coredump_extra_notes_size();
>  
>  		phdr4note = kmalloc(sizeof(*phdr4note), GFP_KERNEL);
> @@ -2261,6 +2262,7 @@ static int elf_core_dump(struct coredump_params *cprm)
>  	if (!write_note_info(&info, cprm))
>  		goto end_coredump;
>  
> +	/* For cell spufs */
>  	if (elf_coredump_extra_notes_write(cprm))
>  		goto end_coredump;
>  
> diff --git a/fs/coredump.c b/fs/coredump.c
> index abf807235262..3ff17eea812e 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -790,6 +790,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  	}
>  
>  	/* get us an unshared descriptor table; almost always a no-op */
> +	/* The cell spufs coredump code reads the file descriptor tables */
>  	retval = unshare_files();
>  	if (retval)
>  		goto close_fail;

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs
  2020-12-02 15:20                   ` Eric W. Biederman
@ 2020-12-02 15:58                     ` Arnd Bergmann
  0 siblings, 0 replies; 77+ messages in thread
From: Arnd Bergmann @ 2020-12-02 15:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Geoff Levand, Linus Torvalds, Oleg Nesterov,
	Linux Kernel Mailing List, linux-fsdevel, Alexander Viro,
	Michael Ellerman, Arnd Bergmann

On Wed, Dec 2, 2020 at 4:20 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
> > Oleg Nesterov recently asked[1] why is there an unshare_files in
> > do_coredump.  After digging through all of the callers of lookup_fd it
> > turns out that it is
> > arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
> > that needs the unshare_files in do_coredump.
> >
> > Looking at the history[2] this code was also the only piece of coredump code
> > that required the unshare_files when the unshare_files was added.
> >
> > Looking at that code it turns out that cell is also the only
> > architecture that implements elf_coredump_extra_notes_size and
> > elf_coredump_extra_notes_write.
> >
> > I looked at the gdb repo[3] support for cell has been removed[4] in binutils
> > 2.34.  Geoff Levand reports he is still getting questions on how to
> > run modern kernels on the PS3, from people using 3rd party firmware so
> > this code is not dead.  According to Wikipedia the last PS3 shipped in
> > Japan sometime in 2017.  So it will probably be a little while before
> > everyone's hardware dies.
> >
> > Add some comments briefly documenting the coredump code that exists
> > only to support cell spufs to make it easier to understand the
> > coredump code.  Eventually the hardware will be dead, or their won't
> > be userspace tools, or the coredump code will be refactored and it
> > will be too difficult to update a dead architecture and these comments
> > make it easy to tell where to pull to remove cell spufs support.
> >
> > [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
> > [2] 179e037fc137 ("do_coredump(): make sure that descriptor table isn't shared")
> > [3] git://sourceware.org/git/binutils-gdb.git
> > [4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
> > Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
> > ---
> >
> > Does this change look good to people?  I think it captures this state of
> > things and makes things clearer without breaking anything or removing
> > functionality for anyone.
>
> I haven't heard anything except a general ack to the concept of
> comments.  So I am applying this.
>

Sorry I missed it when you originally sent it. Looks good ot me,

Acked-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec
  2020-11-20 23:14 ` [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec Eric W. Biederman
  2020-11-20 23:58   ` Linus Torvalds
@ 2020-12-07 22:22   ` Al Viro
       [not found]     ` <87zh2pusdw.fsf@x220.int.ebiederm.org>
  1 sibling, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-07 22:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 05:14:18PM -0600, Eric W. Biederman wrote:
> @@ -1805,8 +1808,12 @@ static int bprm_execve(struct linux_binprm *bprm,
>  	bprm->file = file;
>  	/*
>  	 * Record that a name derived from an O_CLOEXEC fd will be
> -	 * inaccessible after exec. Relies on having exclusive access to
> -	 * current->files (due to unshare_files above).
> +	 * inaccessible after exec.  This allows the code in exec to
> +	 * choose to fail when the executable is not mmaped into the
> +	 * interpreter and an open file descriptor is not passed to
> +	 * the interpreter.  This makes for a better user experience
> +	 * than having the interpreter start and then immediately fail
> +	 * when it finds the executable is inaccessible.
>  	 */
>  	if (bprm->fdpath &&
>  	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))

We do not have rcu_read_lock() here.  What would happen if
	* we get fdt
	* another thread does e.g. dup2() with target descriptor greater than
the current size
	* old fdt gets copied and (RCU-delayed) freed
	* nobody is holding rcu_read_lock(), so it gets really freed
	* we read a bitmap from the already freed sucker

It's a narrow window, but on SMP KVM it's not impossible to hit; if you
have preemption enabled, the race window is not so narrow even when
running on bare metal.  In the mainline it is safe, but only because
the damn thing is guaranteed to be _not_ shared with any other thread
(which is what the comment had been about).  Why not simply say
	if (bprm->fdpath && get_close_on_exec(fd))
anyway?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 02/24] exec: Simplify unshare_files
  2020-11-23 18:25     ` Linus Torvalds
       [not found]       ` <87im9vx08i.fsf@x220.int.ebiederm.org>
@ 2020-12-07 22:32       ` Al Viro
  1 sibling, 0 replies; 77+ messages in thread
From: Al Viro @ 2020-12-07 22:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Eric W. Biederman, Linux Kernel Mailing List,
	linux-fsdevel, criu, bpf, Christian Brauner, Cyrill Gorcunov,
	Jann Horn, Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Mon, Nov 23, 2020 at 10:25:13AM -0800, Linus Torvalds wrote:
> On Mon, Nov 23, 2020 at 9:52 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Can anyone explain why does do_coredump() need unshare_files() at all?
> 
> Hmm. It goes back to 2012, and it's placed just before calling
> "->core_dump()", so I assume some core dumping function messed with
> the file table back when..
> 
> I can't see anything like that currently.
> 
> The alternative is that core-dumping just keeps the file table around
> for a long while, and thus files don't actually close in a timely
> manner. So it might not be a "correctness" issue as much as a latency
> issue.

IIRC, it was "weird architecture hooks might be playing silly buggers
with some per-descriptor information they want in coredumps, better
make sure it can't change under them"; it doesn't cost much and
it reduced the analysis surface nicely.

Had been a while ago, so the memories might be faulty...  Anyway, that
reasoning seems to be applicable right now - rather than keeping an
eye on coredump logics on random architectures that might be looking
at descriptor table in unsafe way, just make sure they have a stable
private table and be done with that.

How much is simplified by not doing it there, anyway?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu
  2020-11-20 23:14 ` [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu Eric W. Biederman
@ 2020-12-07 22:46   ` Al Viro
  2020-12-07 22:49     ` Al Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-07 22:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote:

>  /*
>   * Check whether the specified fd has an open file.
>   */
> -#define fcheck(fd)	fcheck_files(current->files, fd)
> +#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)

Huh?
fs/file.c:1113: file = fcheck(oldfd);
	dup3(), under ->file_lock, no rcu_read_lock() in sight

fs/locks.c:2548:                f = fcheck(fd);
	fcntl_setlk(), ditto

fs/locks.c:2679:                f = fcheck(fd);
	fcntl_setlk64(), ditto

fs/notify/dnotify/dnotify.c:330:        f = fcheck(fd);
	fcntl_dirnotify(); this one _is_ under rcu_read_lock().


IOW, unless I've missed something earlier in the series, this is wrong.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu
  2020-12-07 22:46   ` Al Viro
@ 2020-12-07 22:49     ` Al Viro
  2020-12-09 16:54       ` Al Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-07 22:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Mon, Dec 07, 2020 at 10:46:57PM +0000, Al Viro wrote:
> On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote:
> 
> >  /*
> >   * Check whether the specified fd has an open file.
> >   */
> > -#define fcheck(fd)	fcheck_files(current->files, fd)
> > +#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
> 
> Huh?
> fs/file.c:1113: file = fcheck(oldfd);
> 	dup3(), under ->file_lock, no rcu_read_lock() in sight
> 
> fs/locks.c:2548:                f = fcheck(fd);
> 	fcntl_setlk(), ditto
> 
> fs/locks.c:2679:                f = fcheck(fd);
> 	fcntl_setlk64(), ditto
> 
> fs/notify/dnotify/dnotify.c:330:        f = fcheck(fd);
> 	fcntl_dirnotify(); this one _is_ under rcu_read_lock().
> 
> 
> IOW, unless I've missed something earlier in the series, this is wrong.

I have missed something, all right.  Ignore that comment...

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  2020-11-20 23:14 ` [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu Eric W. Biederman
@ 2020-12-07 23:29   ` Al Viro
       [not found]     ` <877dprvs8e.fsf@x220.int.ebiederm.org>
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-07 23:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Andy Lavr

On Fri, Nov 20, 2020 at 05:14:32PM -0600, Eric W. Biederman wrote:
> When discussing[1] exec and posix file locks it was realized that none
> of the callers of get_files_struct fundamentally needed to call
> get_files_struct, and that by switching them to helper functions
> instead it will both simplify their code and remove unnecessary
> increments of files_struct.count.  Those unnecessary increments can
> result in exec unnecessarily unsharing files_struct which breaking
> posix locks, and it can result in fget_light having to fallback to
> fget reducing system performance.
> 
> Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving
> the checking for the maximum file descritor into the generic code, and
> by remvoing the need for capturing and releasing a reference on
> files_struct.
> 
> As task_lookup_fd_rcu may update the fd ctx->pos has been changed
> to be the fd +2 after task_lookup_fd_rcu returns.


> +	for (fd = ctx->pos - 2;; fd++) {
>  		struct file *f;
>  		struct fd_data data;
>  		char name[10 + 1];
>  		unsigned int len;
>  
> -		f = files_lookup_fd_rcu(files, fd);
> +		f = task_lookup_next_fd_rcu(p, &fd);

Ugh...  That makes for a massive cacheline pingpong on task_lock -
instead of grabbing/dropping task_lock() once in the beginning, we do
that for every damn descriptor.

I really don't like this one.  If anything, I would rather have
a helper that would collect a bunch of pairs (fd,mode) into an
array and have lookups batched into it.  With the loop in that
sucker grabbing a reasonable amount into a local array, then
doing proc_fill_cache() for each collected.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec
       [not found]     ` <87zh2pusdw.fsf@x220.int.ebiederm.org>
@ 2020-12-07 23:35       ` Al Viro
  0 siblings, 0 replies; 77+ messages in thread
From: Al Viro @ 2020-12-07 23:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Mon, Dec 07, 2020 at 05:07:55PM -0600, Eric W. Biederman wrote:

> My mistake.  I missed that the actual code was highly optimized and only
> safe in the presence of an unshared files struct.

That's a polite way to spell "overoptimized for no good reason" ;-)

> What I saw and what I thought the comment was talking about was that the
> result of the test isn't guaranteed to be stable without having an
> exclusive access to the files.  I figured and if something changes later
> and it becomes safe to pass the file name to a script later we don't
> really care.
> 
> In any event thank you for the review.  I will queue up a follow on
> patch that makes this use get_close_on_exec.

I would rather put that switch to get_close_on_exec() first in the queue
(or fold it into this patch)...

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct
  2020-11-28  5:12   ` Al Viro
@ 2020-12-07 23:56     ` Al Viro
  0 siblings, 0 replies; 77+ messages in thread
From: Al Viro @ 2020-12-07 23:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, linux-fsdevel,
	criu, bpf, Christian Brauner, Oleg Nesterov, Cyrill Gorcunov,
	Jann Horn, Kees Cook, Daniel P. Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Sat, Nov 28, 2020 at 05:12:26AM +0000, Al Viro wrote:
> On Fri, Nov 20, 2020 at 04:05:47PM -0800, Linus Torvalds wrote:
> > On Fri, Nov 20, 2020 at 3:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > >
> > > This set of changes cleanups of the code in exec so hopefully this code
> > > will not regress again.  Then it adds helpers and fixes the users of
> > > files_struct so the reference count is only incremented if COPY_FILES is
> > > passed to clone (or if io_uring takes a reference).  Then it removes
> > > helpers (get_files_struct, __install_fd, __alloc_fd, __close_fd) that
> > > are no longer needed and if used would encourage code that increments
> > > the count of files_struct somewhere besides in clone when COPY_FILES is
> > > passed.
> > 
> > I'm not seeing anything that triggered me going "that looks dodgy". It
> > all looks like nice cleanups.
> > 
> > But that's just from reading the patches (and in some cases going and
> > looking at the context), so I didn't actually _test_ any of it. It all
> > looks sane to me, though, and the fact that it removes a fair number
> > of lines of code is always a good sign.
> > 
> > It would be good for people to review and test (Al? Oleg? others?),
> > but my gut feel is "this is good".
> 
> Will check (sorry, the last couple of weeks had been bloody awful -
> off-net and very short on sleep); I'm digging through the piles of
> email right now.

	TBH, the thing that makes me uneasy about that series is handling
the task_lock().  Holding it for longer periods in general, plus boinking
it a _lot_ at least on /proc/*/fd getdents(2).  There might be other
places like that I'd missed - I'm rereading the series now that I've
noticed that one.

	Other than that I mostly like it.  I would rather reduce the
access to other processes descriptor tables (in particular, slapped
a very clear "don't use those helpers unless you really need it; we
will be watching for new call sites and you *will* have to explain
yourself if you add such" on them), but that's not a new problem -
fcheck_files() et.al. had to be watched anyway.

	One thing I would like to achieve is fewer places grabbing
struct file references from descriptor tables other than current->files,
but that's mostly orthogonal to this series.  I'll need to resurrect
a local branch trying to do that; there are conflicts (unsurprisingly),
but due to the amount of bitrot it has accumulated it'll have to be
redone anyway.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
       [not found]     ` <877dprvs8e.fsf@x220.int.ebiederm.org>
@ 2020-12-09  4:07       ` Al Viro
       [not found]         ` <877dprtxly.fsf@x220.int.ebiederm.org>
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-09  4:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Andy Lavr

On Tue, Dec 08, 2020 at 04:38:09PM -0600, Eric W. Biederman wrote:

> Is there any reason we don't simply rcu free the files_struct?
> That would remove the need for the task_lock entirely.

Umm...  Potentially interesting part here is the interaction with
close_files(); currently that can't overlap with any of those
3rd-party accesses to descriptor table, but with your changes
here it's very much possible.

OTOH, it's not like close_files() did much beyond the effects of already
possible close(2) done by one of the threads sharing that sucker.
It's _probably_ safe (at least for proc_readfd_common()), but I'll need
to look at the other places doing that kind of access.  Especially the
BPF foulness...

Oh, and in any case, the trick with repurposing ->rcu of embedded
fdtable deserves a comment.  It's not hard to explain, so...

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
       [not found]         ` <877dprtxly.fsf@x220.int.ebiederm.org>
@ 2020-12-09 14:23           ` Al Viro
  2020-12-09 18:04             ` [PATCH] files: rcu free files_struct Eric W. Biederman
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-09 14:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Andy Lavr

On Tue, Dec 08, 2020 at 10:24:57PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > On Tue, Dec 08, 2020 at 04:38:09PM -0600, Eric W. Biederman wrote:
> >
> >> Is there any reason we don't simply rcu free the files_struct?
> >> That would remove the need for the task_lock entirely.
> >
> > Umm...  Potentially interesting part here is the interaction with
> > close_files(); currently that can't overlap with any of those
> > 3rd-party accesses to descriptor table, but with your changes
> > here it's very much possible.
> 
> Good point.
> 
> I was worried there might have been a concern about the overhead
> introduced by always rcu freeing files table.
> 
> > OTOH, it's not like close_files() did much beyond the effects of already
> > possible close(2) done by one of the threads sharing that sucker.
> > It's _probably_ safe (at least for proc_readfd_common()), but I'll need
> > to look at the other places doing that kind of access.  Especially the
> > BPF foulness...

Still digging, unfortunately ;-/

> > Oh, and in any case, the trick with repurposing ->rcu of embedded
> > fdtable deserves a comment.  It's not hard to explain, so...
> 
> Agreed.  Something like fdtable.rcu isn't used so use it so by reusing
> it we keep from wasting memory in files_struct to have a dedicated
> rcu_head.

I'd probably go for something along the lines of "we can avoid adding
a separate rcu_head into files_struct, since rcu_head in struct fdtable
is only used for separately allocated instances, allowing us to repurpose
files_struct->fdtab.rcu for RCU-delayed freeing of files_struct"...

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu
  2020-12-07 22:49     ` Al Viro
@ 2020-12-09 16:54       ` Al Viro
  2020-12-09 17:44         ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-09 16:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, linux-fsdevel, criu, bpf, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Cyrill Gorcunov, Jann Horn,
	Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh, Paul E. McKenney

[paulmck Cc'd]

On Mon, Dec 07, 2020 at 10:49:04PM +0000, Al Viro wrote:
> On Mon, Dec 07, 2020 at 10:46:57PM +0000, Al Viro wrote:
> > On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote:
> > 
> > >  /*
> > >   * Check whether the specified fd has an open file.
> > >   */
> > > -#define fcheck(fd)	fcheck_files(current->files, fd)
> > > +#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
> > 
> > Huh?
> > fs/file.c:1113: file = fcheck(oldfd);
> > 	dup3(), under ->file_lock, no rcu_read_lock() in sight
> > 
> > fs/locks.c:2548:                f = fcheck(fd);
> > 	fcntl_setlk(), ditto
> > 
> > fs/locks.c:2679:                f = fcheck(fd);
> > 	fcntl_setlk64(), ditto
> > 
> > fs/notify/dnotify/dnotify.c:330:        f = fcheck(fd);
> > 	fcntl_dirnotify(); this one _is_ under rcu_read_lock().
> > 
> > 
> > IOW, unless I've missed something earlier in the series, this is wrong.
> 
> I have missed something, all right.  Ignore that comment...

Actually, I take that back - the use of fcheck() in dnotify _is_ interesting.
At the very least it needs to be commented upon; what that code is trying
to prevent is a race between fcntl_dirnotify() and close(2)/dup2(2) closing
the descriptor in question.  The problem is, dnotify marks are removed
when we detach from descriptor table; that's done by filp_close() calling
dnotify_flush().

Suppose fcntl(fd, F_NOTIFY, 0) in one thread races with close(fd) in another
(both sharing the same descriptor table).  If the former had created and
inserted a mark *after* the latter has gotten past dnotify_flush(), there
would be nothing to evict that mark.

That's the reason that fcheck() is there.  rcu_read_lock() used to be
sufficient, but the locking has changed since then and even if it is
still enough, that's not at all obvious.

Exclusion is not an issue; barriers, OTOH...  Back then we had
->i_lock taken both by dnotify_flush() before any checks and
by fcntl_dirnotify() around the fcheck+insertion.  So on close
side we had
	store NULL into descriptor table
	acquire ->i_lock
	fetch ->i_dnotify
	...
	release ->i_lock
while on fcntl() side we had
	acquire ->i_lock
	fetch from descriptor table, sod off if not our file
	...
	store ->i_dnotify
	...
	release ->i_lock
Storing NULL into descriptor table could get reordered into
->i_lock-protected area in dnotify_flush(), but it could not
get reordered past releasing ->i_lock.  So fcntl_dirnotify()
either grabbed ->i_lock before dnotify_flush() (in which case
missing the store of NULL into descriptor table wouldn't
matter, since dnotify_flush() would've observed the mark
we'd inserted) or it would've seen that store to descriptor
table.

Nowadays it's nowhere near as straightforward; in fcntl_dirnotify()
we have
        /* this is needed to prevent the fcntl/close race described below */
        mutex_lock(&dnotify_group->mark_mutex);
and it would appear to be similar to the original situation, with
->mark_mutex serving in place of ->i_lock.  However, dnotify_flush()
might not take that mutex at all - it has
        fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
        if (!fsn_mark)
                return;
before grabbing that thing.  So the things are trickier - we actually
rely upon the barriers in fsnotify_find_mark().  And those are complicated.
The case when it returns non-NULL is not a problem - there we have that
mutex providing the barriers we need.  NULL can be returned in two cases:
	a) ->i_fsnotify_marks is not empty, but it contains no
dnotify marks.  In that case we have ->i_fsnotify_marks.lock acquired
and released.  By the time it gets around to fcheck(), fcntl_dirnotify() has
either found or created and inserted a dnotify mark into that list, with
->i_fsnotify_marks.lock acquired/released around the insertion, so we
are fine - either fcntl_dirnotify() gets there first (in which case
dnotify_flush() will observe it), or release of that lock by
fsnotify_find_mark() called by dnotify_flush() will serve as a barrier,
making sure that store to descriptor table will be observed.
	b) fsnotify_find_mark() (fsnotify_grab_connector() it calls,
actually) finds ->i_fsnotify_marks empty.  That's where the things
get interesting; we have
        idx = srcu_read_lock(&fsnotify_mark_srcu);
        conn = srcu_dereference(*connp, &fsnotify_mark_srcu);
        if (!conn)
                goto out;
on dnotify_flush() side.  The matching store of fcntl_dirnotify()
side would be in fsnotify_attach_connector_to_object(), where
we have
        /*
         * cmpxchg() provides the barrier so that readers of *connp can see
         * only initialized structure
         */
        if (cmpxchg(connp, NULL, conn)) {
                /* Someone else created list structure for us */

So we have
A:
	store NULL to descriptor table
	srcu_read_lock
	srcu_dereference fetches NULL from ->i_fsnotify_marks
vs.
B:
	cmpxchg replaces NULL with non-NULL in ->i_fsnotify_marks
	fetch from descriptor table, can't miss the store done by A

Which might be safe, but the whole thing *RELLY* needs to be discussed
in fcntl_dirnotify() in more details.  fs/notify/* guts are convoluted
enough to confuse anyone unfamiliar with them.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu
  2020-12-09 16:54       ` Al Viro
@ 2020-12-09 17:44         ` Paul E. McKenney
  0 siblings, 0 replies; 77+ messages in thread
From: Paul E. McKenney @ 2020-12-09 17:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, linux-kernel, linux-fsdevel, criu, bpf,
	Linus Torvalds, Christian Brauner, Oleg Nesterov,
	Cyrill Gorcunov, Jann Horn, Kees Cook, Daniel P . Berrangé,
	Jeff Layton, Miklos Szeredi, Matthew Wilcox, J. Bruce Fields,
	Trond Myklebust, Chris Wright, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, John Fastabend, KP Singh

On Wed, Dec 09, 2020 at 04:54:11PM +0000, Al Viro wrote:
> [paulmck Cc'd]
> 
> On Mon, Dec 07, 2020 at 10:49:04PM +0000, Al Viro wrote:
> > On Mon, Dec 07, 2020 at 10:46:57PM +0000, Al Viro wrote:
> > > On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote:
> > > 
> > > >  /*
> > > >   * Check whether the specified fd has an open file.
> > > >   */
> > > > -#define fcheck(fd)	fcheck_files(current->files, fd)
> > > > +#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
> > > 
> > > Huh?
> > > fs/file.c:1113: file = fcheck(oldfd);
> > > 	dup3(), under ->file_lock, no rcu_read_lock() in sight
> > > 
> > > fs/locks.c:2548:                f = fcheck(fd);
> > > 	fcntl_setlk(), ditto
> > > 
> > > fs/locks.c:2679:                f = fcheck(fd);
> > > 	fcntl_setlk64(), ditto
> > > 
> > > fs/notify/dnotify/dnotify.c:330:        f = fcheck(fd);
> > > 	fcntl_dirnotify(); this one _is_ under rcu_read_lock().
> > > 
> > > 
> > > IOW, unless I've missed something earlier in the series, this is wrong.
> > 
> > I have missed something, all right.  Ignore that comment...
> 
> Actually, I take that back - the use of fcheck() in dnotify _is_ interesting.
> At the very least it needs to be commented upon; what that code is trying
> to prevent is a race between fcntl_dirnotify() and close(2)/dup2(2) closing
> the descriptor in question.  The problem is, dnotify marks are removed
> when we detach from descriptor table; that's done by filp_close() calling
> dnotify_flush().
> 
> Suppose fcntl(fd, F_NOTIFY, 0) in one thread races with close(fd) in another
> (both sharing the same descriptor table).  If the former had created and
> inserted a mark *after* the latter has gotten past dnotify_flush(), there
> would be nothing to evict that mark.
> 
> That's the reason that fcheck() is there.  rcu_read_lock() used to be
> sufficient, but the locking has changed since then and even if it is
> still enough, that's not at all obvious.
> 
> Exclusion is not an issue; barriers, OTOH...  Back then we had
> ->i_lock taken both by dnotify_flush() before any checks and
> by fcntl_dirnotify() around the fcheck+insertion.  So on close
> side we had
> 	store NULL into descriptor table
> 	acquire ->i_lock
> 	fetch ->i_dnotify
> 	...
> 	release ->i_lock
> while on fcntl() side we had
> 	acquire ->i_lock
> 	fetch from descriptor table, sod off if not our file
> 	...
> 	store ->i_dnotify
> 	...
> 	release ->i_lock
> Storing NULL into descriptor table could get reordered into
> ->i_lock-protected area in dnotify_flush(), but it could not
> get reordered past releasing ->i_lock.  So fcntl_dirnotify()
> either grabbed ->i_lock before dnotify_flush() (in which case
> missing the store of NULL into descriptor table wouldn't
> matter, since dnotify_flush() would've observed the mark
> we'd inserted) or it would've seen that store to descriptor
> table.
> 
> Nowadays it's nowhere near as straightforward; in fcntl_dirnotify()
> we have
>         /* this is needed to prevent the fcntl/close race described below */
>         mutex_lock(&dnotify_group->mark_mutex);
> and it would appear to be similar to the original situation, with
> ->mark_mutex serving in place of ->i_lock.  However, dnotify_flush()
> might not take that mutex at all - it has
>         fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
>         if (!fsn_mark)
>                 return;
> before grabbing that thing.  So the things are trickier - we actually
> rely upon the barriers in fsnotify_find_mark().  And those are complicated.
> The case when it returns non-NULL is not a problem - there we have that
> mutex providing the barriers we need.  NULL can be returned in two cases:
> 	a) ->i_fsnotify_marks is not empty, but it contains no
> dnotify marks.  In that case we have ->i_fsnotify_marks.lock acquired
> and released.  By the time it gets around to fcheck(), fcntl_dirnotify() has
> either found or created and inserted a dnotify mark into that list, with
> ->i_fsnotify_marks.lock acquired/released around the insertion, so we
> are fine - either fcntl_dirnotify() gets there first (in which case
> dnotify_flush() will observe it), or release of that lock by
> fsnotify_find_mark() called by dnotify_flush() will serve as a barrier,
> making sure that store to descriptor table will be observed.
> 	b) fsnotify_find_mark() (fsnotify_grab_connector() it calls,
> actually) finds ->i_fsnotify_marks empty.  That's where the things
> get interesting; we have
>         idx = srcu_read_lock(&fsnotify_mark_srcu);
>         conn = srcu_dereference(*connp, &fsnotify_mark_srcu);
>         if (!conn)
>                 goto out;
> on dnotify_flush() side.  The matching store of fcntl_dirnotify()
> side would be in fsnotify_attach_connector_to_object(), where
> we have
>         /*
>          * cmpxchg() provides the barrier so that readers of *connp can see
>          * only initialized structure
>          */
>         if (cmpxchg(connp, NULL, conn)) {
>                 /* Someone else created list structure for us */
> 
> So we have
> A:
> 	store NULL to descriptor table
> 	srcu_read_lock
> 	srcu_dereference fetches NULL from ->i_fsnotify_marks
> vs.
> B:
> 	cmpxchg replaces NULL with non-NULL in ->i_fsnotify_marks
> 	fetch from descriptor table, can't miss the store done by A
> 
> Which might be safe, but the whole thing *RELLY* needs to be discussed
> in fcntl_dirnotify() in more details.  fs/notify/* guts are convoluted
> enough to confuse anyone unfamiliar with them.

This ordering relies on the full barrier in srcu_read_lock().  There was
a time when srcu_read_lock() did not have a full barrier, and there
might well be a time in the future when srcu_read_lock() no longer has a
full barrier.  So please add smp_mb__after_srcu_read_unlock() immediately
after the srcu_read_lock() if you are relying on that full barrier.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH] files: rcu free files_struct
  2020-12-09 14:23           ` Al Viro
@ 2020-12-09 18:04             ` Eric W. Biederman
  2020-12-09 19:13               ` Linus Torvalds
  2020-12-09 19:49               ` Matthew Wilcox
  0 siblings, 2 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-09 18:04 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Christian Brauner,
	Oleg Nesterov, Jann Horn


This makes it safe to deference task->files with just rcu_read_lock
held, removing the need to take task_lock.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

I have cleaned up my patch a little more and made this a little
more explicitly rcu.  I took a completley different approach to
documenting the use of rcu_head than we described that seems a little
more robust.

 fs/file.c               | 53 ++++++++++++++++++++++++++---------------
 include/linux/fdtable.h |  1 +
 kernel/fork.c           |  4 ++--
 3 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 412033d8cfdf..88064f185560 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -379,7 +379,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 	return NULL;
 }
 
-static struct fdtable *close_files(struct files_struct * files)
+static void close_files(struct files_struct * files)
 {
 	/*
 	 * It is safe to dereference the fd table without RCU or
@@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
 		set = fdt->open_fds[j++];
 		while (set) {
 			if (set & 1) {
-				struct file * file = xchg(&fdt->fd[i], NULL);
+				struct file * file = fdt->fd[i];
 				if (file) {
+					rcu_assign_pointer(fdt->fd[i], NULL);
 					filp_close(file, files);
 					cond_resched();
 				}
@@ -407,19 +408,35 @@ static struct fdtable *close_files(struct files_struct * files)
 			set >>= 1;
 		}
 	}
+}
 
-	return fdt;
+static void free_files_struct_rcu(struct rcu_head *rcu)
+{
+	struct files_struct *files =
+		container_of(rcu, struct files_struct, fdtab.rcu);
+	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+
+	/* free the arrays if they are not embedded */
+	if (fdt != &files->fdtab)
+		__free_fdtable(fdt);
+	kmem_cache_free(files_cachep, files);
 }
 
 void put_files_struct(struct files_struct *files)
 {
 	if (atomic_dec_and_test(&files->count)) {
-		struct fdtable *fdt = close_files(files);
-
-		/* free the arrays if they are not embedded */
-		if (fdt != &files->fdtab)
-			__free_fdtable(fdt);
-		kmem_cache_free(files_cachep, files);
+		close_files(files);
+		/*
+		 * The rules for using the rcu_head in fdtable:
+		 *
+		 * If the fdtable is being freed independently
+		 * of the files_struct fdtable->rcu is used.
+		 *
+		 * When the fdtable and files_struct are freed
+		 * together the rcu_head from the fdtable embedded in
+		 * files_struct is used.
+		 */
+		call_rcu(&files->fdtab.rcu, free_files_struct_rcu);
 	}
 }
 
@@ -822,12 +839,14 @@ EXPORT_SYMBOL(fget_raw);
 
 struct file *fget_task(struct task_struct *task, unsigned int fd)
 {
+	struct files_struct *files;
 	struct file *file = NULL;
 
-	task_lock(task);
-	if (task->files)
-		file = __fget_files(task->files, fd, 0, 1);
-	task_unlock(task);
+	rcu_read_lock();
+	files = rcu_dereference(task->files);
+	if (files)
+		file = __fget_files(files, fd, 0, 1);
+	rcu_read_unlock();
 
 	return file;
 }
@@ -838,11 +857,9 @@ struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
 	struct files_struct *files;
 	struct file *file = NULL;
 
-	task_lock(task);
-	files = task->files;
+	files = rcu_dereference(task->files);
 	if (files)
 		file = files_lookup_fd_rcu(files, fd);
-	task_unlock(task);
 
 	return file;
 }
@@ -854,8 +871,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 	unsigned int fd = *ret_fd;
 	struct file *file = NULL;
 
-	task_lock(task);
-	files = task->files;
+	files = rcu_dereference(task->files);
 	if (files) {
 		for (; fd < files_fdtable(files)->max_fds; fd++) {
 			file = files_lookup_fd_rcu(files, fd);
@@ -863,7 +879,6 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 				break;
 		}
 	}
-	task_unlock(task);
 	*ret_fd = fd;
 	return file;
 }
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index d0e78174874a..37dcface020f 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -30,6 +30,7 @@ struct fdtable {
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
+	/* Used for freeing fdtable and any containing files_struct */
 	struct rcu_head rcu;
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d0ae6f827df..eca474a1586a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2982,7 +2982,7 @@ int ksys_unshare(unsigned long unshare_flags)
 
 		if (new_fd) {
 			fd = current->files;
-			current->files = new_fd;
+			rcu_assign_pointer(current->files, new_fd);
 			new_fd = fd;
 		}
 
@@ -3035,7 +3035,7 @@ int unshare_files(void)
 
 	old = task->files;
 	task_lock(task);
-	task->files = copy;
+	rcu_assign_pointer(task->files, copy);
 	task_unlock(task);
 	put_files_struct(old);
 	return 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 18:04             ` [PATCH] files: rcu free files_struct Eric W. Biederman
@ 2020-12-09 19:13               ` Linus Torvalds
  2020-12-09 19:50                 ` Al Viro
  2020-12-09 22:04                 ` [PATCH] files: rcu free files_struct Eric W. Biederman
  2020-12-09 19:49               ` Matthew Wilcox
  1 sibling, 2 replies; 77+ messages in thread
From: Linus Torvalds @ 2020-12-09 19:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> -                               struct file * file = xchg(&fdt->fd[i], NULL);
> +                               struct file * file = fdt->fd[i];
>                                 if (file) {
> +                                       rcu_assign_pointer(fdt->fd[i], NULL);

This makes me nervous. Why did we use to do that xchg() there? That
has atomicity guarantees that now are gone.

Now, this whole thing should be called for just the last ref of the fd
table, so presumably that atomicity was never needed in the first
place. But the fact that we did that very expensive xchg() then makes
me go "there's some reason for it".

Is this xchg() just bogus historical leftover? It kind of looks that
way. But maybe that change should be done separately?

              Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 18:04             ` [PATCH] files: rcu free files_struct Eric W. Biederman
  2020-12-09 19:13               ` Linus Torvalds
@ 2020-12-09 19:49               ` Matthew Wilcox
  2020-12-09 22:06                 ` Eric W. Biederman
  2020-12-09 22:58                 ` Al Viro
  1 sibling, 2 replies; 77+ messages in thread
From: Matthew Wilcox @ 2020-12-09 19:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, linux-kernel, linux-fsdevel, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
> @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
>  		set = fdt->open_fds[j++];
>  		while (set) {
>  			if (set & 1) {
> -				struct file * file = xchg(&fdt->fd[i], NULL);
> +				struct file * file = fdt->fd[i];
>  				if (file) {
> +					rcu_assign_pointer(fdt->fd[i], NULL);

Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
storing NULL, so you don't need the wmb() before storing the pointer.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 19:13               ` Linus Torvalds
@ 2020-12-09 19:50                 ` Al Viro
  2020-12-09 21:32                   ` Eric W. Biederman
  2020-12-09 21:42                   ` [PATCH -1/24] exec: Don't open code get_close_on_exec Eric W. Biederman
  2020-12-09 22:04                 ` [PATCH] files: rcu free files_struct Eric W. Biederman
  1 sibling, 2 replies; 77+ messages in thread
From: Al Viro @ 2020-12-09 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
> > +                               struct file * file = fdt->fd[i];
> >                                 if (file) {
> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
> 
> This makes me nervous. Why did we use to do that xchg() there? That
> has atomicity guarantees that now are gone.
> 
> Now, this whole thing should be called for just the last ref of the fd
> table, so presumably that atomicity was never needed in the first
> place. But the fact that we did that very expensive xchg() then makes
> me go "there's some reason for it".
> 
> Is this xchg() just bogus historical leftover? It kind of looks that
> way. But maybe that change should be done separately?

I'm still not convinced that exposing close_files() to parallel
3rd-party accesses is safe in all cases, so this patch still needs
more analysis.  And I'm none too happy about "we'll fix the things
up at the tail of the series" - the changes are subtle enough and
the area affected is rather fundamental.  So if we end up returning
to that several years from now while debugging something, I would
very much prefer to have the transformation series as clean and
understandable as possible.  It's not just about bisect hazard -
asking yourself "WTF had it been done that way, is there anything
subtle I'm missing here?" can cost many hours of head-scratching,
IME.

Eric, I understand that you want to avoid reordering/folding, but
in this case it _is_ needed.  It's not as if there had been any
serious objections to the overall direction of changes; it's
just that we need to get that as understandable as possible.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 19:50                 ` Al Viro
@ 2020-12-09 21:32                   ` Eric W. Biederman
  2020-12-10  6:13                     ` Al Viro
  2020-12-09 21:42                   ` [PATCH -1/24] exec: Don't open code get_close_on_exec Eric W. Biederman
  1 sibling, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-09 21:32 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
>> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >
>> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
>> > +                               struct file * file = fdt->fd[i];
>> >                                 if (file) {
>> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
>> 
>> This makes me nervous. Why did we use to do that xchg() there? That
>> has atomicity guarantees that now are gone.
>> 
>> Now, this whole thing should be called for just the last ref of the fd
>> table, so presumably that atomicity was never needed in the first
>> place. But the fact that we did that very expensive xchg() then makes
>> me go "there's some reason for it".
>> 
>> Is this xchg() just bogus historical leftover? It kind of looks that
>> way. But maybe that change should be done separately?
>
> I'm still not convinced that exposing close_files() to parallel
> 3rd-party accesses is safe in all cases, so this patch still needs
> more analysis.

That is fine.  I just wanted to post the latest version so we could
continue the discussion.  Especially with comments etc.

> And I'm none too happy about "we'll fix the things
> up at the tail of the series" - the changes are subtle enough and
> the area affected is rather fundamental.  So if we end up returning
> to that several years from now while debugging something, I would
> very much prefer to have the transformation series as clean and
> understandable as possible.  It's not just about bisect hazard -
> asking yourself "WTF had it been done that way, is there anything
> subtle I'm missing here?" can cost many hours of head-scratching,
> IME.

Fair enough.  I don't expect anyone is basing anything on that branch,
so a rebase is possible.

Now removing the pounding on task_lock isn't about correctness, and it
is not fixing a performance problem anyone has measured at this point.
So I do think it should be a follow on.  If for no other reason than to
keep the problem small enough it can fit in heads.

Similarly the dnotify stuff.  Your description certain makes it look
fishy but that the questionable parts are orthogonal to my patches.

> Eric, I understand that you want to avoid reordering/folding, but
> in this case it _is_ needed.  It's not as if there had been any
> serious objections to the overall direction of changes; it's
> just that we need to get that as understandable as possible.

I will post the patch that will become -1/24 in a moment.

Eric




^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH -1/24] exec: Don't open code get_close_on_exec
  2020-12-09 19:50                 ` Al Viro
  2020-12-09 21:32                   ` Eric W. Biederman
@ 2020-12-09 21:42                   ` Eric W. Biederman
  1 sibling, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-09 21:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn


Al Viro pointed out that using the phrase "close_on_exec(fd,
rcu_dereference_raw(current->files->fdt))" instead of wrapping it in
rcu_read_lock(), rcu_read_unlock() is a very questionable
optimization[1].

Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase
becomes equivalent the helper function get_close_on_exec so
simplify the code and make it more robust by simply using
get_close_on_exec.

[1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk
Suggested-by: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

I aim to rebase my exec branch and apply this tomorrow.

 fs/exec.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..9aabf6e8c904 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1808,8 +1808,7 @@ static int bprm_execve(struct linux_binprm *bprm,
 	 * inaccessible after exec. Relies on having exclusive access to
 	 * current->files (due to unshare_files above).
 	 */
-	if (bprm->fdpath &&
-	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
+	if (bprm->fdpath && get_close_on_exec(fd))
 		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
 
 	/* Set the unchanging part of bprm->cred */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 19:13               ` Linus Torvalds
  2020-12-09 19:50                 ` Al Viro
@ 2020-12-09 22:04                 ` Eric W. Biederman
  1 sibling, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-09 22:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> -                               struct file * file = xchg(&fdt->fd[i], NULL);
>> +                               struct file * file = fdt->fd[i];
>>                                 if (file) {
>> +                                       rcu_assign_pointer(fdt->fd[i], NULL);
>
> This makes me nervous. Why did we use to do that xchg() there? That
> has atomicity guarantees that now are gone.
>
> Now, this whole thing should be called for just the last ref of the fd
> table, so presumably that atomicity was never needed in the first
> place. But the fact that we did that very expensive xchg() then makes
> me go "there's some reason for it".
>
> Is this xchg() just bogus historical leftover? It kind of looks that
> way. But maybe that change should be done separately?

Removing the xchg in a separate patch seems reasonable.  Just to make
the review easier.

I traced the xchg back to 7cf4dc3c8dbf ("move files_struct-related bits
from kernel/exit.c to fs/file.c") when put_files_struct was introduced.
The xchg did not exist before that change.

There were many other xchgs in the code back then so I suspect was left
over from some way an earlier version of the change worked and simply
was not removed when the patch was updated.

Eric

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 19:49               ` Matthew Wilcox
@ 2020-12-09 22:06                 ` Eric W. Biederman
  2020-12-09 22:58                 ` Al Viro
  1 sibling, 0 replies; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-09 22:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, linux-kernel, linux-fsdevel, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Jann Horn

Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
>> @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
>>  		set = fdt->open_fds[j++];
>>  		while (set) {
>>  			if (set & 1) {
>> -				struct file * file = xchg(&fdt->fd[i], NULL);
>> +				struct file * file = fdt->fd[i];
>>  				if (file) {
>> +					rcu_assign_pointer(fdt->fd[i], NULL);
>
> Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> storing NULL, so you don't need the wmb() before storing the pointer.

Thanks.  I was remembering there was a special case like and I had
forgotten what the rule was.

Eric


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 19:49               ` Matthew Wilcox
  2020-12-09 22:06                 ` Eric W. Biederman
@ 2020-12-09 22:58                 ` Al Viro
  2020-12-09 23:01                   ` Linus Torvalds
  1 sibling, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-09 22:58 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Eric W. Biederman, linux-kernel, linux-fsdevel, Linus Torvalds,
	Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
> > @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
> >  		set = fdt->open_fds[j++];
> >  		while (set) {
> >  			if (set & 1) {
> > -				struct file * file = xchg(&fdt->fd[i], NULL);
> > +				struct file * file = fdt->fd[i];
> >  				if (file) {
> > +					rcu_assign_pointer(fdt->fd[i], NULL);
> 
> Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> storing NULL, so you don't need the wmb() before storing the pointer.

fs/file.c:pick_file() would make more interesting target for the same treatment...

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 22:58                 ` Al Viro
@ 2020-12-09 23:01                   ` Linus Torvalds
  2020-12-09 23:04                     ` Linus Torvalds
  2020-12-09 23:07                     ` Matthew Wilcox
  0 siblings, 2 replies; 77+ messages in thread
From: Linus Torvalds @ 2020-12-09 23:01 UTC (permalink / raw)
  To: Al Viro
  Cc: Matthew Wilcox, Eric W. Biederman, Linux Kernel Mailing List,
	linux-fsdevel, Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> >
> > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > storing NULL, so you don't need the wmb() before storing the pointer.
>
> fs/file.c:pick_file() would make more interesting target for the same treatment...

Actually, don't.

rcu_assign_pointer() itself already does the optimization for the case
of a constant NULL pointer assignment.

So there's no need to manually change things to RCU_INIT_POINTER().

           Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 23:01                   ` Linus Torvalds
@ 2020-12-09 23:04                     ` Linus Torvalds
  2020-12-09 23:07                     ` Matthew Wilcox
  1 sibling, 0 replies; 77+ messages in thread
From: Linus Torvalds @ 2020-12-09 23:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Matthew Wilcox, Eric W. Biederman, Linux Kernel Mailing List,
	linux-fsdevel, Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 9, 2020 at 3:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> rcu_assign_pointer() itself already does the optimization for the case
> of a constant NULL pointer assignment.
>
> So there's no need to manually change things to RCU_INIT_POINTER().

Side note: what should be done instead is to delete the stale comment.
It should have been removed back when the optimization was done in
2016 - see commit 3a37f7275cda ("rcu: No ordering for
rcu_assign_pointer() of NULL")

            Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 23:01                   ` Linus Torvalds
  2020-12-09 23:04                     ` Linus Torvalds
@ 2020-12-09 23:07                     ` Matthew Wilcox
  2020-12-09 23:26                       ` Paul E. McKenney
  2020-12-09 23:29                       ` Linus Torvalds
  1 sibling, 2 replies; 77+ messages in thread
From: Matthew Wilcox @ 2020-12-09 23:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Eric W. Biederman, Linux Kernel Mailing List,
	linux-fsdevel, Christian Brauner, Oleg Nesterov, Jann Horn,
	Paul E. McKenney

On Wed, Dec 09, 2020 at 03:01:36PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> > >
> > > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > > storing NULL, so you don't need the wmb() before storing the pointer.
> >
> > fs/file.c:pick_file() would make more interesting target for the same treatment...
> 
> Actually, don't.
> 
> rcu_assign_pointer() itself already does the optimization for the case
> of a constant NULL pointer assignment.
> 
> So there's no need to manually change things to RCU_INIT_POINTER().

I missed that, and the documentation wasn't updated by
3a37f7275cda5ad25c1fe9be8f20c76c60d175fa.

Paul, how about this?

+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1668,8 +1668,10 @@ against mishaps and misuse:
    this purpose.
 #. It is not necessary to use rcu_assign_pointer() when creating
    linked structures that are to be published via a single external
-   pointer. The RCU_INIT_POINTER() macro is provided for this task
-   and also for assigning ``NULL`` pointers at runtime.
+   pointer. The RCU_INIT_POINTER() macro is provided for this task.
+   It used to be more efficient to use RCU_INIT_POINTER() to store a
+   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
+   ``NULL`` pointer itself.
 
 This not a hard-and-fast list: RCU's diagnostic capabilities will
 continue to be guided by the number and type of usage bugs found in


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 23:07                     ` Matthew Wilcox
@ 2020-12-09 23:26                       ` Paul E. McKenney
  2020-12-09 23:29                       ` Linus Torvalds
  1 sibling, 0 replies; 77+ messages in thread
From: Paul E. McKenney @ 2020-12-09 23:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Eric W. Biederman,
	Linux Kernel Mailing List, linux-fsdevel, Christian Brauner,
	Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 11:07:55PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 03:01:36PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> > > >
> > > > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > > > storing NULL, so you don't need the wmb() before storing the pointer.
> > >
> > > fs/file.c:pick_file() would make more interesting target for the same treatment...
> > 
> > Actually, don't.
> > 
> > rcu_assign_pointer() itself already does the optimization for the case
> > of a constant NULL pointer assignment.
> > 
> > So there's no need to manually change things to RCU_INIT_POINTER().
> 
> I missed that, and the documentation wasn't updated by
> 3a37f7275cda5ad25c1fe9be8f20c76c60d175fa.

Can't trust the author of that patch!  ;-)

> Paul, how about this?
> 
> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> @@ -1668,8 +1668,10 @@ against mishaps and misuse:
>     this purpose.
>  #. It is not necessary to use rcu_assign_pointer() when creating
>     linked structures that are to be published via a single external
> -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> -   and also for assigning ``NULL`` pointers at runtime.
> +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> +   ``NULL`` pointer itself.
>  
>  This not a hard-and-fast list: RCU's diagnostic capabilities will
>  continue to be guided by the number and type of usage bugs found in

Looks good to me!  If you send a complete patch, I will be happy to pull
it in.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 23:07                     ` Matthew Wilcox
  2020-12-09 23:26                       ` Paul E. McKenney
@ 2020-12-09 23:29                       ` Linus Torvalds
  2020-12-10  0:39                         ` Paul E. McKenney
  1 sibling, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2020-12-09 23:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Al Viro, Eric W. Biederman, Linux Kernel Mailing List,
	linux-fsdevel, Christian Brauner, Oleg Nesterov, Jann Horn,
	Paul E. McKenney

On Wed, Dec 9, 2020 at 3:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
>  #. It is not necessary to use rcu_assign_pointer() when creating
>     linked structures that are to be published via a single external
> -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> -   and also for assigning ``NULL`` pointers at runtime.
> +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> +   ``NULL`` pointer itself.

I would just remove the historical note about "It used to be more
efficient ..". It's not really helpful.

If somebody wants to dig into the history, it's there in git.

Maybe the note could be part of the commit message for the comment
update, explaining that it used to be more efficient but no longer is.
Because commit messages are about the explanation for the change.

But I don't feel it makes any sense in the source code.

             Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 23:29                       ` Linus Torvalds
@ 2020-12-10  0:39                         ` Paul E. McKenney
  2020-12-10  0:41                           ` Linus Torvalds
  0 siblings, 1 reply; 77+ messages in thread
From: Paul E. McKenney @ 2020-12-10  0:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Al Viro, Eric W. Biederman,
	Linux Kernel Mailing List, linux-fsdevel, Christian Brauner,
	Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 03:29:09PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 3:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> >  #. It is not necessary to use rcu_assign_pointer() when creating
> >     linked structures that are to be published via a single external
> > -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> > -   and also for assigning ``NULL`` pointers at runtime.
> > +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> > +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> > +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> > +   ``NULL`` pointer itself.
> 
> I would just remove the historical note about "It used to be more
> efficient ..". It's not really helpful.
> 
> If somebody wants to dig into the history, it's there in git.
> 
> Maybe the note could be part of the commit message for the comment
> update, explaining that it used to be more efficient but no longer is.
> Because commit messages are about the explanation for the change.
> 
> But I don't feel it makes any sense in the source code.

Like this, then?

							Thanx, Paul

 #. It is not necessary to use rcu_assign_pointer() when creating
    linked structures that are to be published via a single external
-   pointer. The RCU_INIT_POINTER() macro is provided for this task
-   and also for assigning ``NULL`` pointers at runtime.
+   pointer. The RCU_INIT_POINTER() macro is provided for this task.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10  0:39                         ` Paul E. McKenney
@ 2020-12-10  0:41                           ` Linus Torvalds
  2020-12-10  0:57                             ` Paul E. McKenney
  0 siblings, 1 reply; 77+ messages in thread
From: Linus Torvalds @ 2020-12-10  0:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Al Viro, Eric W. Biederman,
	Linux Kernel Mailing List, linux-fsdevel, Christian Brauner,
	Oleg Nesterov, Jann Horn

On Wed, Dec 9, 2020 at 4:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> Like this, then?

Ack.

           Linus

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10  0:41                           ` Linus Torvalds
@ 2020-12-10  0:57                             ` Paul E. McKenney
  0 siblings, 0 replies; 77+ messages in thread
From: Paul E. McKenney @ 2020-12-10  0:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Al Viro, Eric W. Biederman,
	Linux Kernel Mailing List, linux-fsdevel, Christian Brauner,
	Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 04:41:06PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 4:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > Like this, then?
> 
> Ack.

Queued with Matthew's Reported-by and your Acked-by, thank you all!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-09 21:32                   ` Eric W. Biederman
@ 2020-12-10  6:13                     ` Al Viro
  2020-12-10 19:29                       ` Eric W. Biederman
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-10  6:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

On Wed, Dec 09, 2020 at 03:32:38PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
> >> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> >
> >> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
> >> > +                               struct file * file = fdt->fd[i];
> >> >                                 if (file) {
> >> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
> >> 
> >> This makes me nervous. Why did we use to do that xchg() there? That
> >> has atomicity guarantees that now are gone.
> >> 
> >> Now, this whole thing should be called for just the last ref of the fd
> >> table, so presumably that atomicity was never needed in the first
> >> place. But the fact that we did that very expensive xchg() then makes
> >> me go "there's some reason for it".
> >> 
> >> Is this xchg() just bogus historical leftover? It kind of looks that
> >> way. But maybe that change should be done separately?
> >
> > I'm still not convinced that exposing close_files() to parallel
> > 3rd-party accesses is safe in all cases, so this patch still needs
> > more analysis.
> 
> That is fine.  I just wanted to post the latest version so we could
> continue the discussion.  Especially with comments etc.

It's probably safe.  I've spent today digging through the mess in
fs/notify and kernel/bpf, and while I'm disgusted with both, at
that point I believe that close_files() exposure is not going to
create problems with either.  And xchg() in there _is_ useless.

Said that, BPF "file iterator" stuff is potentially very unpleasant -
it allows to pin a struct file found in any process' descriptor table
indefinitely long.  Temporary struct file references grabbed by procfs
code, while unfortunate, are at least short-lived; with this stuff sky's
the limit.

I'm not happy about having that available, especially if it's a user-visible
primitive we can't withdraw at zero notice ;-/

What are the users of that thing and is there any chance to replace it
with something saner?  IOW, what *is* realistically called for each
struct file by the users of that iterator?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10  6:13                     ` Al Viro
@ 2020-12-10 19:29                       ` Eric W. Biederman
  2020-12-10 21:36                         ` Al Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Eric W. Biederman @ 2020-12-10 19:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Dec 09, 2020 at 03:32:38PM -0600, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>> > On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
>> >> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >> >
>> >> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
>> >> > +                               struct file * file = fdt->fd[i];
>> >> >                                 if (file) {
>> >> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
>> >> 
>> >> This makes me nervous. Why did we use to do that xchg() there? That
>> >> has atomicity guarantees that now are gone.
>> >> 
>> >> Now, this whole thing should be called for just the last ref of the fd
>> >> table, so presumably that atomicity was never needed in the first
>> >> place. But the fact that we did that very expensive xchg() then makes
>> >> me go "there's some reason for it".
>> >> 
>> >> Is this xchg() just bogus historical leftover? It kind of looks that
>> >> way. But maybe that change should be done separately?
>> >
>> > I'm still not convinced that exposing close_files() to parallel
>> > 3rd-party accesses is safe in all cases, so this patch still needs
>> > more analysis.
>> 
>> That is fine.  I just wanted to post the latest version so we could
>> continue the discussion.  Especially with comments etc.
>
> It's probably safe.  I've spent today digging through the mess in
> fs/notify and kernel/bpf, and while I'm disgusted with both, at
> that point I believe that close_files() exposure is not going to
> create problems with either.  And xchg() in there _is_ useless.

Then I will work on a cleaned up version.

> Said that, BPF "file iterator" stuff is potentially very unpleasant -
> it allows to pin a struct file found in any process' descriptor table
> indefinitely long.  Temporary struct file references grabbed by procfs
> code, while unfortunate, are at least short-lived; with this stuff sky's
> the limit.
>
> I'm not happy about having that available, especially if it's a user-visible
> primitive we can't withdraw at zero notice ;-/
>
> What are the users of that thing and is there any chance to replace it
> with something saner?  IOW, what *is* realistically called for each
> struct file by the users of that iterator?

The bpf guys are no longer Cc'd and they can probably answer better than
I.

In a previous conversation it was mentioned that task_iter was supposed
to be a high performance interface for getting proc like data out of the
kernel using bpf.

If so I think that handles the lifetime issues as bpf programs are
supposed to be short-lived and can not pass references anywhere.

On the flip side it raises the question did the BPF guys just make the
current layout of task_struct and struct file part of the linux kernel
user space ABI?

Eric



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10 19:29                       ` Eric W. Biederman
@ 2020-12-10 21:36                         ` Al Viro
  2020-12-10 22:30                           ` Christian Brauner
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-10 21:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel,
	Christian Brauner, Oleg Nesterov, Jann Horn

On Thu, Dec 10, 2020 at 01:29:01PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:

> > What are the users of that thing and is there any chance to replace it
> > with something saner?  IOW, what *is* realistically called for each
> > struct file by the users of that iterator?
> 
> The bpf guys are no longer Cc'd and they can probably answer better than
> I.
> 
> In a previous conversation it was mentioned that task_iter was supposed
> to be a high performance interface for getting proc like data out of the
> kernel using bpf.
> 
> If so I think that handles the lifetime issues as bpf programs are
> supposed to be short-lived and can not pass references anywhere.
> 
> On the flip side it raises the question did the BPF guys just make the
> current layout of task_struct and struct file part of the linux kernel
> user space ABI?

An interesting question, that...  For the record: anybody coming to
complain about a removed/renamed/replaced with something else field
in struct file will be refered to Figure 1.

None of the VFS data structures has any layout stability warranties.
If BPF folks want access to something in that, they are welcome to come
and discuss the set of accessors; so far nothing of that sort has happened.

Direct access to any fields of any of those structures is subject to
being broken at zero notice.

IMO we need some notation for a structure being off-limits for BPF, tracing,
etc., along the lines of "don't access any field directly".

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10 21:36                         ` Al Viro
@ 2020-12-10 22:30                           ` Christian Brauner
  2020-12-10 22:54                             ` Al Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Christian Brauner @ 2020-12-10 22:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric W. Biederman, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel, Oleg Nesterov, Jann Horn

On Thu, Dec 10, 2020 at 09:36:24PM +0000, Al Viro wrote:
> On Thu, Dec 10, 2020 at 01:29:01PM -0600, Eric W. Biederman wrote:
> > Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > > What are the users of that thing and is there any chance to replace it
> > > with something saner?  IOW, what *is* realistically called for each
> > > struct file by the users of that iterator?
> > 
> > The bpf guys are no longer Cc'd and they can probably answer better than
> > I.
> > 
> > In a previous conversation it was mentioned that task_iter was supposed
> > to be a high performance interface for getting proc like data out of the
> > kernel using bpf.
> > 
> > If so I think that handles the lifetime issues as bpf programs are
> > supposed to be short-lived and can not pass references anywhere.
> > 
> > On the flip side it raises the question did the BPF guys just make the
> > current layout of task_struct and struct file part of the linux kernel
> > user space ABI?
> 
> An interesting question, that...  For the record: anybody coming to

Imho, they did. An example from the BPF LSM: a few weeks ago someone
asked me whether it would be possible to use the BPF LSM to enforce you
can't open files when they are on a given filesystem. Sine this bpf lsm
allows to attach to lsm hooks, say security_file_open(), you can get at
the superblock and check the filesyste type in a bpf program
(requiring btf), i.e. security_file_open, then follow
file->f_inode->i_sb->s_type->s_magic. If we change the say struct
super_block I'd expect these bpf programs to break. I'm sure there's
something clever that they came up with but it is nonetheless
uncomfortably close to making internal kernel structures part of
userspace ABI indeed.

> complain about a removed/renamed/replaced with something else field
> in struct file will be refered to Figure 1.
> 
> None of the VFS data structures has any layout stability warranties.
> If BPF folks want access to something in that, they are welcome to come
> and discuss the set of accessors; so far nothing of that sort has happened.
> 
> Direct access to any fields of any of those structures is subject to
> being broken at zero notice.
> 
> IMO we need some notation for a structure being off-limits for BPF, tracing,
> etc., along the lines of "don't access any field directly".

Indeed. I would also like to see a list where changes need to be sent
that are technically specific to a subsystem but will necessarily have
kernel-wide impact prime example: a lot of bpf.

Christian

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10 22:30                           ` Christian Brauner
@ 2020-12-10 22:54                             ` Al Viro
  2020-12-10 23:10                               ` Al Viro
  0 siblings, 1 reply; 77+ messages in thread
From: Al Viro @ 2020-12-10 22:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel, Oleg Nesterov, Jann Horn

On Thu, Dec 10, 2020 at 11:30:24PM +0100, Christian Brauner wrote:
> (requiring btf), i.e. security_file_open, then follow
> file->f_inode->i_sb->s_type->s_magic. If we change the say struct
> super_block I'd expect these bpf programs to break.

To break they would need to have compiled in the first place;
->s_type is struct file_system_type and it contains no ->s_magic
(nor would it be possible, really - ->s_magic can vary between
filesystems that *do* share ->s_type).

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH] files: rcu free files_struct
  2020-12-10 22:54                             ` Al Viro
@ 2020-12-10 23:10                               ` Al Viro
  0 siblings, 0 replies; 77+ messages in thread
From: Al Viro @ 2020-12-10 23:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric W. Biederman, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel, Oleg Nesterov, Jann Horn

On Thu, Dec 10, 2020 at 10:54:05PM +0000, Al Viro wrote:
> On Thu, Dec 10, 2020 at 11:30:24PM +0100, Christian Brauner wrote:
> > (requiring btf), i.e. security_file_open, then follow
> > file->f_inode->i_sb->s_type->s_magic. If we change the say struct
> > super_block I'd expect these bpf programs to break.
> 
> To break they would need to have compiled in the first place;
> ->s_type is struct file_system_type and it contains no ->s_magic
> (nor would it be possible, really - ->s_magic can vary between
> filesystems that *do* share ->s_type).

Incidentally, a lot of things in e.g. struct dentry need care when
accessing; the fields are there, but e.g. blind access to name or
parent really can oops.  Moreover, blindly following a chain of
->d_parent pointers without taking appropriate precautions might
end up reading from arbitrary kernel address, including iomem ones.
I don't see anything that would prevent that...

TAINT_BPF would probably be too impractical, since there's a lot
of boxen using it more reasonably on the networking side.  But
it really looks like we *do* need annotations with their violation
triggering a taint, so that BS bug reports could be discarded.

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2020-12-10 23:12 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <87r1on1v62.fsf@x220.int.ebiederm.org>
2020-11-20 23:14 ` [PATCH v2 01/24] exec: Move unshare_files to fix posix file locking during exec Eric W. Biederman
2020-11-20 23:58   ` Linus Torvalds
2020-12-07 22:22   ` Al Viro
     [not found]     ` <87zh2pusdw.fsf@x220.int.ebiederm.org>
2020-12-07 23:35       ` Al Viro
2020-11-20 23:14 ` [PATCH v2 02/24] exec: Simplify unshare_files Eric W. Biederman
2020-11-23 17:50   ` Oleg Nesterov
2020-11-23 18:25     ` Linus Torvalds
     [not found]       ` <87im9vx08i.fsf@x220.int.ebiederm.org>
     [not found]         ` <87pn42r0n7.fsf@x220.int.ebiederm.org>
2020-11-24 19:58           ` Linus Torvalds
2020-11-24 20:14             ` Arnd Bergmann
2020-11-24 23:44               ` Geoff Levand
2020-11-25  1:16                 ` Eric W. Biederman
2020-11-27 20:29                   ` Arnd Bergmann
2020-11-30 21:37                     ` Eric W. Biederman
2020-12-01  9:46                     ` Michael Ellerman
2020-11-25 21:51                 ` [RFC][PATCH] coredump: Document coredump code exclusively used by cell spufs Eric W. Biederman
2020-12-02 15:20                   ` Eric W. Biederman
2020-12-02 15:58                     ` Arnd Bergmann
2020-12-07 22:32       ` [PATCH v2 02/24] exec: Simplify unshare_files Al Viro
2020-11-20 23:14 ` [PATCH v2 03/24] exec: Remove reset_files_struct Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 04/24] kcmp: In kcmp_epoll_target use fget_task Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 05/24] bpf: In bpf_task_fd_query " Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 06/24] proc/fd: In proc_fd_link " Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 07/24] file: Rename __fcheck_files to files_lookup_fd_raw Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 08/24] file: Factor files_lookup_fd_locked out of fcheck_files Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 09/24] file: Replace fcheck_files with files_lookup_fd_rcu Eric W. Biederman
2020-12-07 22:46   ` Al Viro
2020-12-07 22:49     ` Al Viro
2020-12-09 16:54       ` Al Viro
2020-12-09 17:44         ` Paul E. McKenney
2020-11-20 23:14 ` [PATCH v2 10/24] file: Rename fcheck lookup_fd_rcu Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 11/24] file: Implement task_lookup_fd_rcu Eric W. Biederman
2020-11-21 18:19   ` Cyrill Gorcunov
     [not found]     ` <87blfp1r8b.fsf@x220.int.ebiederm.org>
2020-11-22 13:52       ` Cyrill Gorcunov
2020-11-20 23:14 ` [PATCH v2 12/24] proc/fd: In tid_fd_mode use task_lookup_fd_rcu Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 13/24] kcmp: In get_file_raw_ptr " Eric W. Biederman
2020-11-23 21:05   ` Cyrill Gorcunov
2020-11-20 23:14 ` [PATCH v2 14/24] file: Implement task_lookup_next_fd_rcu Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 15/24] proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu Eric W. Biederman
2020-12-07 23:29   ` Al Viro
     [not found]     ` <877dprvs8e.fsf@x220.int.ebiederm.org>
2020-12-09  4:07       ` Al Viro
     [not found]         ` <877dprtxly.fsf@x220.int.ebiederm.org>
2020-12-09 14:23           ` Al Viro
2020-12-09 18:04             ` [PATCH] files: rcu free files_struct Eric W. Biederman
2020-12-09 19:13               ` Linus Torvalds
2020-12-09 19:50                 ` Al Viro
2020-12-09 21:32                   ` Eric W. Biederman
2020-12-10  6:13                     ` Al Viro
2020-12-10 19:29                       ` Eric W. Biederman
2020-12-10 21:36                         ` Al Viro
2020-12-10 22:30                           ` Christian Brauner
2020-12-10 22:54                             ` Al Viro
2020-12-10 23:10                               ` Al Viro
2020-12-09 21:42                   ` [PATCH -1/24] exec: Don't open code get_close_on_exec Eric W. Biederman
2020-12-09 22:04                 ` [PATCH] files: rcu free files_struct Eric W. Biederman
2020-12-09 19:49               ` Matthew Wilcox
2020-12-09 22:06                 ` Eric W. Biederman
2020-12-09 22:58                 ` Al Viro
2020-12-09 23:01                   ` Linus Torvalds
2020-12-09 23:04                     ` Linus Torvalds
2020-12-09 23:07                     ` Matthew Wilcox
2020-12-09 23:26                       ` Paul E. McKenney
2020-12-09 23:29                       ` Linus Torvalds
2020-12-10  0:39                         ` Paul E. McKenney
2020-12-10  0:41                           ` Linus Torvalds
2020-12-10  0:57                             ` Paul E. McKenney
2020-11-20 23:14 ` [PATCH v2 16/24] bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu Eric W. Biederman
2020-11-23 17:06   ` Yonghong Song
2020-11-20 23:14 ` [PATCH v2 17/24] proc/fd: In fdinfo seq_show don't use get_files_struct Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 18/24] file: Merge __fd_install into fd_install Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 19/24] file: In f_dupfd read RLIMIT_NOFILE once Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 20/24] file: Merge __alloc_fd into alloc_fd Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 21/24] file: Rename __close_fd to close_fd and remove the files parameter Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 22/24] file: Replace ksys_close with close_fd Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 23/24] file: Rename __close_fd_get_file close_fd_get_file Eric W. Biederman
2020-11-20 23:14 ` [PATCH v2 24/24] file: Remove get_files_struct Eric W. Biederman
2020-11-21  0:05 ` [PATCH v2 00/24] exec: Move unshare_files and guarantee files_struct.count is correct Linus Torvalds
2020-11-28  5:12   ` Al Viro
2020-12-07 23:56     ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).