linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/3] exec: fix passing of file locks across execve in multithreaded processes
@ 2018-09-14 10:53 Jeff Layton
  2018-09-14 10:53 ` [PATCH v3 1/3] exec: separate thread_count for files_struct Jeff Layton
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Jeff Layton @ 2018-09-14 10:53 UTC (permalink / raw)
  To: viro; +Cc: ebiederm, berrange, linux-fsdevel, linux-kernel

v2: fix displaced_files cleanup in __do_execve_file
v3: fix thread_count handling in unshare syscall
    add new release_files_struct helper

The main difference from the earlier set is some cleanup and fixes for
the thread_count handling in the files_struct. The original looked a
bit more ugly without the new helper function, and didn't get the
refcounting right in the unshare case, which was discovered by the
kernel test robot seeing the new WARN pop up during one test.

At this point, I'm mainly interested in whether this is the correct
approach to fix this at all. It is along the lines of what Eric B.
suggested, but I'm not sure whether it may break other cases that I
haven't considered.

-------------------------8<----------------------

A few months ago, Dan reported that when you call execve in process that
is multithreaded, any traditional POSIX locks are silently dropped.

The problem is that we end up unsharing the files_struct from the
process very early during exec, when it looks like it's shared between
tasks. Eventually, when the other, non-exec'ing tasks are killed, we
tear down the old files_struct. That ends up tearing down the old
files_struct, which ends up looking like a close() was issues on each
open fd and that causes the locks to be dropped.

This patchset is a second stab at fixing this issue, this time following
the method suggested by Eric Biederman. The idea here is to move the
unshare_files() call after de_thread(), which helps ensure that we only
unshare the files_struct when it's truly shared between different
processes, and not just when the exec'ing process is multithreaded.

This seems to fix the originally reported problem (now known as xfstest
generic/484), and basic testing doesn't seem to show any issues.

During the original discussion though, Al had mentioned that this could
be problematic due to the fdtable being modifiable by other threads
(or even processes) during the binfmt probe. That may make this idea
non-viable.

I'm also not terribly thrilled with the way this sprinkles the
files_struct->file_lock all over the place. It may be possible to do
some of this with atomic ops if the basic approach turns out to be
reasonable.

Comments and suggestions welcome.

Jeff Layton (3):
  exec: separate thread_count for files_struct
  exec: delay clone(CLONE_FILES) if task associated with current
    files_struct is exec'ing
  exec: do unshare_files after de_thread

 fs/coredump.c           |  2 +-
 fs/exec.c               | 27 +++++++++++++++++++--------
 fs/file.c               | 19 +++++++++++++++++--
 include/linux/binfmts.h |  1 +
 include/linux/fdtable.h |  3 +++
 kernel/fork.c           | 25 ++++++++++++++++++++-----
 6 files changed, 61 insertions(+), 16 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/3] exec: separate thread_count for files_struct
  2018-09-14 10:53 [PATCH v3 0/3] exec: fix passing of file locks across execve in multithreaded processes Jeff Layton
@ 2018-09-14 10:53 ` Jeff Layton
  2018-09-15 16:04   ` Oleg Nesterov
  2018-09-14 10:53 ` [PATCH v3 2/3] exec: delay clone(CLONE_FILES) if task associated with current files_struct is exec'ing Jeff Layton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Jeff Layton @ 2018-09-14 10:53 UTC (permalink / raw)
  To: viro; +Cc: ebiederm, berrange, linux-fsdevel, linux-kernel

Currently, we have a single refcount variable inside the files_struct.
When we go to unshare the files_struct, we check this counter and if
it's elevated, then we allocate a new files_struct instead of just
repurposing the old one, under the assumption that that indicates that
it's shared between tasks.

This is not necessarily the case, however. Each task associated with the
files_struct does get a long-held reference, but the refcount can be
elevated for other reasons as well, by callers of get_files_struct.

This means that we can end up allocating a new files_struct if we just
happen to race with a call to get_files_struct. Fix this by adding a new
counter to track the number associated threads, and use that to
determine whether to allocate a new files_struct when unsharing.

We protect this new counter with the file_lock instead of doing it with
atomics, as a later patch will need to track an "in_exec" flag that will
need to be handled in conjunction with the thread counter.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/coredump.c           |  2 +-
 fs/exec.c               |  2 +-
 fs/file.c               | 18 ++++++++++++++++--
 include/linux/fdtable.h |  2 ++
 kernel/fork.c           | 16 ++++++++++++----
 5 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 1e2c87acac9b..02374a4febc4 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -751,7 +751,7 @@ void do_coredump(const siginfo_t *siginfo)
 	if (retval)
 		goto close_fail;
 	if (displaced)
-		put_files_struct(displaced);
+		release_files_struct(displaced);
 	if (!dump_interrupted()) {
 		file_start_write(cprm.file);
 		core_dumped = binfmt->core_dump(&cprm);
diff --git a/fs/exec.c b/fs/exec.c
index 1ebf6e5a521d..e71a70d70158 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1832,7 +1832,7 @@ static int __do_execve_file(int fd, struct filename *filename,
 	if (filename)
 		putname(filename);
 	if (displaced)
-		put_files_struct(displaced);
+		release_files_struct(displaced);
 	return retval;
 
 out:
diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..e35eeff4593c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -284,6 +284,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	atomic_set(&newf->count, 1);
 
 	spin_lock_init(&newf->file_lock);
+	newf->thread_count = 1;
 	newf->resize_in_progress = false;
 	init_waitqueue_head(&newf->resize_wait);
 	newf->next_fd = 0;
@@ -415,6 +416,8 @@ void put_files_struct(struct files_struct *files)
 	if (atomic_dec_and_test(&files->count)) {
 		struct fdtable *fdt = close_files(files);
 
+		WARN_ON_ONCE(files->thread_count);
+
 		/* free the arrays if they are not embedded */
 		if (fdt != &files->fdtab)
 			__free_fdtable(fdt);
@@ -422,16 +425,26 @@ void put_files_struct(struct files_struct *files)
 	}
 }
 
+void release_files_struct(struct files_struct *files)
+{
+	spin_lock(&files->file_lock);
+	--files->thread_count;
+	spin_unlock(&files->file_lock);
+	put_files_struct(files);
+}
+
 void reset_files_struct(struct files_struct *files)
 {
 	struct task_struct *tsk = current;
 	struct files_struct *old;
 
 	old = tsk->files;
+
 	task_lock(tsk);
 	tsk->files = files;
 	task_unlock(tsk);
-	put_files_struct(old);
+
+	release_files_struct(old);
 }
 
 void exit_files(struct task_struct *tsk)
@@ -442,7 +455,7 @@ void exit_files(struct task_struct *tsk)
 		task_lock(tsk);
 		tsk->files = NULL;
 		task_unlock(tsk);
-		put_files_struct(files);
+		release_files_struct(files);
 	}
 }
 
@@ -457,6 +470,7 @@ struct files_struct init_files = {
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
+	.thread_count	= 1,
 };
 
 static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 41615f38bcff..2cb02f438201 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -59,6 +59,7 @@ struct files_struct {
    * written part on a separate cache line in SMP
    */
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
+	int thread_count;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
@@ -107,6 +108,7 @@ struct task_struct;
 
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
+void release_files_struct(struct files_struct *fs);
 void reset_files_struct(struct files_struct *);
 int unshare_files(struct files_struct **);
 struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
diff --git a/kernel/fork.c b/kernel/fork.c
index f0b58479534f..c7eb63e4d5c1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1373,6 +1373,9 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 
 	if (clone_flags & CLONE_FILES) {
 		atomic_inc(&oldf->count);
+		spin_lock(&oldf->file_lock);
+		oldf->thread_count++;
+		spin_unlock(&oldf->file_lock);
 		goto out;
 	}
 
@@ -2416,10 +2419,15 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp)
 {
 	struct files_struct *fd = current->files;
-	int error = 0;
+	int error, count;
+
+	if (!(unshare_flags & CLONE_FILES) || !fd)
+		return 0;
 
-	if ((unshare_flags & CLONE_FILES) &&
-	    (fd && atomic_read(&fd->count) > 1)) {
+	spin_lock(&fd->file_lock);
+	count = fd->thread_count;
+	spin_unlock(&fd->file_lock);
+	if (count > 1) {
 		*new_fdp = dup_fd(fd, &error);
 		if (!*new_fdp)
 			return error;
@@ -2542,7 +2550,7 @@ int ksys_unshare(unsigned long unshare_flags)
 		put_cred(new_cred);
 bad_unshare_cleanup_fd:
 	if (new_fd)
-		put_files_struct(new_fd);
+		release_files_struct(new_fd);
 
 bad_unshare_cleanup_fs:
 	if (new_fs)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 2/3] exec: delay clone(CLONE_FILES) if task associated with current files_struct is exec'ing
  2018-09-14 10:53 [PATCH v3 0/3] exec: fix passing of file locks across execve in multithreaded processes Jeff Layton
  2018-09-14 10:53 ` [PATCH v3 1/3] exec: separate thread_count for files_struct Jeff Layton
@ 2018-09-14 10:53 ` Jeff Layton
  2018-09-14 10:53 ` [PATCH v3 3/3] exec: do unshare_files after de_thread Jeff Layton
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
  3 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2018-09-14 10:53 UTC (permalink / raw)
  To: viro; +Cc: ebiederm, berrange, linux-fsdevel, linux-kernel

In a later patch, we're going to move the unshare_files call in
__do_execve_file to later in the process, but this opens up a potential
race with clone(CLONE_FILES). We could end up bumping the refcount on
the files_struct after we've checked to see if it was shared. What we
really want to do in that case is have the clone return -EAGAIN, much
like the handling of the similar situation in copy_fs.

Add a new in_exec boolean to files_struct that is protected by the
file_lock. Set it if the files_struct turns out not to be shared, and
clear at the end of __do_execve_file. If it's set when we go to
copy_process, then just return -EAGAIN.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/exec.c               | 14 ++++++++++++--
 fs/file.c               |  1 +
 include/linux/fdtable.h |  1 +
 kernel/fork.c           | 21 ++++++++++++++-------
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index e71a70d70158..c8a68481d7eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1831,8 +1831,13 @@ static int __do_execve_file(int fd, struct filename *filename,
 	kfree(pathbuf);
 	if (filename)
 		putname(filename);
-	if (displaced)
+	if (displaced) {
 		release_files_struct(displaced);
+	} else {
+		spin_lock(&current->files->file_lock);
+		current->files->in_exec = false;
+		spin_unlock(&current->files->file_lock);
+	}
 	return retval;
 
 out:
@@ -1850,8 +1855,13 @@ static int __do_execve_file(int fd, struct filename *filename,
 	kfree(pathbuf);
 
 out_files:
-	if (displaced)
+	if (displaced) {
 		reset_files_struct(displaced);
+	} else {
+		spin_lock(&current->files->file_lock);
+		current->files->in_exec = false;
+		spin_unlock(&current->files->file_lock);
+	}
 out_ret:
 	if (filename)
 		putname(filename);
diff --git a/fs/file.c b/fs/file.c
index e35eeff4593c..535321846425 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -286,6 +286,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	spin_lock_init(&newf->file_lock);
 	newf->thread_count = 1;
 	newf->resize_in_progress = false;
+	newf->in_exec = false;
 	init_waitqueue_head(&newf->resize_wait);
 	newf->next_fd = 0;
 	new_fdt = &newf->fdtab;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2cb02f438201..a09ba1b20693 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -51,6 +51,7 @@ struct files_struct {
    */
 	atomic_t count;
 	bool resize_in_progress;
+	bool in_exec;
 	wait_queue_head_t resize_wait;
 
 	struct fdtable __rcu *fdt;
diff --git a/kernel/fork.c b/kernel/fork.c
index c7eb63e4d5c1..a3786efb39f1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1372,10 +1372,15 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 		goto out;
 
 	if (clone_flags & CLONE_FILES) {
-		atomic_inc(&oldf->count);
 		spin_lock(&oldf->file_lock);
-		oldf->thread_count++;
-		spin_unlock(&oldf->file_lock);
+		if (oldf->in_exec) {
+			spin_unlock(&oldf->file_lock);
+			error = -EAGAIN;
+		} else {
+			oldf->thread_count++;
+			spin_unlock(&oldf->file_lock);
+			atomic_inc(&oldf->count);
+		}
 		goto out;
 	}
 
@@ -2419,18 +2424,20 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
 static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp)
 {
 	struct files_struct *fd = current->files;
-	int error, count;
+	int error;
 
 	if (!(unshare_flags & CLONE_FILES) || !fd)
 		return 0;
 
 	spin_lock(&fd->file_lock);
-	count = fd->thread_count;
-	spin_unlock(&fd->file_lock);
-	if (count > 1) {
+	if (fd->thread_count > 1) {
+		spin_unlock(&fd->file_lock);
 		*new_fdp = dup_fd(fd, &error);
 		if (!*new_fdp)
 			return error;
+	} else {
+		fd->in_exec = true;
+		spin_unlock(&fd->file_lock);
 	}
 
 	return 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 3/3] exec: do unshare_files after de_thread
  2018-09-14 10:53 [PATCH v3 0/3] exec: fix passing of file locks across execve in multithreaded processes Jeff Layton
  2018-09-14 10:53 ` [PATCH v3 1/3] exec: separate thread_count for files_struct Jeff Layton
  2018-09-14 10:53 ` [PATCH v3 2/3] exec: delay clone(CLONE_FILES) if task associated with current files_struct is exec'ing Jeff Layton
@ 2018-09-14 10:53 ` Jeff Layton
  2018-09-15 16:37   ` Oleg Nesterov
  2018-09-16 16:59   ` Eric W. Biederman
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
  3 siblings, 2 replies; 22+ messages in thread
From: Jeff Layton @ 2018-09-14 10:53 UTC (permalink / raw)
  To: viro; +Cc: ebiederm, berrange, linux-fsdevel, linux-kernel

POSIX mandates that open fds and their associated file locks should be
preserved across an execve. This works, unless the process is
multithreaded at the time that execve is called.

In that case, we'll end up unsharing the files_struct but the locks will
still have their fl_owner set to the address of the old one. Eventually,
when the other threads die and the last reference to the old
files_struct is put, any POSIX locks get torn down since it looks like
a close occurred on them.

The result is that all of your open files will be intact with none of
the locks you held before execve. The simple answer to this is "use OFD
locks", but this is a nasty surprise and it violates the spec.

Fix this by doing unshare_files later during exec, after we've already
killed off the other threads in the process. This helps ensure that we
only unshare the files_struct during exec when it is truly shared with
other processes.

Note that because the unshare_files call is now done just after
de_thread, we need a mechanism to pass the displaced files_struct back
up to __do_execve_file. This is done via a new displaced_files field
inside the linux_binprm.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/exec.c               | 11 ++++++-----
 include/linux/binfmts.h |  1 +
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index c8a68481d7eb..b4a7e659d908 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1262,6 +1262,10 @@ int flush_old_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
+	retval = unshare_files(&bprm->displaced_files);
+	if (retval)
+		goto out;
+
 	/*
 	 * Must be called _before_ exec_mmap() as bprm->mm is
 	 * not visibile until then. This also enables the update
@@ -1713,7 +1717,7 @@ static int __do_execve_file(int fd, struct filename *filename,
 {
 	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
-	struct files_struct *displaced;
+	struct files_struct *displaced = NULL;
 	int retval;
 
 	if (IS_ERR(filename))
@@ -1735,10 +1739,6 @@ static int __do_execve_file(int fd, struct filename *filename,
 	 * further execve() calls fail. */
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	retval = unshare_files(&displaced);
-	if (retval)
-		goto out_ret;
-
 	retval = -ENOMEM;
 	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 	if (!bprm)
@@ -1817,6 +1817,7 @@ static int __do_execve_file(int fd, struct filename *filename,
 	would_dump(bprm, bprm->file);
 
 	retval = exec_binprm(bprm);
+	displaced = bprm->displaced_files;
 	if (retval < 0)
 		goto out;
 
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index c05f24fac4f6..d7ec384bb1b0 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -49,6 +49,7 @@ struct linux_binprm {
 	unsigned int taso:1;
 #endif
 	unsigned int recursion_depth; /* only for search_binary_handler() */
+	struct files_struct * displaced_files;
 	struct file * file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/3] exec: separate thread_count for files_struct
  2018-09-14 10:53 ` [PATCH v3 1/3] exec: separate thread_count for files_struct Jeff Layton
@ 2018-09-15 16:04   ` Oleg Nesterov
  2018-09-16 16:10     ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-15 16:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: viro, ebiederm, berrange, linux-fsdevel, linux-kernel, Andrew Morton

On 09/14, Jeff Layton wrote:
>
> Currently, we have a single refcount variable inside the files_struct.
> When we go to unshare the files_struct, we check this counter and if
> it's elevated, then we allocate a new files_struct instead of just
> repurposing the old one, under the assumption that that indicates that
> it's shared between tasks.
>
> This is not necessarily the case, however. Each task associated with the
> files_struct does get a long-held reference, but the refcount can be
> elevated for other reasons as well, by callers of get_files_struct.
>
> This means that we can end up allocating a new files_struct if we just
> happen to race with a call to get_files_struct. Fix this by adding a new
> counter to track the number associated threads, and use that to
> determine whether to allocate a new files_struct when unsharing.

But who actually needs get_files_struct() ? All users except binder.c need
struct file, not struct files_struct.

See the (completely untested) patch below. It adds the new fget_task() helper
and converts all other users to use it.

As for binder.c, in this case we probably actually want to unshare ->files
on exec so we can ignore it?

Oleg.
---

 fs/file.c            |  29 ++++++++++++-
 fs/proc/fd.c         | 117 +++++++++++++++++++--------------------------------
 include/linux/file.h |   3 ++
 kernel/bpf/syscall.c |  23 +++-------
 kernel/kcmp.c        |  20 ++-------
 5 files changed, 83 insertions(+), 109 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9..a685cc0 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -676,9 +676,9 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget_files(struct files_struct *files,
+				unsigned int fd, fmode_t mask)
 {
-	struct files_struct *files = current->files;
 	struct file *file;
 
 	rcu_read_lock();
@@ -699,12 +699,37 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 	return file;
 }
 
+static inline struct file *__fget(unsigned int fd, fmode_t mask)
+{
+	return __fget_files(current->files, fd, mask);
+}
+
 struct file *fget(unsigned int fd)
 {
 	return __fget(fd, FMODE_PATH);
 }
 EXPORT_SYMBOL(fget);
 
+struct file *fget_task(struct task_struct *task, unsigned int fd)
+{
+	struct files_struct *files;
+	struct file *file = ERR_PTR(-EBADF);
+
+	task_lock(task);
+	/*
+	 * __fget_files() checks max_fds itself but we want -EBADF
+	 * if fd is too big; and files_fdtable() needs rcu lock.
+	 */
+	rcu_read_lock();
+	files = task->files;
+	if (files && fd < files_fdtable(files)->max_fds)
+		file = __fget_files(files, fd, FMODE_PATH);
+	rcu_read_unlock();
+	task_unlock(task);
+
+	return file;
+}
+
 struct file *fget_raw(unsigned int fd)
 {
 	return __fget(fd, 0);
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 81882a1..bb61890 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -19,54 +19,45 @@
 
 static int seq_show(struct seq_file *m, void *v)
 {
-	struct files_struct *files = NULL;
+	struct task_struct *task = get_proc_task(m->private);
+	unsigned int fd = proc_fd(m->private);
 	int f_flags = 0, ret = -ENOENT;
 	struct file *file = NULL;
-	struct task_struct *task;
 
-	task = get_proc_task(m->private);
 	if (!task)
-		return -ENOENT;
-
-	files = get_files_struct(task);
-	put_task_struct(task);
-
-	if (files) {
-		unsigned int fd = proc_fd(m->private);
-
-		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
-		if (file) {
-			struct fdtable *fdt = files_fdtable(files);
+		return ret;
 
-			f_flags = file->f_flags;
-			if (close_on_exec(fd, fdt))
-				f_flags |= O_CLOEXEC;
+	file = fget_task(task, fd);
+	if (IS_ERR_OR_NULL(file))
+		goto out;
 
-			get_file(file);
-			ret = 0;
-		}
-		spin_unlock(&files->file_lock);
-		put_files_struct(files);
+	/* TODO: add another helper ? */
+	task_lock(task);
+	rcu_read_lock();
+	if (task->files) {
+		struct fdtable *fdt = files_fdtable(task->files);
+		if (fd < fdt->max_fds && close_on_exec(fd, fdt))
+			f_flags |= O_CLOEXEC;
 	}
-
-	if (ret)
-		return ret;
+	rcu_read_unlock();
+	task_unlock(task);
 
 	seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
 		   (long long)file->f_pos, f_flags,
 		   real_mount(file->f_path.mnt)->mnt_id);
 
-	show_fd_locks(m, file, files);
+	/* show_fd_locks() never dereferences file, NULL is fine too */
+	show_fd_locks(m, file, task->files);
 	if (seq_has_overflowed(m))
 		goto out;
 
 	if (file->f_op->show_fdinfo)
 		file->f_op->show_fdinfo(m, file);
 
-out:
 	fput(file);
-	return 0;
+out:
+	put_task_struct(task);
+	return ret;
 }
 
 static int seq_fdinfo_open(struct inode *inode, struct file *file)
@@ -83,19 +74,14 @@ static const struct file_operations proc_fdinfo_file_operations = {
 
 static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
 {
-	struct files_struct *files = get_files_struct(task);
-	struct file *file;
+	struct file *file = fget_task(task, fd);
 
-	if (!files)
+	if (IS_ERR_OR_NULL(file))
 		return false;
 
-	rcu_read_lock();
-	file = fcheck_files(files, fd);
-	if (file)
-		*mode = file->f_mode;
-	rcu_read_unlock();
-	put_files_struct(files);
-	return !!file;
+	*mode = file->f_mode;
+	fput(file);
+	return true;
 }
 
 static void tid_fd_update_inode(struct task_struct *task, struct inode *inode,
@@ -146,31 +132,24 @@ static const struct dentry_operations tid_fd_dentry_operations = {
 
 static int proc_fd_link(struct dentry *dentry, struct path *path)
 {
-	struct files_struct *files = NULL;
-	struct task_struct *task;
+	struct task_struct *task = get_proc_task(d_inode(dentry));
+	unsigned int fd = proc_fd(d_inode(dentry));
+	struct file *fd_file;
 	int ret = -ENOENT;
 
 	task = get_proc_task(d_inode(dentry));
-	if (task) {
-		files = get_files_struct(task);
-		put_task_struct(task);
-	}
-
-	if (files) {
-		unsigned int fd = proc_fd(d_inode(dentry));
-		struct file *fd_file;
+	if (!task)
+		return ret;
 
-		spin_lock(&files->file_lock);
-		fd_file = fcheck_files(files, fd);
-		if (fd_file) {
-			*path = fd_file->f_path;
-			path_get(&fd_file->f_path);
-			ret = 0;
-		}
-		spin_unlock(&files->file_lock);
-		put_files_struct(files);
+	fd_file = fget_task(task, fd);
+	if (!IS_ERR_OR_NULL(fd_file)) {
+		*path = fd_file->f_path;
+		path_get(&fd_file->f_path);
+		fput(fd_file);
+		ret = 0;
 	}
 
+	put_task_struct(task);
 	return ret;
 }
 
@@ -229,7 +208,6 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 			      instantiate_t instantiate)
 {
 	struct task_struct *p = get_proc_task(file_inode(file));
-	struct files_struct *files;
 	unsigned int fd;
 
 	if (!p)
@@ -237,37 +215,30 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 
 	if (!dir_emit_dots(file, ctx))
 		goto out;
-	files = get_files_struct(p);
-	if (!files)
-		goto out;
 
-	rcu_read_lock();
-	for (fd = ctx->pos - 2;
-	     fd < files_fdtable(files)->max_fds;
-	     fd++, ctx->pos++) {
+	for (fd = ctx->pos - 2;; fd++, ctx->pos++) {
 		struct file *f;
 		struct fd_data data;
 		char name[10 + 1];
 		unsigned int len;
 
-		f = fcheck_files(files, fd);
+		f = fget_task(p, fd);
 		if (!f)
 			continue;
+		if (IS_ERR(f))
+			break;
+
 		data.mode = f->f_mode;
-		rcu_read_unlock();
 		data.fd = fd;
+		fput(f);
 
 		len = snprintf(name, sizeof(name), "%u", fd);
 		if (!proc_fill_cache(file, ctx,
 				     name, len, instantiate, p,
 				     &data))
-			goto out_fd_loop;
+			goto out;
 		cond_resched();
-		rcu_read_lock();
 	}
-	rcu_read_unlock();
-out_fd_loop:
-	put_files_struct(files);
 out:
 	put_task_struct(p);
 	return 0;
diff --git a/include/linux/file.h b/include/linux/file.h
index 6b2fb03..8b7abdb 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -43,6 +43,9 @@ static inline void fdput(struct fd fd)
 		fput(fd.file);
 }
 
+struct task_struct;
+extern struct file *fget_task(struct task_struct *task, unsigned int fd);
+
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81..2cbf26d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2260,7 +2260,6 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	pid_t pid = attr->task_fd_query.pid;
 	u32 fd = attr->task_fd_query.fd;
 	const struct perf_event *event;
-	struct files_struct *files;
 	struct task_struct *task;
 	struct file *file;
 	int err;
@@ -2278,24 +2277,13 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 	if (!task)
 		return -ENOENT;
 
-	files = get_files_struct(task);
-	put_task_struct(task);
-	if (!files)
-		return -ENOENT;
-
-	err = 0;
-	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
-	if (!file)
-		err = -EBADF;
-	else
-		get_file(file);
-	spin_unlock(&files->file_lock);
-	put_files_struct(files);
-
-	if (err)
+	err = -EBADF;
+	file = fget_task(task, fd);
+	if (IS_ERR_OR_NULL(file))
 		goto out;
 
+	err = -ENOTSUPP;
+
 	if (file->f_op == &bpf_raw_tp_fops) {
 		struct bpf_raw_tracepoint *raw_tp = file->private_data;
 		struct bpf_raw_event_map *btp = raw_tp->btp;
@@ -2324,7 +2312,6 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
 		goto put_file;
 	}
 
-	err = -ENOTSUPP;
 put_file:
 	fput(file);
 out:
diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index a0e3d7a..067639e 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -107,7 +107,6 @@ static int kcmp_epoll_target(struct task_struct *task1,
 {
 	struct file *filp, *filp_epoll, *filp_tgt;
 	struct kcmp_epoll_slot slot;
-	struct files_struct *files;
 
 	if (copy_from_user(&slot, uslot, sizeof(slot)))
 		return -EFAULT;
@@ -116,23 +115,12 @@ static int kcmp_epoll_target(struct task_struct *task1,
 	if (!filp)
 		return -EBADF;
 
-	files = get_files_struct(task2);
-	if (!files)
+	filp_epoll = fget_task(task2, slot.efd);
+	if (IS_ERR_OR_NULL(filp_epoll))
 		return -EBADF;
 
-	spin_lock(&files->file_lock);
-	filp_epoll = fcheck_files(files, slot.efd);
-	if (filp_epoll)
-		get_file(filp_epoll);
-	else
-		filp_tgt = ERR_PTR(-EBADF);
-	spin_unlock(&files->file_lock);
-	put_files_struct(files);
-
-	if (filp_epoll) {
-		filp_tgt = get_epoll_tfile_raw_ptr(filp_epoll, slot.tfd, slot.toff);
-		fput(filp_epoll);
-	}
+	filp_tgt = get_epoll_tfile_raw_ptr(filp_epoll, slot.tfd, slot.toff);
+	fput(filp_epoll);
 
 	if (IS_ERR(filp_tgt))
 		return PTR_ERR(filp_tgt);


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/3] exec: do unshare_files after de_thread
  2018-09-14 10:53 ` [PATCH v3 3/3] exec: do unshare_files after de_thread Jeff Layton
@ 2018-09-15 16:37   ` Oleg Nesterov
  2018-09-16 16:49     ` Eric W. Biederman
  2018-09-16 16:59   ` Eric W. Biederman
  1 sibling, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-15 16:37 UTC (permalink / raw)
  To: Jeff Layton
  Cc: viro, ebiederm, berrange, linux-fsdevel, linux-kernel, Andrew Morton

On 09/14, Jeff Layton wrote:
>
> POSIX mandates that open fds and their associated file locks should be
> preserved across an execve. This works, unless the process is
> multithreaded at the time that execve is called.
>
> In that case, we'll end up unsharing the files_struct but the locks will
> still have their fl_owner set to the address of the old one. Eventually,
> when the other threads die and the last reference to the old
> files_struct is put, any POSIX locks get torn down since it looks like
> a close occurred on them.
>
> The result is that all of your open files will be intact with none of
> the locks you held before execve. The simple answer to this is "use OFD
> locks", but this is a nasty surprise and it violates the spec.
>
> Fix this by doing unshare_files later during exec,

See my reply to 1/3... if we can forget about the races with get_files_struct()
we can probably make a much simpler patch, plus we do not need 2/2, afaics.

What I really can't understand is why we need to _change_ current->files
early in do_execve().

IOW. Lets ignore do_close_on_exec(), lets ignore the fact that unshare_fd()
can fail and thus it makes sense to call it before point-of-no-return.

Any other reason why we can't simply call unshare_files() at the end of
__do_execve_file() on success?

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/3] exec: separate thread_count for files_struct
  2018-09-15 16:04   ` Oleg Nesterov
@ 2018-09-16 16:10     ` Eric W. Biederman
  2018-09-17 15:24       ` Oleg Nesterov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 16:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> On 09/14, Jeff Layton wrote:
>>
>> Currently, we have a single refcount variable inside the files_struct.
>> When we go to unshare the files_struct, we check this counter and if
>> it's elevated, then we allocate a new files_struct instead of just
>> repurposing the old one, under the assumption that that indicates that
>> it's shared between tasks.
>>
>> This is not necessarily the case, however. Each task associated with the
>> files_struct does get a long-held reference, but the refcount can be
>> elevated for other reasons as well, by callers of get_files_struct.
>>
>> This means that we can end up allocating a new files_struct if we just
>> happen to race with a call to get_files_struct. Fix this by adding a new
>> counter to track the number associated threads, and use that to
>> determine whether to allocate a new files_struct when unsharing.
>
> But who actually needs get_files_struct() ? All users except binder.c need
> struct file, not struct files_struct.

I think the difficult case is flock64_to_posix_lock.  Where it sets
fl_owner to current->files.

There is also mandatory locking.

Ah.  I see you are talking about the cases that increment the count on
files_struct.  So that a count on files_struct is unambiguously the
number of threads sharing it.

> See the (completely untested) patch below. It adds the new fget_task() helper
> and converts all other users to use it.
>
> As for binder.c, in this case we probably actually want to unshare ->files
> on exec so we can ignore it?

Looking at the binder case it only captures ->files on mmap.  Exec
ditches the mmap.  So if the order of operations are correct than
the dropping of the old mm will also drop the count on files_struct
held by binder.

So semantically binder should not effect locks on exec, nor should
binder effect the files_count.  So I think as long as our order of
operations are correct we are good.

I have concerns about binder.  It is not pid namespace safe.
It assumes but does not enforce all threads share a single fs struct.

Binder at least limits mmapers to the thread group of the opener of
the binder file.  This limits the worst of the shenanighans with
file descriptor passing that I can imagine.

In short as long as we get the oder of operations correct we should be
able to safely ignore binder, and not have binder affect the results of
this code.

Eric

> Oleg.
> ---
>
>  fs/file.c            |  29 ++++++++++++-
>  fs/proc/fd.c         | 117 +++++++++++++++++++--------------------------------
>  include/linux/file.h |   3 ++
>  kernel/bpf/syscall.c |  23 +++-------
>  kernel/kcmp.c        |  20 ++-------
>  5 files changed, 83 insertions(+), 109 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 7ffd6e9..a685cc0 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -676,9 +676,9 @@ void do_close_on_exec(struct files_struct *files)
>  	spin_unlock(&files->file_lock);
>  }
>  
> -static struct file *__fget(unsigned int fd, fmode_t mask)
> +static struct file *__fget_files(struct files_struct *files,
> +				unsigned int fd, fmode_t mask)
>  {
> -	struct files_struct *files = current->files;
>  	struct file *file;
>  
>  	rcu_read_lock();
> @@ -699,12 +699,37 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
>  	return file;
>  }
>  
> +static inline struct file *__fget(unsigned int fd, fmode_t mask)
> +{
> +	return __fget_files(current->files, fd, mask);
> +}
> +
>  struct file *fget(unsigned int fd)
>  {
>  	return __fget(fd, FMODE_PATH);
>  }
>  EXPORT_SYMBOL(fget);
>  
> +struct file *fget_task(struct task_struct *task, unsigned int fd)
> +{
> +	struct files_struct *files;
> +	struct file *file = ERR_PTR(-EBADF);
> +
> +	task_lock(task);
> +	/*
> +	 * __fget_files() checks max_fds itself but we want -EBADF
> +	 * if fd is too big; and files_fdtable() needs rcu lock.
> +	 */
> +	rcu_read_lock();
> +	files = task->files;
> +	if (files && fd < files_fdtable(files)->max_fds)
> +		file = __fget_files(files, fd, FMODE_PATH);
> +	rcu_read_unlock();
> +	task_unlock(task);
> +
> +	return file;
> +}
> +
>  struct file *fget_raw(unsigned int fd)
>  {
>  	return __fget(fd, 0);
> diff --git a/fs/proc/fd.c b/fs/proc/fd.c
> index 81882a1..bb61890 100644
> --- a/fs/proc/fd.c
> +++ b/fs/proc/fd.c
> @@ -19,54 +19,45 @@
>  
>  static int seq_show(struct seq_file *m, void *v)
>  {
> -	struct files_struct *files = NULL;
> +	struct task_struct *task = get_proc_task(m->private);
> +	unsigned int fd = proc_fd(m->private);
>  	int f_flags = 0, ret = -ENOENT;
>  	struct file *file = NULL;
> -	struct task_struct *task;
>  
> -	task = get_proc_task(m->private);
>  	if (!task)
> -		return -ENOENT;
> -
> -	files = get_files_struct(task);
> -	put_task_struct(task);
> -
> -	if (files) {
> -		unsigned int fd = proc_fd(m->private);
> -
> -		spin_lock(&files->file_lock);
> -		file = fcheck_files(files, fd);
> -		if (file) {
> -			struct fdtable *fdt = files_fdtable(files);
> +		return ret;
>  
> -			f_flags = file->f_flags;
> -			if (close_on_exec(fd, fdt))
> -				f_flags |= O_CLOEXEC;
> +	file = fget_task(task, fd);
> +	if (IS_ERR_OR_NULL(file))
> +		goto out;
>  
> -			get_file(file);
> -			ret = 0;
> -		}
> -		spin_unlock(&files->file_lock);
> -		put_files_struct(files);
> +	/* TODO: add another helper ? */
> +	task_lock(task);
> +	rcu_read_lock();
> +	if (task->files) {
> +		struct fdtable *fdt = files_fdtable(task->files);
> +		if (fd < fdt->max_fds && close_on_exec(fd, fdt))
> +			f_flags |= O_CLOEXEC;
>  	}
> -
> -	if (ret)
> -		return ret;
> +	rcu_read_unlock();
> +	task_unlock(task);
>  
>  	seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
>  		   (long long)file->f_pos, f_flags,
>  		   real_mount(file->f_path.mnt)->mnt_id);
>  
> -	show_fd_locks(m, file, files);
> +	/* show_fd_locks() never dereferences file, NULL is fine too */
> +	show_fd_locks(m, file, task->files);
>  	if (seq_has_overflowed(m))
>  		goto out;
>  
>  	if (file->f_op->show_fdinfo)
>  		file->f_op->show_fdinfo(m, file);
>  
> -out:
>  	fput(file);
> -	return 0;
> +out:
> +	put_task_struct(task);
> +	return ret;
>  }
>  
>  static int seq_fdinfo_open(struct inode *inode, struct file *file)
> @@ -83,19 +74,14 @@ static const struct file_operations proc_fdinfo_file_operations = {
>  
>  static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
>  {
> -	struct files_struct *files = get_files_struct(task);
> -	struct file *file;
> +	struct file *file = fget_task(task, fd);
>  
> -	if (!files)
> +	if (IS_ERR_OR_NULL(file))
>  		return false;
>  
> -	rcu_read_lock();
> -	file = fcheck_files(files, fd);
> -	if (file)
> -		*mode = file->f_mode;
> -	rcu_read_unlock();
> -	put_files_struct(files);
> -	return !!file;
> +	*mode = file->f_mode;
> +	fput(file);
> +	return true;
>  }
>  
>  static void tid_fd_update_inode(struct task_struct *task, struct inode *inode,
> @@ -146,31 +132,24 @@ static const struct dentry_operations tid_fd_dentry_operations = {
>  
>  static int proc_fd_link(struct dentry *dentry, struct path *path)
>  {
> -	struct files_struct *files = NULL;
> -	struct task_struct *task;
> +	struct task_struct *task = get_proc_task(d_inode(dentry));
> +	unsigned int fd = proc_fd(d_inode(dentry));
> +	struct file *fd_file;
>  	int ret = -ENOENT;
>  
>  	task = get_proc_task(d_inode(dentry));
> -	if (task) {
> -		files = get_files_struct(task);
> -		put_task_struct(task);
> -	}
> -
> -	if (files) {
> -		unsigned int fd = proc_fd(d_inode(dentry));
> -		struct file *fd_file;
> +	if (!task)
> +		return ret;
>  
> -		spin_lock(&files->file_lock);
> -		fd_file = fcheck_files(files, fd);
> -		if (fd_file) {
> -			*path = fd_file->f_path;
> -			path_get(&fd_file->f_path);
> -			ret = 0;
> -		}
> -		spin_unlock(&files->file_lock);
> -		put_files_struct(files);
> +	fd_file = fget_task(task, fd);
> +	if (!IS_ERR_OR_NULL(fd_file)) {
> +		*path = fd_file->f_path;
> +		path_get(&fd_file->f_path);
> +		fput(fd_file);
> +		ret = 0;
>  	}
>  
> +	put_task_struct(task);
>  	return ret;
>  }
>  
> @@ -229,7 +208,6 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
>  			      instantiate_t instantiate)
>  {
>  	struct task_struct *p = get_proc_task(file_inode(file));
> -	struct files_struct *files;
>  	unsigned int fd;
>  
>  	if (!p)
> @@ -237,37 +215,30 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
>  
>  	if (!dir_emit_dots(file, ctx))
>  		goto out;
> -	files = get_files_struct(p);
> -	if (!files)
> -		goto out;
>  
> -	rcu_read_lock();
> -	for (fd = ctx->pos - 2;
> -	     fd < files_fdtable(files)->max_fds;
> -	     fd++, ctx->pos++) {
> +	for (fd = ctx->pos - 2;; fd++, ctx->pos++) {
>  		struct file *f;
>  		struct fd_data data;
>  		char name[10 + 1];
>  		unsigned int len;
>  
> -		f = fcheck_files(files, fd);
> +		f = fget_task(p, fd);
>  		if (!f)
>  			continue;
> +		if (IS_ERR(f))
> +			break;
> +
>  		data.mode = f->f_mode;
> -		rcu_read_unlock();
>  		data.fd = fd;
> +		fput(f);
>  
>  		len = snprintf(name, sizeof(name), "%u", fd);
>  		if (!proc_fill_cache(file, ctx,
>  				     name, len, instantiate, p,
>  				     &data))
> -			goto out_fd_loop;
> +			goto out;
>  		cond_resched();
> -		rcu_read_lock();
>  	}
> -	rcu_read_unlock();
> -out_fd_loop:
> -	put_files_struct(files);
>  out:
>  	put_task_struct(p);
>  	return 0;
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 6b2fb03..8b7abdb 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -43,6 +43,9 @@ static inline void fdput(struct fd fd)
>  		fput(fd.file);
>  }
>  
> +struct task_struct;
> +extern struct file *fget_task(struct task_struct *task, unsigned int fd);
> +
>  extern struct file *fget(unsigned int fd);
>  extern struct file *fget_raw(unsigned int fd);
>  extern unsigned long __fdget(unsigned int fd);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8339d81..2cbf26d 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2260,7 +2260,6 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
>  	pid_t pid = attr->task_fd_query.pid;
>  	u32 fd = attr->task_fd_query.fd;
>  	const struct perf_event *event;
> -	struct files_struct *files;
>  	struct task_struct *task;
>  	struct file *file;
>  	int err;
> @@ -2278,24 +2277,13 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
>  	if (!task)
>  		return -ENOENT;
>  
> -	files = get_files_struct(task);
> -	put_task_struct(task);
> -	if (!files)
> -		return -ENOENT;
> -
> -	err = 0;
> -	spin_lock(&files->file_lock);
> -	file = fcheck_files(files, fd);
> -	if (!file)
> -		err = -EBADF;
> -	else
> -		get_file(file);
> -	spin_unlock(&files->file_lock);
> -	put_files_struct(files);
> -
> -	if (err)
> +	err = -EBADF;
> +	file = fget_task(task, fd);
> +	if (IS_ERR_OR_NULL(file))
>  		goto out;
>  
> +	err = -ENOTSUPP;
> +
>  	if (file->f_op == &bpf_raw_tp_fops) {
>  		struct bpf_raw_tracepoint *raw_tp = file->private_data;
>  		struct bpf_raw_event_map *btp = raw_tp->btp;
> @@ -2324,7 +2312,6 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
>  		goto put_file;
>  	}
>  
> -	err = -ENOTSUPP;
>  put_file:
>  	fput(file);
>  out:
> diff --git a/kernel/kcmp.c b/kernel/kcmp.c
> index a0e3d7a..067639e 100644
> --- a/kernel/kcmp.c
> +++ b/kernel/kcmp.c
> @@ -107,7 +107,6 @@ static int kcmp_epoll_target(struct task_struct *task1,
>  {
>  	struct file *filp, *filp_epoll, *filp_tgt;
>  	struct kcmp_epoll_slot slot;
> -	struct files_struct *files;
>  
>  	if (copy_from_user(&slot, uslot, sizeof(slot)))
>  		return -EFAULT;
> @@ -116,23 +115,12 @@ static int kcmp_epoll_target(struct task_struct *task1,
>  	if (!filp)
>  		return -EBADF;
>  
> -	files = get_files_struct(task2);
> -	if (!files)
> +	filp_epoll = fget_task(task2, slot.efd);
> +	if (IS_ERR_OR_NULL(filp_epoll))
>  		return -EBADF;
>  
> -	spin_lock(&files->file_lock);
> -	filp_epoll = fcheck_files(files, slot.efd);
> -	if (filp_epoll)
> -		get_file(filp_epoll);
> -	else
> -		filp_tgt = ERR_PTR(-EBADF);
> -	spin_unlock(&files->file_lock);
> -	put_files_struct(files);
> -
> -	if (filp_epoll) {
> -		filp_tgt = get_epoll_tfile_raw_ptr(filp_epoll, slot.tfd, slot.toff);
> -		fput(filp_epoll);
> -	}
> +	filp_tgt = get_epoll_tfile_raw_ptr(filp_epoll, slot.tfd, slot.toff);
> +	fput(filp_epoll);
>  
>  	if (IS_ERR(filp_tgt))
>  		return PTR_ERR(filp_tgt);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/3] exec: do unshare_files after de_thread
  2018-09-15 16:37   ` Oleg Nesterov
@ 2018-09-16 16:49     ` Eric W. Biederman
  2018-09-17 15:28       ` Oleg Nesterov
  0 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 16:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> On 09/14, Jeff Layton wrote:
>>
>> POSIX mandates that open fds and their associated file locks should be
>> preserved across an execve. This works, unless the process is
>> multithreaded at the time that execve is called.
>>
>> In that case, we'll end up unsharing the files_struct but the locks will
>> still have their fl_owner set to the address of the old one. Eventually,
>> when the other threads die and the last reference to the old
>> files_struct is put, any POSIX locks get torn down since it looks like
>> a close occurred on them.
>>
>> The result is that all of your open files will be intact with none of
>> the locks you held before execve. The simple answer to this is "use OFD
>> locks", but this is a nasty surprise and it violates the spec.
>>
>> Fix this by doing unshare_files later during exec,
>
> See my reply to 1/3... if we can forget about the races with get_files_struct()
> we can probably make a much simpler patch, plus we do not need 2/2, afaics.
>
> What I really can't understand is why we need to _change_ current->files
> early in do_execve().
>
> IOW. Lets ignore do_close_on_exec(), lets ignore the fact that unshare_fd()
> can fail and thus it makes sense to call it before point-of-no-return.
>
> Any other reason why we can't simply call unshare_files() at the end of
> __do_execve_file() on success?

The reason we call we call unshare_files is in case the files are shared
with another process.  AKA old style linux threads, or someone being
clever.  In that case we need a private copy of files for close on exec
because we should not close the files of the other process that has not
called exec.

The only reason for calling unshare_files before the point of no return
is so that we can get a good error message to the calling process if
unshare_files fails.

Given that "files->count > 1" should only exist in rare and crazy cases.
I expect we can legitimately have exec fail hard if we -ENOMEM in that
case and kill the calling process.

AKA it would be reasonable to move unshare_files to just above
do_close_on_exec in flush_old_exec.  We could further make the
unshare_files not return displaced and just drop it.

Thinking about Jeff's version already by necessity places unshare_files
after de_thread.  So it is already after the point of no return.  So
there really is no point in getting trying hard with displaced files.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/3] exec: do unshare_files after de_thread
  2018-09-14 10:53 ` [PATCH v3 3/3] exec: do unshare_files after de_thread Jeff Layton
  2018-09-15 16:37   ` Oleg Nesterov
@ 2018-09-16 16:59   ` Eric W. Biederman
  1 sibling, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 16:59 UTC (permalink / raw)
  To: Jeff Layton; +Cc: viro, berrange, linux-fsdevel, linux-kernel

Jeff Layton <jlayton@kernel.org> writes:

> POSIX mandates that open fds and their associated file locks should be
> preserved across an execve. This works, unless the process is
> multithreaded at the time that execve is called.
>
> In that case, we'll end up unsharing the files_struct but the locks will
> still have their fl_owner set to the address of the old one. Eventually,
> when the other threads die and the last reference to the old
> files_struct is put, any POSIX locks get torn down since it looks like
> a close occurred on them.
>
> The result is that all of your open files will be intact with none of
> the locks you held before execve. The simple answer to this is "use OFD
> locks", but this is a nasty surprise and it violates the spec.
>
> Fix this by doing unshare_files later during exec, after we've already
> killed off the other threads in the process. This helps ensure that we
> only unshare the files_struct during exec when it is truly shared with
> other processes.
>
> Note that because the unshare_files call is now done just after
> de_thread, we need a mechanism to pass the displaced files_struct back
> up to __do_execve_file. This is done via a new displaced_files field
> inside the linux_binprm.

Actually because unshare_files happens after de_thread (the point of no
return) we don't need to do anything with displaced files.  A failing
exec will clear brpm->mm, and search_binary_handler will convert any
error into force_sigsegv.

> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Reported-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/exec.c               | 11 ++++++-----
>  include/linux/binfmts.h |  1 +
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index c8a68481d7eb..b4a7e659d908 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1262,6 +1262,10 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	if (retval)
>  		goto out;
>  
> +	retval = unshare_files(&bprm->displaced_files);
> +	if (retval)
> +		goto out;
> +

We can place this just above do_close_on_exec.  And simply modify
unshare_files to instead of returning displaced simply put it.

We gain nothing be returning displaced_files as we are already past
the point of no return, and userspace will get an unblockable SIGSEGV
and exit.

>  	/*
>  	 * Must be called _before_ exec_mmap() as bprm->mm is
>  	 * not visibile until then. This also enables the update
> @@ -1713,7 +1717,7 @@ static int __do_execve_file(int fd, struct filename *filename,
>  {
>  	char *pathbuf = NULL;
>  	struct linux_binprm *bprm;
> -	struct files_struct *displaced;
> +	struct files_struct *displaced = NULL;
>  	int retval;
>  
>  	if (IS_ERR(filename))
> @@ -1735,10 +1739,6 @@ static int __do_execve_file(int fd, struct filename *filename,
>  	 * further execve() calls fail. */
>  	current->flags &= ~PF_NPROC_EXCEEDED;
>  
> -	retval = unshare_files(&displaced);
> -	if (retval)
> -		goto out_ret;
> -
>  	retval = -ENOMEM;
>  	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
>  	if (!bprm)
> @@ -1817,6 +1817,7 @@ static int __do_execve_file(int fd, struct filename *filename,
>  	would_dump(bprm, bprm->file);
>  
>  	retval = exec_binprm(bprm);
> +	displaced = bprm->displaced_files;
>  	if (retval < 0)
>  		goto out;
>  
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index c05f24fac4f6..d7ec384bb1b0 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -49,6 +49,7 @@ struct linux_binprm {
>  	unsigned int taso:1;
>  #endif
>  	unsigned int recursion_depth; /* only for search_binary_handler() */
> +	struct files_struct * displaced_files;
>  	struct file * file;
>  	struct cred *cred;	/* new credentials */
>  	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 0/3] exec: Moving unshare_files_struct
  2018-09-14 10:53 [PATCH v3 0/3] exec: fix passing of file locks across execve in multithreaded processes Jeff Layton
                   ` (2 preceding siblings ...)
  2018-09-14 10:53 ` [PATCH v3 3/3] exec: do unshare_files after de_thread Jeff Layton
@ 2018-09-16 17:38 ` Eric W. Biederman
  2018-09-16 17:39   ` [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec Eric W. Biederman
                     ` (4 more replies)
  3 siblings, 5 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 17:38 UTC (permalink / raw)
  To: Jeff Layton; +Cc: viro, berrange, linux-fsdevel, linux-kernel, Oleg Nesterov


Paired with Oleg's patch to reduce the number of callers of
get_files_struct it looks like we can simplify the basic idea of moving
unshare_files in exec by quite a bit so that in net we have fewer lines
of code.

The big simplification from Jeff's verion is that we take advantage
of calling unshare_files past the point of no return.  Which removes
the need for cleanup, and restoring ->files.  Which removes the
need for blocking clone and unshare.

Oleg's patch to remove get_files_struct from proc means we don't need
two counts in files_struct.

Which leaves us with the question of what are the races in fs/exec.c
with respect to accessing files.  Semantically I don't think we care
but we do need to be certain the implementation of exec is still robust.

These patches are still rough and ready and only compile tested but I
believe they demonstrate what is possible.

Eric W. Biederman (3):
      exec: Move unshare_files down to avoid locks being dropped on exec.
      exec: Simplify unshare_files
      exec: Remove reset_files_struct

 fs/coredump.c           |  5 +----
 fs/exec.c               | 16 +++++-----------
 fs/file.c               | 12 ------------
 include/linux/fdtable.h |  3 +--
 kernel/fork.c           | 12 ++++++------
 5 files changed, 13 insertions(+), 35 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec.
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
@ 2018-09-16 17:39   ` Eric W. Biederman
  2018-09-17 15:49     ` Oleg Nesterov
  2018-09-16 17:40   ` [RFC][PATCH 2/3] exec: Simplify unshare_files Eric W. Biederman
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 17:39 UTC (permalink / raw)
  To: Jeff Layton; +Cc: viro, berrange, linux-fsdevel, linux-kernel, Oleg Nesterov


Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1ebf6e5a521d..6f6167ec08eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1252,6 +1252,7 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
+	struct files_struct *displaced;
 	int retval;
 
 	/*
@@ -1291,6 +1292,12 @@ int flush_old_exec(struct linux_binprm * bprm)
 	flush_thread();
 	current->personality &= ~bprm->per_clear;
 
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out;
+	if (displaced)
+		put_files_struct(displaced);
+
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
 	 * dumpable (in setup_new_exec) to avoid a race with a process in userspace
@@ -1713,7 +1720,6 @@ static int __do_execve_file(int fd, struct filename *filename,
 {
 	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
-	struct files_struct *displaced;
 	int retval;
 
 	if (IS_ERR(filename))
@@ -1735,14 +1741,10 @@ static int __do_execve_file(int fd, struct filename *filename,
 	 * further execve() calls fail. */
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	retval = unshare_files(&displaced);
-	if (retval)
-		goto out_ret;
-
 	retval = -ENOMEM;
 	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
 	if (!bprm)
-		goto out_files;
+		goto out_ret;
 
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
@@ -1831,8 +1833,6 @@ static int __do_execve_file(int fd, struct filename *filename,
 	kfree(pathbuf);
 	if (filename)
 		putname(filename);
-	if (displaced)
-		put_files_struct(displaced);
 	return retval;
 
 out:
@@ -1849,9 +1849,6 @@ static int __do_execve_file(int fd, struct filename *filename,
 	free_bprm(bprm);
 	kfree(pathbuf);
 
-out_files:
-	if (displaced)
-		reset_files_struct(displaced);
 out_ret:
 	if (filename)
 		putname(filename);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC][PATCH 2/3] exec: Simplify unshare_files
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
  2018-09-16 17:39   ` [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec Eric W. Biederman
@ 2018-09-16 17:40   ` Eric W. Biederman
  2018-09-17 16:23     ` Oleg Nesterov
  2018-09-16 17:41   ` [RFC][PATCH 3/3] exec: Remove reset_files_struct Eric W. Biederman
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 17:40 UTC (permalink / raw)
  To: Jeff Layton; +Cc: viro, berrange, linux-fsdevel, linux-kernel, Oleg Nesterov


Now that exec calls unshare_files after the point of no return there
is no reason to return displaced.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c           |  5 +----
 fs/exec.c               |  5 +----
 include/linux/fdtable.h |  2 +-
 kernel/fork.c           | 12 ++++++------
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 1e2c87acac9b..968ee5744bf9 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -546,7 +546,6 @@ void do_coredump(const siginfo_t *siginfo)
 	struct cred *cred;
 	int retval = 0;
 	int ispipe;
-	struct files_struct *displaced;
 	/* require nonrelative corefile path and be extra careful */
 	bool need_suid_safe = false;
 	bool core_dumped = false;
@@ -747,11 +746,9 @@ void do_coredump(const siginfo_t *siginfo)
 	}
 
 	/* get us an unshared descriptor table; almost always a no-op */
-	retval = unshare_files(&displaced);
+	retval = unshare_files();
 	if (retval)
 		goto close_fail;
-	if (displaced)
-		put_files_struct(displaced);
 	if (!dump_interrupted()) {
 		file_start_write(cprm.file);
 		core_dumped = binfmt->core_dump(&cprm);
diff --git a/fs/exec.c b/fs/exec.c
index 6f6167ec08eb..7eeffa7d98c6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1252,7 +1252,6 @@ void __set_task_comm(struct task_struct *tsk, const char *buf, bool exec)
  */
 int flush_old_exec(struct linux_binprm * bprm)
 {
-	struct files_struct *displaced;
 	int retval;
 
 	/*
@@ -1292,11 +1291,9 @@ int flush_old_exec(struct linux_binprm * bprm)
 	flush_thread();
 	current->personality &= ~bprm->per_clear;
 
-	retval = unshare_files(&displaced);
+	retval = unshare_files();
 	if (retval)
 		goto out;
-	if (displaced)
-		put_files_struct(displaced);
 
 	/*
 	 * We have to apply CLOEXEC before we change whether the process is
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 41615f38bcff..e65379336c4c 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -108,7 +108,7 @@ struct task_struct;
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
 void reset_files_struct(struct files_struct *);
-int unshare_files(struct files_struct **);
+int unshare_files(void);
 struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
 int iterate_fd(struct files_struct *, unsigned,
diff --git a/kernel/fork.c b/kernel/fork.c
index d896e9ca38b0..a06a609075eb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2564,21 +2564,21 @@ SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
  *	the exec layer of the kernel.
  */
 
-int unshare_files(struct files_struct **displaced)
+int unshare_files(void)
 {
 	struct task_struct *task = current;
-	struct files_struct *copy = NULL;
+	struct files_struct *files, *copy = NULL;
 	int error;
 
 	error = unshare_fd(CLONE_FILES, &copy);
-	if (error || !copy) {
-		*displaced = NULL;
+	if (error || !copy)
 		return error;
-	}
-	*displaced = task->files;
+
+	files = task->files;
 	task_lock(task);
 	task->files = copy;
 	task_unlock(task);
+	put_files_struct(files);
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC][PATCH 3/3] exec: Remove reset_files_struct
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
  2018-09-16 17:39   ` [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec Eric W. Biederman
  2018-09-16 17:40   ` [RFC][PATCH 2/3] exec: Simplify unshare_files Eric W. Biederman
@ 2018-09-16 17:41   ` Eric W. Biederman
  2018-09-17 15:59   ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Oleg Nesterov
  2018-09-17 16:24   ` Jeff Layton
  4 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-16 17:41 UTC (permalink / raw)
  To: Jeff Layton; +Cc: viro, berrange, linux-fsdevel, linux-kernel, Oleg Nesterov


Now that unshare_files is called after the point of no return there
are no more callers of reset_files_struct so remove it.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/file.c               | 12 ------------
 include/linux/fdtable.h |  1 -
 2 files changed, 13 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..eed29e034a47 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -422,18 +422,6 @@ void put_files_struct(struct files_struct *files)
 	}
 }
 
-void reset_files_struct(struct files_struct *files)
-{
-	struct task_struct *tsk = current;
-	struct files_struct *old;
-
-	old = tsk->files;
-	task_lock(tsk);
-	tsk->files = files;
-	task_unlock(tsk);
-	put_files_struct(old);
-}
-
 void exit_files(struct task_struct *tsk)
 {
 	struct files_struct * files = tsk->files;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e65379336c4c..68858f620e87 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -107,7 +107,6 @@ struct task_struct;
 
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
-void reset_files_struct(struct files_struct *);
 int unshare_files(void);
 struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
 void do_close_on_exec(struct files_struct *);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/3] exec: separate thread_count for files_struct
  2018-09-16 16:10     ` Eric W. Biederman
@ 2018-09-17 15:24       ` Oleg Nesterov
  2018-09-17 20:45         ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-17 15:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel, Andrew Morton

On 09/16, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > As for binder.c, in this case we probably actually want to unshare ->files
> > on exec so we can ignore it?
>
> Looking at the binder case it only captures ->files on mmap.  Exec
> ditches the mmap.  So if the order of operations are correct than
> the dropping of the old mm will also drop the count on files_struct
> held by binder.
>
> So semantically binder should not effect locks on exec,

Agreed, but it does.

Before your "[PATCH 0/3] exec: Moving unshare_files_struct" unshare_files()
is called before exec_mmap().

And even with this series we can have another CLONE_VM process.

Howver, I think this doesn't really matter. binder does __fd_install(files),
so if it actually has a reference to execing_task->files, I think it should
be unshared anyway.

> In short as long as we get the oder of operations correct we should be
> able to safely ignore binder, and not have binder affect the results of
> this code.

Agreed.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/3] exec: do unshare_files after de_thread
  2018-09-16 16:49     ` Eric W. Biederman
@ 2018-09-17 15:28       ` Oleg Nesterov
  0 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-17 15:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel, Andrew Morton

On 09/16, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > IOW. Lets ignore do_close_on_exec(), lets ignore the fact that unshare_fd()
> > can fail and thus it makes sense to call it before point-of-no-return.
> >
> > Any other reason why we can't simply call unshare_files() at the end of
> > __do_execve_file() on success?
>
> The reason we call we call unshare_files is in case the files are shared
> with another process.  AKA old style linux threads, or someone being
> clever.  In that case we need a private copy of files for close on exec
> because we should not close the files of the other process that has not
> called exec.

This is clear,

> The only reason for calling unshare_files before the point of no return
> is so that we can get a good error message to the calling process if
> unshare_files fails.

OK, so you too think there are no other reasons.

> AKA it would be reasonable to move unshare_files to just above
> do_close_on_exec in flush_old_exec.  We could further make the
> unshare_files not return displaced and just drop it.

Yes, this is exactly what I had in mind.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec.
  2018-09-16 17:39   ` [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec Eric W. Biederman
@ 2018-09-17 15:49     ` Oleg Nesterov
  0 siblings, 0 replies; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-17 15:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel

On 09/16, Eric W. Biederman wrote:
>
> @@ -1291,6 +1292,12 @@ int flush_old_exec(struct linux_binprm * bprm)
>  	flush_thread();
>  	current->personality &= ~bprm->per_clear;
>
> +	retval = unshare_files(&displaced);

I was going to sugget basically the same changes, please feel free to add
my reviewed-by to 1-3.


Just for record. If we should really worry about unshare_files() failure
after de_thread() (imo we shouldn't), we can do another change:

	__do_execve_file:

		unshare_fd(CLONE_FILES, &bprm->unshared_copy);
		...


	flush_old_exec:

		de_thread();

		if (bprm->unshared_copy) {
			// now that we killed sub-threads recheck
			if (current->files->count > 1) {
				put_files_struct(current->files);
				current->files = bprm->unshared_copy;
			} else {
				put_files_struct(bprm->unshared_copy);
			}
				
		}

but again, I think your series is fine.

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 0/3] exec: Moving unshare_files_struct
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
                     ` (2 preceding siblings ...)
  2018-09-16 17:41   ` [RFC][PATCH 3/3] exec: Remove reset_files_struct Eric W. Biederman
@ 2018-09-17 15:59   ` Oleg Nesterov
  2018-09-18 22:18     ` Eric W. Biederman
  2018-09-17 16:24   ` Jeff Layton
  4 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-17 15:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel

On 09/16, Eric W. Biederman wrote:
>
> Oleg's patch to remove get_files_struct from proc means we don't need
> two counts in files_struct.

So it seems you agree with this patch at least in general.

OK, if nobody else objects I'll split this patch and send the series
tomorrow.

> Eric W. Biederman (3):
>       exec: Move unshare_files down to avoid locks being dropped on exec.
>       exec: Simplify unshare_files
>       exec: Remove reset_files_struct

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 2/3] exec: Simplify unshare_files
  2018-09-16 17:40   ` [RFC][PATCH 2/3] exec: Simplify unshare_files Eric W. Biederman
@ 2018-09-17 16:23     ` Oleg Nesterov
  2018-09-17 20:26       ` Eric W. Biederman
  0 siblings, 1 reply; 22+ messages in thread
From: Oleg Nesterov @ 2018-09-17 16:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel

absolutely off-topic question,

On 09/16, Eric W. Biederman wrote:
>
> @@ -747,11 +746,9 @@ void do_coredump(const siginfo_t *siginfo)
>  	}
>
>  	/* get us an unshared descriptor table; almost always a no-op */
> -	retval = unshare_files(&displaced);
> +	retval = unshare_files();

I fail to understand why do_coredump() needs unshare_files(). Could someone
explain?

And "almost always a no-op" above is not true, this is never a no-op in mt case;
other (killed) threads sleep in exit_mm() which is called before exit_files().

Oleg.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 0/3] exec: Moving unshare_files_struct
  2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
                     ` (3 preceding siblings ...)
  2018-09-17 15:59   ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Oleg Nesterov
@ 2018-09-17 16:24   ` Jeff Layton
  4 siblings, 0 replies; 22+ messages in thread
From: Jeff Layton @ 2018-09-17 16:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: viro, berrange, linux-fsdevel, linux-kernel, Oleg Nesterov

On Sun, 2018-09-16 at 19:38 +0200, Eric W. Biederman wrote:
> Paired with Oleg's patch to reduce the number of callers of
> get_files_struct it looks like we can simplify the basic idea of moving
> unshare_files in exec by quite a bit so that in net we have fewer lines
> of code.
> 
> The big simplification from Jeff's verion is that we take advantage
> of calling unshare_files past the point of no return.  Which removes
> the need for cleanup, and restoring ->files.  Which removes the
> need for blocking clone and unshare.
> 
> Oleg's patch to remove get_files_struct from proc means we don't need
> two counts in files_struct.
> 
> Which leaves us with the question of what are the races in fs/exec.c
> with respect to accessing files.  Semantically I don't think we care
> but we do need to be certain the implementation of exec is still robust.
> 
> These patches are still rough and ready and only compile tested but I
> believe they demonstrate what is possible.
> 
> Eric W. Biederman (3):
>       exec: Move unshare_files down to avoid locks being dropped on exec.
>       exec: Simplify unshare_files
>       exec: Remove reset_files_struct
> 
>  fs/coredump.c           |  5 +----
>  fs/exec.c               | 16 +++++-----------
>  fs/file.c               | 12 ------------
>  include/linux/fdtable.h |  3 +--
>  kernel/fork.c           | 12 ++++++------
>  5 files changed, 13 insertions(+), 35 deletions(-)
> 
> Eric

Much better than what I had proposed and I do like that diffstat. You
can add this from me too:

    Reviewed-by: Jeff Layton <jlayton@kernel.org>

Maybe get this into -next soon once it has passed some sanity tests?

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 2/3] exec: Simplify unshare_files
  2018-09-17 16:23     ` Oleg Nesterov
@ 2018-09-17 20:26       ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-17 20:26 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> absolutely off-topic question,
>
> On 09/16, Eric W. Biederman wrote:
>>
>> @@ -747,11 +746,9 @@ void do_coredump(const siginfo_t *siginfo)
>>  	}
>>
>>  	/* get us an unshared descriptor table; almost always a no-op */
>> -	retval = unshare_files(&displaced);
>> +	retval = unshare_files();
>
> I fail to understand why do_coredump() needs unshare_files(). Could someone
> explain?
>
> And "almost always a no-op" above is not true, this is never a no-op in mt case;
> other (killed) threads sleep in exit_mm() which is called before
> exit_files().

So I looked at the history and I have half an explanation.
179e037fc1370288188cb1f90b81156d75a3cb2d do_coredump(): make sure that descriptor table isn't shared

As far as I can tell this was Al Viro making certain that there were not
any races that had to be dealt with when accessing the file table during
execve.

Which gets to the heart of what we have to do before this set of changes
that we have been looking at can be merged.  We need to go through exec
and do_coredump if we wish to remove this call of unshare_files
and verify that everything is thread-safe, and using thread-safe idioms.

There is at least one place in exec where it is documented that the
access to files is not thread-safe in the comment.  I don't think any of
that is fundamentally hard but that work needs to be done for the rest
of this cleanup to be usable.

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/3] exec: separate thread_count for files_struct
  2018-09-17 15:24       ` Oleg Nesterov
@ 2018-09-17 20:45         ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-17 20:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel, Andrew Morton

Oleg Nesterov <oleg@redhat.com> writes:

> On 09/16, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > As for binder.c, in this case we probably actually want to unshare ->files
>> > on exec so we can ignore it?
>>
>> Looking at the binder case it only captures ->files on mmap.  Exec
>> ditches the mmap.  So if the order of operations are correct than
>> the dropping of the old mm will also drop the count on files_struct
>> held by binder.
>>
>> So semantically binder should not effect locks on exec,
>
> Agreed, but it does.
>
> Before your "[PATCH 0/3] exec: Moving unshare_files_struct" unshare_files()
> is called before exec_mmap().
>
> And even with this series we can have another CLONE_VM process.
>
> Howver, I think this doesn't really matter. binder does __fd_install(files),
> so if it actually has a reference to execing_task->files, I think it should
> be unshared anyway.
>
>> In short as long as we get the oder of operations correct we should be
>> able to safely ignore binder, and not have binder affect the results of
>> this code.
>
> Agreed.

I may have spoken too soon.  Binder uses schedule_work to call
put_files_struct from munmap.  So the files->count may still be elevated
after the mm is put.  Ick.

Eric


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 0/3] exec: Moving unshare_files_struct
  2018-09-17 15:59   ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Oleg Nesterov
@ 2018-09-18 22:18     ` Eric W. Biederman
  0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2018-09-18 22:18 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jeff Layton, viro, berrange, linux-fsdevel, linux-kernel

Oleg Nesterov <oleg@redhat.com> writes:

> On 09/16, Eric W. Biederman wrote:
>>
>> Oleg's patch to remove get_files_struct from proc means we don't need
>> two counts in files_struct.
>
> So it seems you agree with this patch at least in general.

I do I think in the context we are looking at things finding a way not
to need to bump files_struct->count looks like it will simplify a lot of
things and there are not many cases we need to consider.

I have been thorough and mentioned the binder case and other case but
that isn't because I disagree I simply don't want us to overlook weird
or tricky cases.

It appears that the more often that we look at these bits the better
a solution we wind up writing.

> OK, if nobody else objects I'll split this patch and send the series
> tomorrow.

No objection from me.

>> Eric W. Biederman (3):
>>       exec: Move unshare_files down to avoid locks being dropped on exec.
>>       exec: Simplify unshare_files
>>       exec: Remove reset_files_struct
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

Eric

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-09-18 22:19 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-14 10:53 [PATCH v3 0/3] exec: fix passing of file locks across execve in multithreaded processes Jeff Layton
2018-09-14 10:53 ` [PATCH v3 1/3] exec: separate thread_count for files_struct Jeff Layton
2018-09-15 16:04   ` Oleg Nesterov
2018-09-16 16:10     ` Eric W. Biederman
2018-09-17 15:24       ` Oleg Nesterov
2018-09-17 20:45         ` Eric W. Biederman
2018-09-14 10:53 ` [PATCH v3 2/3] exec: delay clone(CLONE_FILES) if task associated with current files_struct is exec'ing Jeff Layton
2018-09-14 10:53 ` [PATCH v3 3/3] exec: do unshare_files after de_thread Jeff Layton
2018-09-15 16:37   ` Oleg Nesterov
2018-09-16 16:49     ` Eric W. Biederman
2018-09-17 15:28       ` Oleg Nesterov
2018-09-16 16:59   ` Eric W. Biederman
2018-09-16 17:38 ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Eric W. Biederman
2018-09-16 17:39   ` [RFC][PATCH 1/3] exec: Move unshare_files down to avoid locks being dropped on exec Eric W. Biederman
2018-09-17 15:49     ` Oleg Nesterov
2018-09-16 17:40   ` [RFC][PATCH 2/3] exec: Simplify unshare_files Eric W. Biederman
2018-09-17 16:23     ` Oleg Nesterov
2018-09-17 20:26       ` Eric W. Biederman
2018-09-16 17:41   ` [RFC][PATCH 3/3] exec: Remove reset_files_struct Eric W. Biederman
2018-09-17 15:59   ` [RFC][PATCH 0/3] exec: Moving unshare_files_struct Oleg Nesterov
2018-09-18 22:18     ` Eric W. Biederman
2018-09-17 16:24   ` Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).