linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] Implementing kernel_execve
@ 2020-07-14 13:27 Eric W. Biederman
  2020-07-14 13:28 ` [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h Eric W. Biederman
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


This set of changes implements kernel_execve to remove the need for
kernel threads to pass in pointers to in-kernel data structures
to functions that take __user pointers.   Which is part of the
greater removal of set_fs work.

This set of changes makes do_execve static and so I have updated the
comments.  This affects the comments in the x86 entry point code
and the comments in tomoyo.  I believe I have updated them correctly.
If not please let me know.

I have moved the calls of copy_strings before the call of
security_bprm_creds_for_exec.  Which might be of interest to the
security folks.  I can't see that it matters but I have copied the
security folks just to be certain.

By moving the initialization of the new stack that copy_strings does
earlier it becomes possible to copy all of the parameters to exec before
anything else is done which makes it possible to have one function
kernel_execve that uncondtionally handles copying parameters from kernel
space, and another function do_execveat_common which handles copying
parameters from userspace.

This work was inspired by Christoph Hellwig's similar patchset, which my
earlier work to remove the file parameter to do_execveat_common
conflicted with.
https://lore.kernel.org/linux-fsdevel/20200627072704.2447163-1-hch@lst.de/

I figured that after causing all of that trouble for the set_fs work
the least I could do is implement the change myself.

The big practical change from Christoph's work is that he did not
separate out the copying of parameters from the rest of the work of
exec, which did not help the maintainability of the code.

Please let me know if you see something wrong.

This set of changes is against my exec-next branch:
https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git exec-next

Eric W. Biederman (7):
      exec: Remove unnecessary spaces from binfmts.h
      exec: Factor out alloc_bprm
      exec: Move initialization of bprm->filename into alloc_bprm
      exec: Move bprm_mm_init into alloc_bprm
      exec: Factor bprm_execve out of do_execve_common
      exec: Factor bprm_stack_limits out of prepare_arg_pages
      exec: Implement kernel_execve

 arch/x86/entry/entry_32.S      |   2 +-
 arch/x86/entry/entry_64.S      |   2 +-
 arch/x86/kernel/unwind_frame.c |   2 +-
 fs/exec.c                      | 301 ++++++++++++++++++++++++++++-------------
 include/linux/binfmts.h        |  20 ++-
 init/main.c                    |   4 +-
 kernel/umh.c                   |   6 +-
 security/tomoyo/common.h       |   2 +-
 security/tomoyo/domain.c       |   4 +-
 security/tomoyo/tomoyo.c       |   4 +-
 10 files changed, 224 insertions(+), 123 deletions(-)

Eric

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
@ 2020-07-14 13:28 ` Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:28   ` Christoph Hellwig
  2020-07-14 13:29 ` [PATCH 2/7] exec: Factor out alloc_bprm Eric W. Biederman
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


The general convention in the linux kernel is to define a pointer
member as "type *name".  The declaration of struct linux_binprm has
several pointer defined as "type * name".  Update them to the
form of "type *name" for consistency.

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/binfmts.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 7c27d7b57871..eb5cb8df5485 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -45,15 +45,15 @@ struct linux_binprm {
 #ifdef __alpha__
 	unsigned int taso:1;
 #endif
-	struct file * executable; /* Executable to pass to the interpreter */
-	struct file * interpreter;
-	struct file * file;
+	struct file *executable; /* Executable to pass to the interpreter */
+	struct file *interpreter;
+	struct file *file;
 	struct cred *cred;	/* new credentials */
 	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
 	unsigned int per_clear;	/* bits to clear in current->personality */
 	int argc, envc;
-	const char * filename;	/* Name of binary as seen by procps */
-	const char * interp;	/* Name of the binary really executed. Most
+	const char *filename;	/* Name of binary as seen by procps */
+	const char *interp;	/* Name of the binary really executed. Most
 				   of the time same as filename, but could be
 				   different for binfmt_{misc,script} */
 	unsigned interp_flags;
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/7] exec: Factor out alloc_bprm
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
  2020-07-14 13:28 ` [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h Eric W. Biederman
@ 2020-07-14 13:29 ` Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:30   ` Christoph Hellwig
  2020-07-14 13:29 ` [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm Eric W. Biederman
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.

To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier.  Move the allocation
of the bprm into it's own function (alloc_bprm) and move the call of
alloc_bprm before unshare_files so that bprm can ultimately be
allocated, the arguments can be placed on the new stack, and then the
bprm can be passed into the core of exec.

Neither the allocation of struct binprm nor the unsharing depend upon each
other so swapping the order in which they are called is trivially safe.

To keep things consistent the order of cleanup at the end of
do_execve_common swapped to match the order of initialization.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 23dfbb820626..526156d6461d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1560,6 +1560,14 @@ static void free_bprm(struct linux_binprm *bprm)
 	kfree(bprm);
 }
 
+static struct linux_binprm *alloc_bprm(void)
+{
+	struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	if (!bprm)
+		return ERR_PTR(-ENOMEM);
+	return bprm;
+}
+
 int bprm_change_interp(const char *interp, struct linux_binprm *bprm)
 {
 	/* If a binfmt changed the interp, free it first. */
@@ -1848,18 +1856,19 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 * further execve() calls fail. */
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	retval = unshare_files(&displaced);
-	if (retval)
+	bprm = alloc_bprm();
+	if (IS_ERR(bprm)) {
+		retval = PTR_ERR(bprm);
 		goto out_ret;
+	}
 
-	retval = -ENOMEM;
-	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
-	if (!bprm)
-		goto out_files;
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out_free;
 
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
-		goto out_free;
+		goto out_files;
 
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
@@ -1956,13 +1965,13 @@ static int do_execveat_common(int fd, struct filename *filename,
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
 
+out_files:
+	if (displaced)
+		reset_files_struct(displaced);
 out_free:
 	free_bprm(bprm);
 	kfree(pathbuf);
 
-out_files:
-	if (displaced)
-		reset_files_struct(displaced);
 out_ret:
 	putname(filename);
 	return retval;
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
  2020-07-14 13:28 ` [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h Eric W. Biederman
  2020-07-14 13:29 ` [PATCH 2/7] exec: Factor out alloc_bprm Eric W. Biederman
@ 2020-07-14 13:29 ` Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:34   ` Christoph Hellwig
  2020-07-14 13:30 ` [PATCH 4/7] exec: Move bprm_mm_init " Eric W. Biederman
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.

To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier.  Move the computation
of bprm->filename and possible allocation of a name in the case
of execveat into alloc_bprm to make that possible.

The exectuable name, the arguments, and the environment are
copied into the new usermode stack which is stored in bprm
until exec passes the point of no return.

As the executable name is copied first onto the usermode stack
it needs to be known.  As there are no dependencies to computing
the executable name, compute it early in alloc_bprm.

As an implementation detail if the filename needs to be generated
because it embeds a file descriptor store that filename in a new field
bprm->fdpath, and free it in free_bprm.  Previously this was done in
an independent variable pathbuf.  I have renamed pathbuf fdpath
because fdpath is more suggestive of what kind of path is in the
variable.  I moved fdpath into struct linux_binprm because it is
tightly tied to the other variables in struct linux_binprm, and as
such is needed to allow the call alloc_binprm to move.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c               | 61 ++++++++++++++++++++++-------------------
 include/linux/binfmts.h |  1 +
 2 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 526156d6461d..7e8af27dd199 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1557,15 +1557,37 @@ static void free_bprm(struct linux_binprm *bprm)
 	/* If a binfmt changed the interp, free it. */
 	if (bprm->interp != bprm->filename)
 		kfree(bprm->interp);
+	kfree(bprm->fdpath);
 	kfree(bprm);
 }
 
-static struct linux_binprm *alloc_bprm(void)
+static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
 {
 	struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	int retval = -ENOMEM;
 	if (!bprm)
-		return ERR_PTR(-ENOMEM);
+		goto out;
+
+	if (fd == AT_FDCWD || filename->name[0] == '/') {
+		bprm->filename = filename->name;
+	} else {
+		if (filename->name[0] == '\0')
+			bprm->fdpath = kasprintf(GFP_KERNEL, "/dev/fd/%d", fd);
+		else
+			bprm->fdpath = kasprintf(GFP_KERNEL, "/dev/fd/%d/%s",
+						  fd, filename->name);
+		if (!bprm->fdpath)
+			goto out_free;
+
+		bprm->filename = bprm->fdpath;
+	}
+	bprm->interp = bprm->filename;
 	return bprm;
+
+out_free:
+	free_bprm(bprm);
+out:
+	return ERR_PTR(retval);
 }
 
 int bprm_change_interp(const char *interp, struct linux_binprm *bprm)
@@ -1831,7 +1853,6 @@ static int do_execveat_common(int fd, struct filename *filename,
 			      struct user_arg_ptr envp,
 			      int flags)
 {
-	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
 	struct file *file;
 	struct files_struct *displaced;
@@ -1856,7 +1877,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 * further execve() calls fail. */
 	current->flags &= ~PF_NPROC_EXCEEDED;
 
-	bprm = alloc_bprm();
+	bprm = alloc_bprm(fd, filename);
 	if (IS_ERR(bprm)) {
 		retval = PTR_ERR(bprm);
 		goto out_ret;
@@ -1881,28 +1902,14 @@ static int do_execveat_common(int fd, struct filename *filename,
 	sched_exec();
 
 	bprm->file = file;
-	if (fd == AT_FDCWD || filename->name[0] == '/') {
-		bprm->filename = filename->name;
-	} else {
-		if (filename->name[0] == '\0')
-			pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d", fd);
-		else
-			pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d/%s",
-					    fd, filename->name);
-		if (!pathbuf) {
-			retval = -ENOMEM;
-			goto out_unmark;
-		}
-		/*
-		 * Record that a name derived from an O_CLOEXEC fd will be
-		 * inaccessible after exec. Relies on having exclusive access to
-		 * current->files (due to unshare_files above).
-		 */
-		if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
-			bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
-		bprm->filename = pathbuf;
-	}
-	bprm->interp = bprm->filename;
+	/*
+	 * Record that a name derived from an O_CLOEXEC fd will be
+	 * inaccessible after exec. Relies on having exclusive access to
+	 * current->files (due to unshare_files above).
+	 */
+	if (bprm->fdpath &&
+	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
+		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
 
 	retval = bprm_mm_init(bprm);
 	if (retval)
@@ -1941,7 +1948,6 @@ static int do_execveat_common(int fd, struct filename *filename,
 	acct_update_integrals(current);
 	task_numa_free(current, false);
 	free_bprm(bprm);
-	kfree(pathbuf);
 	putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
@@ -1970,7 +1976,6 @@ static int do_execveat_common(int fd, struct filename *filename,
 		reset_files_struct(displaced);
 out_free:
 	free_bprm(bprm);
-	kfree(pathbuf);
 
 out_ret:
 	putname(filename);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index eb5cb8df5485..8e9e1b0c8eb8 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -56,6 +56,7 @@ struct linux_binprm {
 	const char *interp;	/* Name of the binary really executed. Most
 				   of the time same as filename, but could be
 				   different for binfmt_{misc,script} */
+	const char *fdpath;	/* generated filename for execveat */
 	unsigned interp_flags;
 	int execfd;		/* File descriptor of the executable */
 	unsigned long loader, exec;
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/7] exec: Move bprm_mm_init into alloc_bprm
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
                   ` (2 preceding siblings ...)
  2020-07-14 13:29 ` [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm Eric W. Biederman
@ 2020-07-14 13:30 ` Eric W. Biederman
  2020-07-14 21:37   ` Kees Cook
  2020-07-15  6:35   ` Christoph Hellwig
  2020-07-14 13:30 ` [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common Eric W. Biederman
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


Currently it is necessary for the usermode helper code and the code that
launches init to use set_fs so that pages coming from the kernel look like
they are coming from userspace.

To allow that usage of set_fs to be removed cleanly the argument copying
from userspace needs to happen earlier.  Move the allocation and
initialization of bprm->mm into alloc_bprm so that the bprm->mm is
available early to store the new user stack into.  This is a prerequisite
for copying argv and envp into the new user stack early before ther rest of
exec.

To keep the things consistent the cleanup of bprm->mm is moved into
free_bprm.  So that bprm->mm will be cleaned up whenever bprm->mm is
allocated and free_bprm are called.

Moving bprm_mm_init earlier is safe as it does not depend on any files,
current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
table is shared. (AKA bprm_mm_init does not depend on any of the code that
happens between alloc_bprm and where it was previously called.)

This moves bprm->mm cleanup after current->fs->in_exec is set to 0.  This
is safe because current->fs->in_exec is only used to preventy taking an
additional reference on the fs_struct.

This moves bprm->mm cleanup after current->in_execve is set to 0.  This is
safe because current->in_execve is only used by the lsms (apparmor and
tomoyou) and always for LSM specific functions, never for anything to do
with the mm.

This adds bprm->mm cleanup into the successful return path.  This is safe
because being on the successful return path implies that begin_new_exec
succeeded and set brpm->mm to NULL.  As bprm->mm is NULL bprm cleanup I am
moving into free_bprm will do nothing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7e8af27dd199..afb168bf5e23 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1543,6 +1543,10 @@ static int prepare_bprm_creds(struct linux_binprm *bprm)
 
 static void free_bprm(struct linux_binprm *bprm)
 {
+	if (bprm->mm) {
+		acct_arg_size(bprm, 0);
+		mmput(bprm->mm);
+	}
 	free_arg_pages(bprm);
 	if (bprm->cred) {
 		mutex_unlock(&current->signal->cred_guard_mutex);
@@ -1582,6 +1586,10 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
 		bprm->filename = bprm->fdpath;
 	}
 	bprm->interp = bprm->filename;
+
+	retval = bprm_mm_init(bprm);
+	if (retval)
+		goto out_free;
 	return bprm;
 
 out_free:
@@ -1911,10 +1919,6 @@ static int do_execveat_common(int fd, struct filename *filename,
 	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
 		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
 
-	retval = bprm_mm_init(bprm);
-	if (retval)
-		goto out_unmark;
-
 	retval = prepare_arg_pages(bprm, argv, envp);
 	if (retval < 0)
 		goto out;
@@ -1962,10 +1966,6 @@ static int do_execveat_common(int fd, struct filename *filename,
 	 */
 	if (bprm->point_of_no_return && !fatal_signal_pending(current))
 		force_sigsegv(SIGSEGV);
-	if (bprm->mm) {
-		acct_arg_size(bprm, 0);
-		mmput(bprm->mm);
-	}
 
 out_unmark:
 	current->fs->in_exec = 0;
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
                   ` (3 preceding siblings ...)
  2020-07-14 13:30 ` [PATCH 4/7] exec: Move bprm_mm_init " Eric W. Biederman
@ 2020-07-14 13:30 ` Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:36   ` Christoph Hellwig
  2020-07-14 13:31 ` [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages Eric W. Biederman
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


Currently it is necessary for the usermode helper code and the code
that launches init to use set_fs so that pages coming from the kernel
look like they are coming from userspace.

To allow that usage of set_fs to be removed cleanly the argument
copying from userspace needs to happen earlier.  Factor bprm_execve
out of do_execve_common to separate out the copying of arguments
to the newe stack, and the rest of exec.

In separating bprm_execve from do_execve_common the copying
of the arguments onto the new stack happens earlier.

As the copying of the arguments does not depend any security hooks,
files, the file table, current->in_execve, current->fs->in_exec,
bprm->unsafe, or creds this is safe.

Likewise the security hook security_creds_for_exec does not depend upon
preventing the argument copying from happening.

In addition to making it possible to implement kernel_execve that
performs the copying differently, this separation of bprm_execve from
do_execve_common makes for a nice separation of responsibilities making
the exec code easier to navigate.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 108 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 58 insertions(+), 50 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index afb168bf5e23..50508892fa71 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1856,44 +1856,16 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
-			      struct user_arg_ptr argv,
-			      struct user_arg_ptr envp,
-			      int flags)
+static int bprm_execve(struct linux_binprm *bprm,
+		       int fd, struct filename *filename, int flags)
 {
-	struct linux_binprm *bprm;
 	struct file *file;
 	struct files_struct *displaced;
 	int retval;
 
-	if (IS_ERR(filename))
-		return PTR_ERR(filename);
-
-	/*
-	 * We move the actual failure in case of RLIMIT_NPROC excess from
-	 * set*uid() to execve() because too many poorly written programs
-	 * don't check setuid() return code.  Here we additionally recheck
-	 * whether NPROC limit is still exceeded.
-	 */
-	if ((current->flags & PF_NPROC_EXCEEDED) &&
-	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
-		retval = -EAGAIN;
-		goto out_ret;
-	}
-
-	/* We're below the limit (still or again), so we don't want to make
-	 * further execve() calls fail. */
-	current->flags &= ~PF_NPROC_EXCEEDED;
-
-	bprm = alloc_bprm(fd, filename);
-	if (IS_ERR(bprm)) {
-		retval = PTR_ERR(bprm);
-		goto out_ret;
-	}
-
 	retval = unshare_files(&displaced);
 	if (retval)
-		goto out_free;
+		return retval;
 
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
@@ -1919,28 +1891,11 @@ static int do_execveat_common(int fd, struct filename *filename,
 	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
 		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
 
-	retval = prepare_arg_pages(bprm, argv, envp);
-	if (retval < 0)
-		goto out;
-
 	/* Set the unchanging part of bprm->cred */
 	retval = security_bprm_creds_for_exec(bprm);
 	if (retval)
 		goto out;
 
-	retval = copy_string_kernel(bprm->filename, bprm);
-	if (retval < 0)
-		goto out;
-
-	bprm->exec = bprm->p;
-	retval = copy_strings(bprm->envc, envp, bprm);
-	if (retval < 0)
-		goto out;
-
-	retval = copy_strings(bprm->argc, argv, bprm);
-	if (retval < 0)
-		goto out;
-
 	retval = exec_binprm(bprm);
 	if (retval < 0)
 		goto out;
@@ -1951,8 +1906,6 @@ static int do_execveat_common(int fd, struct filename *filename,
 	rseq_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
-	free_bprm(bprm);
-	putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
 	return retval;
@@ -1974,6 +1927,61 @@ static int do_execveat_common(int fd, struct filename *filename,
 out_files:
 	if (displaced)
 		reset_files_struct(displaced);
+
+	return retval;
+}
+
+static int do_execveat_common(int fd, struct filename *filename,
+			      struct user_arg_ptr argv,
+			      struct user_arg_ptr envp,
+			      int flags)
+{
+	struct linux_binprm *bprm;
+	int retval;
+
+	if (IS_ERR(filename))
+		return PTR_ERR(filename);
+
+	/*
+	 * We move the actual failure in case of RLIMIT_NPROC excess from
+	 * set*uid() to execve() because too many poorly written programs
+	 * don't check setuid() return code.  Here we additionally recheck
+	 * whether NPROC limit is still exceeded.
+	 */
+	if ((current->flags & PF_NPROC_EXCEEDED) &&
+	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+		retval = -EAGAIN;
+		goto out_ret;
+	}
+
+	/* We're below the limit (still or again), so we don't want to make
+	 * further execve() calls fail. */
+	current->flags &= ~PF_NPROC_EXCEEDED;
+
+	bprm = alloc_bprm(fd, filename);
+	if (IS_ERR(bprm)) {
+		retval = PTR_ERR(bprm);
+		goto out_ret;
+	}
+
+	retval = prepare_arg_pages(bprm, argv, envp);
+	if (retval < 0)
+		goto out_free;
+
+	retval = copy_string_kernel(bprm->filename, bprm);
+	if (retval < 0)
+		goto out_free;
+	bprm->exec = bprm->p;
+
+	retval = copy_strings(bprm->envc, envp, bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = copy_strings(bprm->argc, argv, bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:
 	free_bprm(bprm);
 
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
                   ` (4 preceding siblings ...)
  2020-07-14 13:30 ` [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common Eric W. Biederman
@ 2020-07-14 13:31 ` Eric W. Biederman
  2020-07-14 21:41   ` Kees Cook
  2020-07-15  6:38   ` Christoph Hellwig
  2020-07-14 13:31 ` [PATCH 7/7] exec: Implement kernel_execve Eric W. Biederman
  2020-07-14 15:32 ` [PATCH 0/7] Implementing kernel_execve Linus Torvalds
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


In preparation for implementiong kernel_execve (which will take kernel
pointers not userspace pointers) factor out bprm_stack_limits out of
prepare_arg_pages.  This separates the counting which depends upon the
getting data from userspace from the calculations of the stack limits
which is usable in kernel_execve.

The remove prepare_args_pages and compute bprm->argc and bprm->envc
directly in do_execveat_common, before bprm_stack_limits is called.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 50508892fa71..f8135dc149b3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -448,19 +448,10 @@ static int count(struct user_arg_ptr argv, int max)
 	return i;
 }
 
-static int prepare_arg_pages(struct linux_binprm *bprm,
-			struct user_arg_ptr argv, struct user_arg_ptr envp)
+static int bprm_stack_limits(struct linux_binprm *bprm)
 {
 	unsigned long limit, ptr_size;
 
-	bprm->argc = count(argv, MAX_ARG_STRINGS);
-	if (bprm->argc < 0)
-		return bprm->argc;
-
-	bprm->envc = count(envp, MAX_ARG_STRINGS);
-	if (bprm->envc < 0)
-		return bprm->envc;
-
 	/*
 	 * Limit to 1/4 of the max stack size or 3/4 of _STK_LIM
 	 * (whichever is smaller) for the argv+env strings.
@@ -1964,7 +1955,17 @@ static int do_execveat_common(int fd, struct filename *filename,
 		goto out_ret;
 	}
 
-	retval = prepare_arg_pages(bprm, argv, envp);
+	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval < 0)
+		goto out_free;
+	bprm->argc = retval;
+
+	retval = count(envp, MAX_ARG_STRINGS);
+	if (retval < 0)
+		goto out_free;
+	bprm->envc = retval;
+
+	retval = bprm_stack_limits(bprm);
 	if (retval < 0)
 		goto out_free;
 
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 7/7] exec: Implement kernel_execve
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
                   ` (5 preceding siblings ...)
  2020-07-14 13:31 ` [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages Eric W. Biederman
@ 2020-07-14 13:31 ` Eric W. Biederman
  2020-07-14 21:49   ` Kees Cook
  2020-07-15  6:42   ` Christoph Hellwig
  2020-07-14 15:32 ` [PATCH 0/7] Implementing kernel_execve Linus Torvalds
  7 siblings, 2 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-14 13:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Kees Cook, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig


To allow the kernel not to play games with set_fs to call exec
implement kernel_execve.  The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.

The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.

The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.

The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances.  In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.

Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/x86/entry/entry_32.S      |  2 +-
 arch/x86/entry/entry_64.S      |  2 +-
 arch/x86/kernel/unwind_frame.c |  2 +-
 fs/exec.c                      | 88 +++++++++++++++++++++++++++++++++-
 include/linux/binfmts.h        |  9 +---
 init/main.c                    |  4 +-
 kernel/umh.c                   |  6 +--
 security/tomoyo/common.h       |  2 +-
 security/tomoyo/domain.c       |  4 +-
 security/tomoyo/tomoyo.c       |  4 +-
 10 files changed, 100 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 024d7d276cd4..8f4e085ee06d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -854,7 +854,7 @@ SYM_CODE_START(ret_from_fork)
 	CALL_NOSPEC ebx
 	/*
 	 * A kernel thread is allowed to return here after successfully
-	 * calling do_execve().  Exit to userspace to complete the execve()
+	 * calling kernel_execve().  Exit to userspace to complete the execve()
 	 * syscall.
 	 */
 	movl	$0, PT_EAX(%esp)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d2a00c97e53f..73c7e255256b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -293,7 +293,7 @@ SYM_CODE_START(ret_from_fork)
 	CALL_NOSPEC rbx
 	/*
 	 * A kernel thread is allowed to return here after successfully
-	 * calling do_execve().  Exit to userspace to complete the execve()
+	 * calling kernel_execve().  Exit to userspace to complete the execve()
 	 * syscall.
 	 */
 	movq	$0, RAX(%rsp)
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 722a85f3b2dd..e40b4942157f 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -275,7 +275,7 @@ bool unwind_next_frame(struct unwind_state *state)
 		 * This user_mode() check is slightly broader than a PF_KTHREAD
 		 * check because it also catches the awkward situation where a
 		 * newly forked kthread transitions into a user task by calling
-		 * do_execve(), which eventually clears PF_KTHREAD.
+		 * kernel_execve(), which eventually clears PF_KTHREAD.
 		 */
 		if (!user_mode(regs))
 			goto the_end;
diff --git a/fs/exec.c b/fs/exec.c
index f8135dc149b3..3698252719a3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -448,6 +448,23 @@ static int count(struct user_arg_ptr argv, int max)
 	return i;
 }
 
+static int count_strings_kernel(const char *const *argv)
+{
+	int i;
+
+	if (!argv)
+		return 0;
+
+	for (i = 0; argv[i]; ++i) {
+		if (i >= MAX_ARG_STRINGS)
+			return -E2BIG;
+		if (fatal_signal_pending(current))
+			return -ERESTARTNOHAND;
+		cond_resched();
+	}
+	return i;
+}
+
 static int bprm_stack_limits(struct linux_binprm *bprm)
 {
 	unsigned long limit, ptr_size;
@@ -624,6 +641,20 @@ int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
 }
 EXPORT_SYMBOL(copy_string_kernel);
 
+static int copy_strings_kernel(int argc, const char *const *argv,
+			       struct linux_binprm *bprm)
+{
+	while (argc-- > 0) {
+		int ret = copy_string_kernel(argv[argc], bprm);
+		if (ret < 0)
+			return ret;
+		if (fatal_signal_pending(current))
+			return -ERESTARTNOHAND;
+		cond_resched();
+	}
+	return 0;
+}
+
 #ifdef CONFIG_MMU
 
 /*
@@ -1991,7 +2022,60 @@ static int do_execveat_common(int fd, struct filename *filename,
 	return retval;
 }
 
-int do_execve(struct filename *filename,
+int kernel_execve(const char *kernel_filename,
+		  const char *const *argv, const char *const *envp)
+{
+	struct filename *filename;
+	struct linux_binprm *bprm;
+	int fd = AT_FDCWD;
+	int retval;
+
+	filename = getname_kernel(kernel_filename);
+	if (IS_ERR(filename))
+		return PTR_ERR(filename);
+
+	bprm = alloc_bprm(fd, filename);
+	if (IS_ERR(bprm)) {
+		retval = PTR_ERR(bprm);
+		goto out_ret;
+	}
+
+	retval = count_strings_kernel(argv);
+	if (retval < 0)
+		goto out_free;
+	bprm->argc = retval;
+
+	retval = count_strings_kernel(envp);
+	if (retval < 0)
+		goto out_free;
+	bprm->envc = retval;
+
+	retval = bprm_stack_limits(bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = copy_string_kernel(bprm->filename, bprm);
+	if (retval < 0)
+		goto out_free;
+	bprm->exec = bprm->p;
+
+	retval = copy_strings_kernel(bprm->envc, envp, bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = copy_strings_kernel(bprm->argc, argv, bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = bprm_execve(bprm, fd, filename, 0);
+out_free:
+	free_bprm(bprm);
+out_ret:
+	putname(filename);
+	return retval;
+}
+
+static int do_execve(struct filename *filename,
 	const char __user *const __user *__argv,
 	const char __user *const __user *__envp)
 {
@@ -2000,7 +2084,7 @@ int do_execve(struct filename *filename,
 	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
 }
 
-int do_execveat(int fd, struct filename *filename,
+static int do_execveat(int fd, struct filename *filename,
 		const char __user *const __user *__argv,
 		const char __user *const __user *__envp,
 		int flags)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 8e9e1b0c8eb8..0571701ab1c5 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -135,12 +135,7 @@ int copy_string_kernel(const char *arg, struct linux_binprm *bprm);
 extern void set_binfmt(struct linux_binfmt *new);
 extern ssize_t read_code(struct file *, unsigned long, loff_t, size_t);
 
-extern int do_execve(struct filename *,
-		     const char __user * const __user *,
-		     const char __user * const __user *);
-extern int do_execveat(int, struct filename *,
-		       const char __user * const __user *,
-		       const char __user * const __user *,
-		       int);
+int kernel_execve(const char *filename,
+		  const char *const *argv, const char *const *envp);
 
 #endif /* _LINUX_BINFMTS_H */
diff --git a/init/main.c b/init/main.c
index 0ead83e86b5a..78ccec5c28f3 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1329,9 +1329,7 @@ static int run_init_process(const char *init_filename)
 	pr_debug("  with environment:\n");
 	for (p = envp_init; *p; p++)
 		pr_debug("    %s\n", *p);
-	return do_execve(getname_kernel(init_filename),
-		(const char __user *const __user *)argv_init,
-		(const char __user *const __user *)envp_init);
+	return kernel_execve(init_filename, argv_init, envp_init);
 }
 
 static int try_to_run_init_process(const char *init_filename)
diff --git a/kernel/umh.c b/kernel/umh.c
index 6ca2096298b9..a25433f9cd9a 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -98,9 +98,9 @@ static int call_usermodehelper_exec_async(void *data)
 
 	commit_creds(new);
 
-	retval = do_execve(getname_kernel(sub_info->path),
-			   (const char __user *const __user *)sub_info->argv,
-			   (const char __user *const __user *)sub_info->envp);
+	retval = kernel_execve(sub_info->path,
+			       (const char *const *)sub_info->argv,
+			       (const char *const *)sub_info->envp);
 out:
 	sub_info->retval = retval;
 	/*
diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 050473df5809..85246b9df7ca 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -425,7 +425,7 @@ struct tomoyo_request_info {
 	struct tomoyo_obj_info *obj;
 	/*
 	 * For holding parameters specific to execve() request.
-	 * NULL if not dealing do_execve().
+	 * NULL if not dealing execve().
 	 */
 	struct tomoyo_execve *ee;
 	struct tomoyo_domain_info *domain;
diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index 7869d6a9980b..53b3e1f5f227 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -767,7 +767,7 @@ int tomoyo_find_next_domain(struct linux_binprm *bprm)
 
 	/*
 	 * Check for domain transition preference if "file execute" matched.
-	 * If preference is given, make do_execve() fail if domain transition
+	 * If preference is given, make execve() fail if domain transition
 	 * has failed, for domain transition preference should be used with
 	 * destination domain defined.
 	 */
@@ -810,7 +810,7 @@ int tomoyo_find_next_domain(struct linux_binprm *bprm)
 		snprintf(ee->tmp, TOMOYO_EXEC_TMPSIZE - 1, "<%s>",
 			 candidate->name);
 		/*
-		 * Make do_execve() fail if domain transition across namespaces
+		 * Make execve() fail if domain transition across namespaces
 		 * has failed.
 		 */
 		reject_on_transition_failure = true;
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index f9adddc42ac8..1f3cd432d830 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -93,7 +93,7 @@ static int tomoyo_bprm_check_security(struct linux_binprm *bprm)
 	struct tomoyo_task *s = tomoyo_task(current);
 
 	/*
-	 * Execute permission is checked against pathname passed to do_execve()
+	 * Execute permission is checked against pathname passed to execve()
 	 * using current domain.
 	 */
 	if (!s->old_domain_info) {
@@ -307,7 +307,7 @@ static int tomoyo_file_fcntl(struct file *file, unsigned int cmd,
  */
 static int tomoyo_file_open(struct file *f)
 {
-	/* Don't check read permission here if called from do_execve(). */
+	/* Don't check read permission here if called from execve(). */
 	if (current->in_execve)
 		return 0;
 	return tomoyo_check_open_permission(tomoyo_domain(), &f->f_path,
-- 
2.25.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/7] Implementing kernel_execve
  2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
                   ` (6 preceding siblings ...)
  2020-07-14 13:31 ` [PATCH 7/7] exec: Implement kernel_execve Eric W. Biederman
@ 2020-07-14 15:32 ` Linus Torvalds
  7 siblings, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2020-07-14 15:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Kernel Mailing List, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa, LSM List,
	Serge E. Hallyn, James Morris, Kentaro Takeda, Casey Schaufler,
	John Johansen, Christoph Hellwig

On Tue, Jul 14, 2020 at 6:30 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Please let me know if you see something wrong.

Ack, looks good to me.

                 Linus

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] exec: Move bprm_mm_init into alloc_bprm
  2020-07-14 13:30 ` [PATCH 4/7] exec: Move bprm_mm_init " Eric W. Biederman
@ 2020-07-14 21:37   ` Kees Cook
  2020-07-15  6:35   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:30:02AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code that
> launches init to use set_fs so that pages coming from the kernel look like
> they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument copying
> from userspace needs to happen earlier.  Move the allocation and
> initialization of bprm->mm into alloc_bprm so that the bprm->mm is
> available early to store the new user stack into.  This is a prerequisite
> for copying argv and envp into the new user stack early before ther rest of
> exec.
> 
> To keep the things consistent the cleanup of bprm->mm is moved into
> free_bprm.  So that bprm->mm will be cleaned up whenever bprm->mm is
> allocated and free_bprm are called.
> 
> Moving bprm_mm_init earlier is safe as it does not depend on any files,
> current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
> table is shared. (AKA bprm_mm_init does not depend on any of the code that
> happens between alloc_bprm and where it was previously called.)
> 
> This moves bprm->mm cleanup after current->fs->in_exec is set to 0.  This
> is safe because current->fs->in_exec is only used to preventy taking an
> additional reference on the fs_struct.
> 
> This moves bprm->mm cleanup after current->in_execve is set to 0.  This is
> safe because current->in_execve is only used by the lsms (apparmor and
> tomoyou) and always for LSM specific functions, never for anything to do
> with the mm.
> 
> This adds bprm->mm cleanup into the successful return path.  This is safe
> because being on the successful return path implies that begin_new_exec
> succeeded and set brpm->mm to NULL.  As bprm->mm is NULL bprm cleanup I am
> moving into free_bprm will do nothing.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

This looks correct, and is required before moving the arg pages stuff,
so good. :)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common
  2020-07-14 13:30 ` [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common Eric W. Biederman
@ 2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:36   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:30:30AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code
> that launches init to use set_fs so that pages coming from the kernel
> look like they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument
> copying from userspace needs to happen earlier.  Factor bprm_execve
> out of do_execve_common to separate out the copying of arguments
> to the newe stack, and the rest of exec.
> 
> In separating bprm_execve from do_execve_common the copying
> of the arguments onto the new stack happens earlier.
> 
> As the copying of the arguments does not depend any security hooks,
> files, the file table, current->in_execve, current->fs->in_exec,
> bprm->unsafe, or creds this is safe.
> 
> Likewise the security hook security_creds_for_exec does not depend upon
> preventing the argument copying from happening.
> 
> In addition to making it possible to implement kernel_execve that
> performs the copying differently, this separation of bprm_execve from
> do_execve_common makes for a nice separation of responsibilities making
> the exec code easier to navigate.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h
  2020-07-14 13:28 ` [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h Eric W. Biederman
@ 2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:28   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:28:42AM -0500, Eric W. Biederman wrote:
> 
> The general convention in the linux kernel is to define a pointer
> member as "type *name".  The declaration of struct linux_binprm has
> several pointer defined as "type * name".  Update them to the
> form of "type *name" for consistency.
> 
> Suggested-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] exec: Factor out alloc_bprm
  2020-07-14 13:29 ` [PATCH 2/7] exec: Factor out alloc_bprm Eric W. Biederman
@ 2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:30   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:29:05AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code
> that launches init to use set_fs so that pages coming from the kernel
> look like they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument
> copying from userspace needs to happen earlier.  Move the allocation
> of the bprm into it's own function (alloc_bprm) and move the call of
> alloc_bprm before unshare_files so that bprm can ultimately be
> allocated, the arguments can be placed on the new stack, and then the
> bprm can be passed into the core of exec.
> 
> Neither the allocation of struct binprm nor the unsharing depend upon each
> other so swapping the order in which they are called is trivially safe.
> 
> To keep things consistent the order of cleanup at the end of
> do_execve_common swapped to match the order of initialization.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm
  2020-07-14 13:29 ` [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm Eric W. Biederman
@ 2020-07-14 21:38   ` Kees Cook
  2020-07-15  6:34   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:29:36AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code
> that launches init to use set_fs so that pages coming from the kernel
> look like they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument
> copying from userspace needs to happen earlier.  Move the computation
> of bprm->filename and possible allocation of a name in the case
> of execveat into alloc_bprm to make that possible.
> 
> The exectuable name, the arguments, and the environment are
> copied into the new usermode stack which is stored in bprm
> until exec passes the point of no return.
> 
> As the executable name is copied first onto the usermode stack
> it needs to be known.  As there are no dependencies to computing
> the executable name, compute it early in alloc_bprm.
> 
> As an implementation detail if the filename needs to be generated
> because it embeds a file descriptor store that filename in a new field
> bprm->fdpath, and free it in free_bprm.  Previously this was done in
> an independent variable pathbuf.  I have renamed pathbuf fdpath
> because fdpath is more suggestive of what kind of path is in the
> variable.  I moved fdpath into struct linux_binprm because it is
> tightly tied to the other variables in struct linux_binprm, and as
> such is needed to allow the call alloc_binprm to move.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages
  2020-07-14 13:31 ` [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages Eric W. Biederman
@ 2020-07-14 21:41   ` Kees Cook
  2020-07-15  6:38   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:31:03AM -0500, Eric W. Biederman wrote:
> 
> In preparation for implementiong kernel_execve (which will take kernel
> pointers not userspace pointers) factor out bprm_stack_limits out of
> prepare_arg_pages.  This separates the counting which depends upon the
> getting data from userspace from the calculations of the stack limits
> which is usable in kernel_execve.
> 
> The remove prepare_args_pages and compute bprm->argc and bprm->envc
> directly in do_execveat_common, before bprm_stack_limits is called.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-14 13:31 ` [PATCH 7/7] exec: Implement kernel_execve Eric W. Biederman
@ 2020-07-14 21:49   ` Kees Cook
  2020-07-15  6:42     ` Christoph Hellwig
  2020-07-15  6:42   ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Kees Cook @ 2020-07-14 21:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Andy Lutomirski, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> +static int count_strings_kernel(const char *const *argv)
> +{
> +	int i;
> +
> +	if (!argv)
> +		return 0;
> +
> +	for (i = 0; argv[i]; ++i) {
> +		if (i >= MAX_ARG_STRINGS)
> +			return -E2BIG;
> +		if (fatal_signal_pending(current))
> +			return -ERESTARTNOHAND;
> +		cond_resched();
> +	}
> +	return i;
> +}

I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
refactor that too? (And maybe rename it to count_strings_user()?)

Otherwise, looks good:

Reviewed-by: Kees Cook <keescook@chromium.org>

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h
  2020-07-14 13:28 ` [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
@ 2020-07-15  6:28   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:28:42AM -0500, Eric W. Biederman wrote:
> 
> The general convention in the linux kernel is to define a pointer
> member as "type *name".  The declaration of struct linux_binprm has
> several pointer defined as "type * name".  Update them to the
> form of "type *name" for consistency.
> 
> Suggested-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/7] exec: Factor out alloc_bprm
  2020-07-14 13:29 ` [PATCH 2/7] exec: Factor out alloc_bprm Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
@ 2020-07-15  6:30   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:29:05AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code
> that launches init to use set_fs so that pages coming from the kernel
> look like they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument
> copying from userspace needs to happen earlier.  Move the allocation
> of the bprm into it's own function (alloc_bprm) and move the call of
> alloc_bprm before unshare_files so that bprm can ultimately be
> allocated, the arguments can be placed on the new stack, and then the
> bprm can be passed into the core of exec.
> 
> Neither the allocation of struct binprm nor the unsharing depend upon each
> other so swapping the order in which they are called is trivially safe.
> 
> To keep things consistent the order of cleanup at the end of
> do_execve_common swapped to match the order of initialization.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 23dfbb820626..526156d6461d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1560,6 +1560,14 @@ static void free_bprm(struct linux_binprm *bprm)
>  	kfree(bprm);
>  }
>  
> +static struct linux_binprm *alloc_bprm(void)
> +{
> +	struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
> +	if (!bprm)
> +		return ERR_PTR(-ENOMEM);
> +	return bprm;

Unless this helper grows later I really don't see the point of it.
Also a NULL return vs ERR_PTR would simplify this a bit (again unless
this grows more code later with different return codes, but then again
it might make sense to only add the helper once it becomes useful).

The actual allocation order move looks fine, though.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm
  2020-07-14 13:29 ` [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
@ 2020-07-15  6:34   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:29:36AM -0500, Eric W. Biederman wrote:
>  
> -static struct linux_binprm *alloc_bprm(void)
> +static struct linux_binprm *alloc_bprm(int fd, struct filename *filename)
>  {
>  	struct linux_binprm *bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
> +	int retval = -ENOMEM;
>  	if (!bprm)
> -		return ERR_PTR(-ENOMEM);
> +		goto out;
> +

Ok, so here we add to it.   Two nitpicks:

The abose is missing a blank line after the declarations, which really
makes things more readable.  Also no need for the goto there.

Otherwisel ooks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/7] exec: Move bprm_mm_init into alloc_bprm
  2020-07-14 13:30 ` [PATCH 4/7] exec: Move bprm_mm_init " Eric W. Biederman
  2020-07-14 21:37   ` Kees Cook
@ 2020-07-15  6:35   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:30:02AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code that
> launches init to use set_fs so that pages coming from the kernel look like
> they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument copying
> from userspace needs to happen earlier.  Move the allocation and
> initialization of bprm->mm into alloc_bprm so that the bprm->mm is
> available early to store the new user stack into.  This is a prerequisite
> for copying argv and envp into the new user stack early before ther rest of
> exec.
> 
> To keep the things consistent the cleanup of bprm->mm is moved into
> free_bprm.  So that bprm->mm will be cleaned up whenever bprm->mm is
> allocated and free_bprm are called.
> 
> Moving bprm_mm_init earlier is safe as it does not depend on any files,
> current->in_execve, current->fs->in_exec, bprm->unsafe, or the if the file
> table is shared. (AKA bprm_mm_init does not depend on any of the code that
> happens between alloc_bprm and where it was previously called.)
> 
> This moves bprm->mm cleanup after current->fs->in_exec is set to 0.  This
> is safe because current->fs->in_exec is only used to preventy taking an
> additional reference on the fs_struct.
> 
> This moves bprm->mm cleanup after current->in_execve is set to 0.  This is
> safe because current->in_execve is only used by the lsms (apparmor and
> tomoyou) and always for LSM specific functions, never for anything to do
> with the mm.
> 
> This adds bprm->mm cleanup into the successful return path.  This is safe
> because being on the successful return path implies that begin_new_exec
> succeeded and set brpm->mm to NULL.  As bprm->mm is NULL bprm cleanup I am
> moving into free_bprm will do nothing.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common
  2020-07-14 13:30 ` [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common Eric W. Biederman
  2020-07-14 21:38   ` Kees Cook
@ 2020-07-15  6:36   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:30:30AM -0500, Eric W. Biederman wrote:
> 
> Currently it is necessary for the usermode helper code and the code
> that launches init to use set_fs so that pages coming from the kernel
> look like they are coming from userspace.
> 
> To allow that usage of set_fs to be removed cleanly the argument
> copying from userspace needs to happen earlier.  Factor bprm_execve
> out of do_execve_common to separate out the copying of arguments
> to the newe stack, and the rest of exec.
> 
> In separating bprm_execve from do_execve_common the copying
> of the arguments onto the new stack happens earlier.
> 
> As the copying of the arguments does not depend any security hooks,
> files, the file table, current->in_execve, current->fs->in_exec,
> bprm->unsafe, or creds this is safe.
> 
> Likewise the security hook security_creds_for_exec does not depend upon
> preventing the argument copying from happening.
> 
> In addition to making it possible to implement kernel_execve that
> performs the copying differently, this separation of bprm_execve from
> do_execve_common makes for a nice separation of responsibilities making
> the exec code easier to navigate.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/exec.c | 108 +++++++++++++++++++++++++++++-------------------------
>  1 file changed, 58 insertions(+), 50 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index afb168bf5e23..50508892fa71 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1856,44 +1856,16 @@ static int exec_binprm(struct linux_binprm *bprm)
>  /*
>   * sys_execve() executes a new program.
>   */
> -static int do_execveat_common(int fd, struct filename *filename,
> -			      struct user_arg_ptr argv,
> -			      struct user_arg_ptr envp,
> -			      int flags)
> +static int bprm_execve(struct linux_binprm *bprm,
> +		       int fd, struct filename *filename, int flags)

int fd easily fits onto the previous line.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages
  2020-07-14 13:31 ` [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages Eric W. Biederman
  2020-07-14 21:41   ` Kees Cook
@ 2020-07-15  6:38   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 08:31:03AM -0500, Eric W. Biederman wrote:
> 
> In preparation for implementiong kernel_execve (which will take kernel
> pointers not userspace pointers) factor out bprm_stack_limits out of
> prepare_arg_pages.  This separates the counting which depends upon the
> getting data from userspace from the calculations of the stack limits
> which is usable in kernel_execve.
> 
> The remove prepare_args_pages and compute bprm->argc and bprm->envc
> directly in do_execveat_common, before bprm_stack_limits is called.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---

This looks basically identical to my "exec: split prepare_arg_pages".
I slightly prefer the version I had, but this looks ok as well:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-14 13:31 ` [PATCH 7/7] exec: Implement kernel_execve Eric W. Biederman
  2020-07-14 21:49   ` Kees Cook
@ 2020-07-15  6:42   ` Christoph Hellwig
  2020-07-15 18:23     ` Eric W. Biederman
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

> +static int count_strings_kernel(const char *const *argv)
> +{
> +	int i;
> +
> +	if (!argv)
> +		return 0;
> +
> +	for (i = 0; argv[i]; ++i) {
> +		if (i >= MAX_ARG_STRINGS)
> +			return -E2BIG;
> +		if (fatal_signal_pending(current))
> +			return -ERESTARTNOHAND;
> +		cond_resched();

I don't think we need a fatal_signal_pending and cond_resched() is
needed in each step given that we don't actually do anything.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-14 21:49   ` Kees Cook
@ 2020-07-15  6:42     ` Christoph Hellwig
  2020-07-15 14:55       ` David Laight
  2020-07-15 15:00       ` Kees Cook
  0 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15  6:42 UTC (permalink / raw)
  To: Kees Cook
  Cc: Eric W. Biederman, linux-kernel, Linus Torvalds, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen,
	Christoph Hellwig

On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > +static int count_strings_kernel(const char *const *argv)
> > +{
> > +	int i;
> > +
> > +	if (!argv)
> > +		return 0;
> > +
> > +	for (i = 0; argv[i]; ++i) {
> > +		if (i >= MAX_ARG_STRINGS)
> > +			return -E2BIG;
> > +		if (fatal_signal_pending(current))
> > +			return -ERESTARTNOHAND;
> > +		cond_resched();
> > +	}
> > +	return i;
> > +}
> 
> I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> refactor that too? (And maybe rename it to count_strings_user()?)

Liks this?

http://git.infradead.org/users/hch/misc.git/commitdiff/35a3129dab5b712b018c30681d15de42d9509731

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-15  6:42     ` Christoph Hellwig
@ 2020-07-15 14:55       ` David Laight
  2020-07-15 15:09         ` Kees Cook
  2020-07-15 15:00       ` Kees Cook
  1 sibling, 1 reply; 30+ messages in thread
From: David Laight @ 2020-07-15 14:55 UTC (permalink / raw)
  To: 'Christoph Hellwig', Kees Cook
  Cc: Eric W. Biederman, linux-kernel, Linus Torvalds, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen

From: Christoph Hellwig
> Sent: 15 July 2020 07:43
> Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> 
> On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > +static int count_strings_kernel(const char *const *argv)
> > > +{
> > > +	int i;
> > > +
> > > +	if (!argv)
> > > +		return 0;
> > > +
> > > +	for (i = 0; argv[i]; ++i) {
> > > +		if (i >= MAX_ARG_STRINGS)
> > > +			return -E2BIG;
> > > +		if (fatal_signal_pending(current))
> > > +			return -ERESTARTNOHAND;
> > > +		cond_resched();
> > > +	}
> > > +	return i;
> > > +}
> >
> > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > refactor that too? (And maybe rename it to count_strings_user()?)

Thinks....
If you setup env[] and argv[] on the new user stack early in exec processing
then you may not need any limits at all - except the size of the user stack.
Even the get_user() loop will hit an invalid address before the counter
wraps (provided it is unsigned long).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-15  6:42     ` Christoph Hellwig
  2020-07-15 14:55       ` David Laight
@ 2020-07-15 15:00       ` Kees Cook
  2020-07-15 18:20         ` Christoph Hellwig
  1 sibling, 1 reply; 30+ messages in thread
From: Kees Cook @ 2020-07-15 15:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric W. Biederman, linux-kernel, Linus Torvalds, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen

On Wed, Jul 15, 2020 at 07:42:48AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > +static int count_strings_kernel(const char *const *argv)
> > > +{
> > > +	int i;
> > > +
> > > +	if (!argv)
> > > +		return 0;
> > > +
> > > +	for (i = 0; argv[i]; ++i) {
> > > +		if (i >= MAX_ARG_STRINGS)
> > > +			return -E2BIG;
> > > +		if (fatal_signal_pending(current))
> > > +			return -ERESTARTNOHAND;
> > > +		cond_resched();
> > > +	}
> > > +	return i;
> > > +}
> > 
> > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > refactor that too? (And maybe rename it to count_strings_user()?)
> 
> Liks this?
> 
> http://git.infradead.org/users/hch/misc.git/commitdiff/35a3129dab5b712b018c30681d15de42d9509731

Heh, yes please. :) (Which branch is this from? Are yours and Eric's
tree going to collide?)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-15 14:55       ` David Laight
@ 2020-07-15 15:09         ` Kees Cook
  2020-07-15 16:46           ` David Laight
  0 siblings, 1 reply; 30+ messages in thread
From: Kees Cook @ 2020-07-15 15:09 UTC (permalink / raw)
  To: David Laight
  Cc: 'Christoph Hellwig',
	Eric W. Biederman, linux-kernel, Linus Torvalds, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen

On Wed, Jul 15, 2020 at 02:55:50PM +0000, David Laight wrote:
> From: Christoph Hellwig
> > Sent: 15 July 2020 07:43
> > Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> > 
> > On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > > +static int count_strings_kernel(const char *const *argv)
> > > > +{
> > > > +	int i;
> > > > +
> > > > +	if (!argv)
> > > > +		return 0;
> > > > +
> > > > +	for (i = 0; argv[i]; ++i) {
> > > > +		if (i >= MAX_ARG_STRINGS)
> > > > +			return -E2BIG;
> > > > +		if (fatal_signal_pending(current))
> > > > +			return -ERESTARTNOHAND;
> > > > +		cond_resched();
> > > > +	}
> > > > +	return i;
> > > > +}
> > >
> > > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > > refactor that too? (And maybe rename it to count_strings_user()?)
> 
> Thinks....
> If you setup env[] and argv[] on the new user stack early in exec processing
> then you may not need any limits at all - except the size of the user stack.
> Even the get_user() loop will hit an invalid address before the counter
> wraps (provided it is unsigned long).

*grumpy noises* Yes, but not in practice (if I'm understanding what you
mean). The expectations of a number of execution environments can be
really odd-ball. I've tried to collect the notes from over the years in
prepare_arg_pages()'s comments, and it mostly boils down to "there has
to be enough room for the exec to start" otherwise the exec ends up in a
hard-to-debug failure state (i.e. past the "point of no return", where you
get no useful information about the cause of the SEGV). So the point has
been to move as many of the setup checks as early as possible and report
about them if they fail. The argv processing is already very early, but
it needs to do the limit checks otherwise it'll just break after the exec
is underway and the process will just SEGV. (And ... some environments
will attempt to dynamically check the size of the argv space by growing
until it sees E2BIG, so we can't just remove it and let those hit SEGV.)

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-15 15:09         ` Kees Cook
@ 2020-07-15 16:46           ` David Laight
  0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2020-07-15 16:46 UTC (permalink / raw)
  To: 'Kees Cook'
  Cc: 'Christoph Hellwig',
	Eric W. Biederman, linux-kernel, Linus Torvalds, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen

From: Kees Cook <keescook@chromium.org>
> Sent: 15 July 2020 16:09
> 
> On Wed, Jul 15, 2020 at 02:55:50PM +0000, David Laight wrote:
> > From: Christoph Hellwig
> > > Sent: 15 July 2020 07:43
> > > Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> > >
> > > On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > > > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > > > +static int count_strings_kernel(const char *const *argv)
> > > > > +{
> > > > > +	int i;
> > > > > +
> > > > > +	if (!argv)
> > > > > +		return 0;
> > > > > +
> > > > > +	for (i = 0; argv[i]; ++i) {
> > > > > +		if (i >= MAX_ARG_STRINGS)
> > > > > +			return -E2BIG;
> > > > > +		if (fatal_signal_pending(current))
> > > > > +			return -ERESTARTNOHAND;
> > > > > +		cond_resched();
> > > > > +	}
> > > > > +	return i;
> > > > > +}
> > > >
> > > > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > > > refactor that too? (And maybe rename it to count_strings_user()?)
> >
> > Thinks....
> > If you setup env[] and argv[] on the new user stack early in exec processing
> > then you may not need any limits at all - except the size of the user stack.
> > Even the get_user() loop will hit an invalid address before the counter
> > wraps (provided it is unsigned long).
> 
> *grumpy noises* Yes, but not in practice (if I'm understanding what you
> mean). The expectations of a number of execution environments can be
> really odd-ball. I've tried to collect the notes from over the years in
> prepare_arg_pages()'s comments, and it mostly boils down to "there has
> to be enough room for the exec to start" otherwise the exec ends up in a
> hard-to-debug failure state (i.e. past the "point of no return", where you
> get no useful information about the cause of the SEGV). So the point has
> been to move as many of the setup checks as early as possible and report
> about them if they fail. The argv processing is already very early, but
> it needs to do the limit checks otherwise it'll just break after the exec
> is underway and the process will just SEGV. (And ... some environments
> will attempt to dynamically check the size of the argv space by growing
> until it sees E2BIG, so we can't just remove it and let those hit SEGV.)

Yes - I bet the code is horrid.
I guess the real problem is that you'd need access to the old process's
user addresses and the new process's stack area at the same time.
Unless there is a suitable hole in the old process's address map
any attempted trick will fall foul of cache aliasing on some
architectures - like anything else that does page-loaning.

I'm sure there are hair-brained schemes that might work.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-15 15:00       ` Kees Cook
@ 2020-07-15 18:20         ` Christoph Hellwig
  0 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2020-07-15 18:20 UTC (permalink / raw)
  To: Kees Cook
  Cc: Christoph Hellwig, Eric W. Biederman, linux-kernel,
	Linus Torvalds, Andy Lutomirski, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Al Viro, Luis Chamberlain,
	linux-fsdevel, Tetsuo Handa, linux-security-module,
	Serge E. Hallyn, James Morris, Kentaro Takeda, Casey Schaufler,
	John Johansen

On Wed, Jul 15, 2020 at 08:00:16AM -0700, Kees Cook wrote:
> Heh, yes please. :) (Which branch is this from?

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/exec-cleanup

> Are yours and Eric's
> tree going to collide?)

Yes, badly.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/7] exec: Implement kernel_execve
  2020-07-15  6:42   ` Christoph Hellwig
@ 2020-07-15 18:23     ` Eric W. Biederman
  0 siblings, 0 replies; 30+ messages in thread
From: Eric W. Biederman @ 2020-07-15 18:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Linus Torvalds, Kees Cook, Andy Lutomirski,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
	linux-security-module, Serge E. Hallyn, James Morris,
	Kentaro Takeda, Casey Schaufler, John Johansen

Christoph Hellwig <hch@infradead.org> writes:

>> +static int count_strings_kernel(const char *const *argv)
>> +{
>> +	int i;
>> +
>> +	if (!argv)
>> +		return 0;
>> +
>> +	for (i = 0; argv[i]; ++i) {
>> +		if (i >= MAX_ARG_STRINGS)
>> +			return -E2BIG;
>> +		if (fatal_signal_pending(current))
>> +			return -ERESTARTNOHAND;
>> +		cond_resched();
>
> I don't think we need a fatal_signal_pending and cond_resched() is
> needed in each step given that we don't actually do anything.

If we have a MAX_ARG_STRINGS sized argv passed in, that is 2^31
iterations of the loop.  A processor at 2Ghz performs roughly 2^31
cycles per second.  So this loop has the potential to run for an entire
second.  That is long enough to need fatal_signal_pending() and
cond_resched checks.

In practice I don't think we have any argv arrays anywhere near that big
passed in from the kernel.  However removing the logic that accounts for
long running loops is best handled as a separate change so that people
will analyze the patch based on that criterian, and so that in the
highly unlikely even something goes wrong people have a nice commit
to bisect things to.

Eric

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-07-15 18:26 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-14 13:27 [PATCH 0/7] Implementing kernel_execve Eric W. Biederman
2020-07-14 13:28 ` [PATCH 1/7] exec: Remove unnecessary spaces from binfmts.h Eric W. Biederman
2020-07-14 21:38   ` Kees Cook
2020-07-15  6:28   ` Christoph Hellwig
2020-07-14 13:29 ` [PATCH 2/7] exec: Factor out alloc_bprm Eric W. Biederman
2020-07-14 21:38   ` Kees Cook
2020-07-15  6:30   ` Christoph Hellwig
2020-07-14 13:29 ` [PATCH 3/7] exec: Move initialization of bprm->filename into alloc_bprm Eric W. Biederman
2020-07-14 21:38   ` Kees Cook
2020-07-15  6:34   ` Christoph Hellwig
2020-07-14 13:30 ` [PATCH 4/7] exec: Move bprm_mm_init " Eric W. Biederman
2020-07-14 21:37   ` Kees Cook
2020-07-15  6:35   ` Christoph Hellwig
2020-07-14 13:30 ` [PATCH 5/7] exec: Factor bprm_execve out of do_execve_common Eric W. Biederman
2020-07-14 21:38   ` Kees Cook
2020-07-15  6:36   ` Christoph Hellwig
2020-07-14 13:31 ` [PATCH 6/7] exec: Factor bprm_stack_limits out of prepare_arg_pages Eric W. Biederman
2020-07-14 21:41   ` Kees Cook
2020-07-15  6:38   ` Christoph Hellwig
2020-07-14 13:31 ` [PATCH 7/7] exec: Implement kernel_execve Eric W. Biederman
2020-07-14 21:49   ` Kees Cook
2020-07-15  6:42     ` Christoph Hellwig
2020-07-15 14:55       ` David Laight
2020-07-15 15:09         ` Kees Cook
2020-07-15 16:46           ` David Laight
2020-07-15 15:00       ` Kees Cook
2020-07-15 18:20         ` Christoph Hellwig
2020-07-15  6:42   ` Christoph Hellwig
2020-07-15 18:23     ` Eric W. Biederman
2020-07-14 15:32 ` [PATCH 0/7] Implementing kernel_execve Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).