* [PATCHv10 0/5] syscalls,x86,sparc: Add execveat() system call
@ 2014-11-24 11:53 ` David Drysdale
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

This patch set adds execveat(2) for x86 and sparc, and is derived from
Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem, at least for executables (rather than scripts).  The
current glibc version of fexecve(3) is implemented via /proc, which
causes problems in sandboxed or otherwise restricted environments.
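
As an illustration of that aim (a minimal sketch, not part of this
series), a /proc-free fexecve() could be layered on the new syscall
roughly as below.  The helper names fexecve_noproc() and run_by_fd()
are hypothetical, the raw syscall(2) interface is used on the
assumption that the C library has no wrapper, and SYS_execveat is
assumed to come from the installed kernel headers.

```c
#define _GNU_SOURCE
#include <fcntl.h>       /* open(), O_RDONLY, AT_EMPTY_PATH */
#include <sys/syscall.h> /* SYS_execveat (assumed present in the headers) */
#include <sys/types.h>
#include <sys/wait.h>    /* waitpid(), WIFEXITED(), WEXITSTATUS() */
#include <unistd.h>      /* syscall(), fork(), close(), _exit() */

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* kernel uapi value; fallback if libc hides it */
#endif

/* Hypothetical helper: exec the file behind fd without touching /proc. */
static int fexecve_noproc(int fd, char *const argv[], char *const envp[])
{
	/* Empty pathname plus AT_EMPTY_PATH asks the kernel to exec fd itself. */
	return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}

/*
 * Hypothetical demo: open path, exec it by descriptor in a child and
 * return the child's exit status, or -1 on any failure in the parent.
 */
static int run_by_fd(const char *path, char *const argv[])
{
	char *const envp[] = { NULL };
	int status;
	int fd = open(path, O_RDONLY);
	pid_t pid;

	if (fd < 0)
		return -1;
	pid = fork();
	if (pid == 0) {
		fexecve_noproc(fd, argv, envp);
		_exit(127); /* only reached if the exec failed */
	}
	close(fd);
	if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With argv = { "sh", "-c", "exit 7", NULL }, run_by_fd("/bin/sh", argv)
should return 7 on a kernel that implements execveat(); the fd is
deliberately opened without O_CLOEXEC so that it survives the exec.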

Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.

Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns.  The current implementation just
defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other
flags could be added in future -- for example, flags for new namespaces
(as suggested at https://lkml.org/lkml/2006/7/11/474).
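
The reason new flags can be added later is that unknown bits are
rejected today.  A user-space mirror of that check, as a sketch only
(check_execveat_flags() is a hypothetical name; the mask matches the
test in do_open_execat() in patch 1):

```c
#define _GNU_SOURCE
#include <errno.h> /* EINVAL */
#include <fcntl.h> /* AT_SYMLINK_NOFOLLOW, AT_EMPTY_PATH */

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000 /* kernel uapi value; fallback if libc hides it */
#endif

/*
 * Hypothetical mirror of the kernel-side check: any bit outside the two
 * flags defined so far yields -EINVAL, so a future flag cannot be
 * silently ignored by a kernel that does not understand it.
 */
static int check_execveat_flags(int flags)
{
	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
		return -EINVAL;
	return 0;
}
```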

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.


Changes since v9:
 - Add sparc syscall wrappers and use correct sparc 32b compatibility
   function [Stephen Rothwell, David S. Miller]

Changes since v8:
 - Split core/fs changes from x86 changes [Thomas Gleixner]

Changes since v7:
 - Speculatively wire up sparc version of syscall (untested)
 - Fix leak of pathbuf in mainline arm [Oleg Nesterov]
 - Add rcu_dereference_raw() on fdt access [sparse kbuild robot]
 - Realigned comment [Andrew Morton]
 - Merged up to v3.18-rc4

Changes since v6:
 - Remove special case for O_PATH file descriptors [Andy Lutomirski]
 - Use kasprintf rather than error-prone arithmetic [Kees Cook]
 - Add test for long name [Kees Cook]
 - Add test for non-executable O_PATH fd [Andy Lutomirski]

Changes since v5:
 - Set new flag in bprm->interp_flags for O_CLOEXEC fds, so that binfmts
   that invoke an interpreter fail the exec (as they will not be able
   to access the invoked file). [Andy Lutomirski]
 - Don't truncate long paths. [Andy Lutomirski]
 - Commonize code to open the executed file. [Eric W. Biederman]
 - Mark O_PATH file descriptors so they cannot be fexecve()ed.
 - Make self-test more helpful, and add additional cases:
     - file offset non-zero
     - binary file without execute bit
     - O_CLOEXEC fds

Changes since v4, suggested by Eric W. Biederman:
 - Use empty filename with AT_EMPTY_PATH flag rather than NULL
   pathname to request fexecve-like behaviour.
 - Build pathname as "/dev/fd/<fd>/<filename>" (or "/dev/fd/<fd>")
   rather than using d_path().
 - Patch against v3.17 (bfe01a5ba249)

Changes since Meredydd's v3 patch:
 - Added a selftest.
 - Added a man page.
 - Left open_exec() signature untouched to reduce patch impact
   elsewhere (as suggested by Al Viro).
 - Filled in bprm->filename with d_path() into a buffer, to avoid use
   of potentially-ephemeral dentry->d_name.
 - Patch against v3.14 (455c6fdbd21916).


David Drysdale (4):
  syscalls: implement execveat() system call
  x86: Hook up execveat system call.
  syscalls: add selftest for execveat(2)
  sparc: Hook up execveat system call.

 arch/sparc/include/uapi/asm/unistd.h    |   3 +-
 arch/sparc/kernel/systbls_32.S          |   1 +
 arch/sparc/kernel/systbls_64.S          |   2 +
 arch/x86/ia32/audit.c                   |   1 +
 arch/x86/ia32/ia32entry.S               |   1 +
 arch/x86/kernel/audit_64.c              |   1 +
 arch/x86/kernel/entry_64.S              |  28 +++
 arch/x86/syscalls/syscall_32.tbl        |   1 +
 arch/x86/syscalls/syscall_64.tbl        |   2 +
 arch/x86/um/sys_call_table_64.c         |   1 +
 fs/binfmt_em86.c                        |   4 +
 fs/binfmt_misc.c                        |   4 +
 fs/binfmt_script.c                      |  10 +
 fs/exec.c                               | 113 +++++++--
 fs/namei.c                              |   2 +-
 include/linux/binfmts.h                 |   4 +
 include/linux/compat.h                  |   3 +
 include/linux/fs.h                      |   1 +
 include/linux/sched.h                   |   4 +
 include/linux/syscalls.h                |   5 +
 include/uapi/asm-generic/unistd.h       |   4 +-
 kernel/sys_ni.c                         |   3 +
 lib/audit.c                             |   3 +
 tools/testing/selftests/Makefile        |   1 +
 tools/testing/selftests/exec/.gitignore |   9 +
 tools/testing/selftests/exec/Makefile   |  25 ++
 tools/testing/selftests/exec/execveat.c | 397 ++++++++++++++++++++++++++++++++
 27 files changed, 617 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/exec/.gitignore
 create mode 100644 tools/testing/selftests/exec/Makefile
 create mode 100644 tools/testing/selftests/exec/execveat.c

-- 
2.1.0.rc2.206.gedb03e5

* [PATCHv10 1/5] syscalls: implement execveat() system call
  2014-11-24 11:53 ` David Drysdale
@ 2014-11-24 11:53   ` David Drysdale
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Add a new execveat(2) system call. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.
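
A caller-side sketch of the directory-relative form follows.  It is
illustrative only: run_relative() is a hypothetical helper, and the raw
syscall(2) interface is used on the assumption that no libc wrapper is
available.

```c
#define _GNU_SOURCE
#include <fcntl.h>       /* open(), O_RDONLY, O_DIRECTORY */
#include <sys/syscall.h> /* SYS_execveat (assumed present in the headers) */
#include <sys/types.h>
#include <sys/wait.h>    /* waitpid(), WIFEXITED(), WEXITSTATUS() */
#include <unistd.h>      /* syscall(), fork(), close(), _exit() */

/*
 * Hypothetical demo: resolve name relative to the directory dir, exec it
 * in a child and return the child's exit status, or -1 on failure in the
 * parent.  This parallels openat(dfd, name, ...).
 */
static int run_relative(const char *dir, const char *name, char *const argv[])
{
	char *const envp[] = { NULL };
	int status;
	int dfd = open(dir, O_RDONLY | O_DIRECTORY);
	pid_t pid;

	if (dfd < 0)
		return -1;
	pid = fork();
	if (pid == 0) {
		/* flags == 0: follow symlinks, non-empty relative pathname */
		syscall(SYS_execveat, dfd, name, argv, envp, 0);
		_exit(127); /* only reached if the exec failed */
	}
	close(dfd);
	if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

For instance, run_relative("/bin", "sh", argv) with
argv = { "sh", "-c", "exit 7", NULL } should return 7.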

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in
other UNIXen, but in Linux glibc it depends on opening
"/proc/self/fd/<fd>" (and so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also,
script execution via an O_CLOEXEC file descriptor fails (as the file
will not be accessible after exec).

Based on patches by Meredydd Luff <meredydd@senatehouse.org>

Signed-off-by: David Drysdale <drysdale@google.com>
---
 fs/binfmt_em86.c                  |   4 ++
 fs/binfmt_misc.c                  |   4 ++
 fs/binfmt_script.c                |  10 ++++
 fs/exec.c                         | 113 +++++++++++++++++++++++++++++++++-----
 fs/namei.c                        |   2 +-
 include/linux/binfmts.h           |   4 ++
 include/linux/compat.h            |   3 +
 include/linux/fs.h                |   1 +
 include/linux/sched.h             |   4 ++
 include/linux/syscalls.h          |   5 ++
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   3 +
 lib/audit.c                       |   3 +
 13 files changed, 145 insertions(+), 15 deletions(-)

diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index f37b08cea1f7..490538536cb4 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -42,6 +42,10 @@ static int load_em86(struct linux_binprm *bprm)
 			return -ENOEXEC;
 	}
 
+	/* Need to be able to load the file after exec */
+	if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
+		return -ENOENT;
+
 	allow_write_access(bprm->file);
 	fput(bprm->file);
 	bprm->file = NULL;
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index fd8beb9657a2..85acb8c83a9a 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -142,6 +142,10 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	if (!fmt)
 		goto _ret;
 
+	/* Need to be able to load the file after exec */
+	if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
+		return -ENOENT;
+
 	if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) {
 		retval = remove_arg_zero(bprm);
 		if (retval)
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index 5027a3e14922..afdf4e3cafc2 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -24,6 +24,16 @@ static int load_script(struct linux_binprm *bprm)
 
 	if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
 		return -ENOEXEC;
+
+	/*
+	 * If the script filename will be inaccessible after exec, typically
+	 * because it is a "/dev/fd/<fd>/.." path against an O_CLOEXEC fd, give
+	 * up now (on the assumption that the interpreter will want to load
+	 * this file).
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
+		return -ENOENT;
+
 	/*
 	 * This section does the #! interpretation.
 	 * Sorta complicated, but hopefully it will work.  -TYT
diff --git a/fs/exec.c b/fs/exec.c
index 7302b75a9820..6ce5cc47a201 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -747,18 +747,25 @@ EXPORT_SYMBOL(setup_arg_pages);
 
 #endif /* CONFIG_MMU */
 
-static struct file *do_open_exec(struct filename *name)
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
 	int err;
-	static const struct open_flags open_exec_flags = {
+	struct open_flags open_exec_flags = {
 		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
 		.acc_mode = MAY_EXEC | MAY_OPEN,
 		.intent = LOOKUP_OPEN,
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
 
-	file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+		return ERR_PTR(-EINVAL);
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+
+	file = do_filp_open(fd, name, &open_exec_flags);
 	if (IS_ERR(file))
 		goto out;
 
@@ -769,12 +776,13 @@ static struct file *do_open_exec(struct filename *name)
 	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
 		goto exit;
 
-	fsnotify_open(file);
-
 	err = deny_write_access(file);
 	if (err)
 		goto exit;
 
+	if (name->name[0] != '\0')
+		fsnotify_open(file);
+
 out:
 	return file;
 
@@ -786,7 +794,7 @@ exit:
 struct file *open_exec(const char *name)
 {
 	struct filename tmp = { .name = name };
-	return do_open_exec(&tmp);
+	return do_open_execat(AT_FDCWD, &tmp, 0);
 }
 EXPORT_SYMBOL(open_exec);
 
@@ -1427,10 +1435,12 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execve_common(struct filename *filename,
-				struct user_arg_ptr argv,
-				struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+			      struct user_arg_ptr argv,
+			      struct user_arg_ptr envp,
+			      int flags)
 {
+	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
 	struct file *file;
 	struct files_struct *displaced;
@@ -1471,7 +1481,7 @@ static int do_execve_common(struct filename *filename,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_exec(filename);
+	file = do_open_execat(fd, filename, flags);
 	retval = PTR_ERR(file);
 	if (IS_ERR(file))
 		goto out_unmark;
@@ -1479,7 +1489,28 @@ static int do_execve_common(struct filename *filename,
 	sched_exec();
 
 	bprm->file = file;
-	bprm->filename = bprm->interp = filename->name;
+	if (fd == AT_FDCWD || filename->name[0] == '/') {
+		bprm->filename = filename->name;
+	} else {
+		if (filename->name[0] == '\0')
+			pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
+		else
+			pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
+					    fd, filename->name);
+		if (!pathbuf) {
+			retval = -ENOMEM;
+			goto out_unmark;
+		}
+		/*
+		 * Record that a name derived from an O_CLOEXEC fd will be
+		 * inaccessible after exec. Relies on having exclusive access to
+		 * current->files (due to unshare_files above).
+		 */
+		if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
+			bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
+		bprm->filename = pathbuf;
+	}
+	bprm->interp = bprm->filename;
 
 	retval = bprm_mm_init(bprm);
 	if (retval)
@@ -1520,6 +1551,7 @@ static int do_execve_common(struct filename *filename,
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
+	kfree(pathbuf);
 	putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
@@ -1537,6 +1569,7 @@ out_unmark:
 
 out_free:
 	free_bprm(bprm);
+	kfree(pathbuf);
 
 out_files:
 	if (displaced)
@@ -1552,7 +1585,18 @@ int do_execve(struct filename *filename,
 {
 	struct user_arg_ptr argv = { .ptr.native = __argv };
 	struct user_arg_ptr envp = { .ptr.native = __envp };
-	return do_execve_common(filename, argv, envp);
+	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+		const char __user *const __user *__argv,
+		const char __user *const __user *__envp,
+		int flags)
+{
+	struct user_arg_ptr argv = { .ptr.native = __argv };
+	struct user_arg_ptr envp = { .ptr.native = __envp };
+
+	return do_execveat_common(fd, filename, argv, envp, flags);
 }
 
 #ifdef CONFIG_COMPAT
@@ -1568,7 +1612,23 @@ static int compat_do_execve(struct filename *filename,
 		.is_compat = true,
 		.ptr.compat = __envp,
 	};
-	return do_execve_common(filename, argv, envp);
+	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+			      const compat_uptr_t __user *__argv,
+			      const compat_uptr_t __user *__envp,
+			      int flags)
+{
+	struct user_arg_ptr argv = {
+		.is_compat = true,
+		.ptr.compat = __argv,
+	};
+	struct user_arg_ptr envp = {
+		.is_compat = true,
+		.ptr.compat = __envp,
+	};
+	return do_execveat_common(fd, filename, argv, envp, flags);
 }
 #endif
 
@@ -1608,6 +1668,20 @@ SYSCALL_DEFINE3(execve,
 {
 	return do_execve(getname(filename), argv, envp);
 }
+
+SYSCALL_DEFINE5(execveat,
+		int, fd, const char __user *, filename,
+		const char __user *const __user *, argv,
+		const char __user *const __user *, envp,
+		int, flags)
+{
+	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+	return do_execveat(fd,
+			   getname_flags(filename, lookup_flags, NULL),
+			   argv, envp, flags);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
 	const compat_uptr_t __user *, argv,
@@ -1615,4 +1689,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
 {
 	return compat_do_execve(getname(filename), argv, envp);
 }
+
+COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
+		       const char __user *, filename,
+		       const compat_uptr_t __user *, argv,
+		       const compat_uptr_t __user *, envp,
+		       int,  flags)
+{
+	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+	return compat_do_execveat(fd,
+				  getname_flags(filename, lookup_flags, NULL),
+				  argv, envp, flags);
+}
 #endif
diff --git a/fs/namei.c b/fs/namei.c
index db5fe86319e6..ca814165d84c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -130,7 +130,7 @@ void final_putname(struct filename *name)
 
 #define EMBEDDED_NAME_MAX	(PATH_MAX - sizeof(struct filename))
 
-static struct filename *
+struct filename *
 getname_flags(const char __user *filename, int flags, int *empty)
 {
 	struct filename *result, *err;
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 61f29e5ea840..576e4639ca60 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -53,6 +53,10 @@ struct linux_binprm {
 #define BINPRM_FLAGS_EXECFD_BIT 1
 #define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT)
 
+/* filename of the binary will be inaccessible after exec */
+#define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2
+#define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT)
+
 /* Function parameter for binfmt->coredump */
 struct coredump_params {
 	const siginfo_t *siginfo;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e6494261eaff..7450ca2ac1fc 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
 
 asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
 		     const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+		     const compat_uptr_t __user *argv,
+		     const compat_uptr_t __user *envp, int flags);
 
 asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 		compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e8a63c..133b60b1d4d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2072,6 +2072,7 @@ extern int vfs_open(const struct path *, struct file *, const struct cred *);
 extern struct file * dentry_open(const struct path *, int, const struct cred *);
 extern int filp_close(struct file *, fl_owner_t id);
 
+extern struct filename *getname_flags(const char __user *, int, int *);
 extern struct filename *getname(const char __user *);
 extern struct filename *getname_kernel(const char *);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bbe63ec..344163d09efb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2441,6 +2441,10 @@ extern void do_group_exit(int);
 extern int do_execve(struct filename *,
 		     const char __user * const __user *,
 		     const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+		       const char __user * const __user *,
+		       const char __user * const __user *,
+		       int);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81357cc..1ff5a4d09693 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,4 +877,9 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
 			      unsigned int flags);
 asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+			const char __user *const __user *argv,
+			const char __user *const __user *envp, int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 22749c134117..e016bd9b1a04 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -707,9 +707,11 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
 #define __NR_bpf 280
 __SYSCALL(__NR_bpf, sys_bpf)
+#define __NR_execveat 281
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 
 #undef __NR_syscalls
-#define __NR_syscalls 281
+#define __NR_syscalls 282
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 02aa4185b17e..832fba6e2eb1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -224,3 +224,6 @@ cond_syscall(sys_seccomp);
 
 /* access BPF programs and maps */
 cond_syscall(sys_bpf);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 1d726a22565b..b8fb5ee81e26 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
 	case __NR_socketcall:
 		return 4;
 #endif
+#ifdef __NR_execveat
+	case __NR_execveat:
+#endif
 	case __NR_execve:
 		return 5;
 	default:
-- 
2.1.0.rc2.206.gedb03e5



* [PATCHv10 1/5] syscalls: implement execveat() system call
@ 2014-11-24 11:53   ` David Drysdale
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Add a new execveat(2) system call. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in
other UNIXen, but in Linux glibc it depends on opening
"/proc/self/fd/<fd>" (and so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also,
script execution via an O_CLOEXEC file descriptor fails (as the file
will not be accessible after exec).

Based on patches by Meredydd Luff <meredydd@senatehouse.org>

Signed-off-by: David Drysdale <drysdale@google.com>
---
 fs/binfmt_em86.c                  |   4 ++
 fs/binfmt_misc.c                  |   4 ++
 fs/binfmt_script.c                |  10 ++++
 fs/exec.c                         | 113 +++++++++++++++++++++++++++++++++-----
 fs/namei.c                        |   2 +-
 include/linux/binfmts.h           |   4 ++
 include/linux/compat.h            |   3 +
 include/linux/fs.h                |   1 +
 include/linux/sched.h             |   4 ++
 include/linux/syscalls.h          |   5 ++
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   3 +
 lib/audit.c                       |   3 +
 13 files changed, 145 insertions(+), 15 deletions(-)

diff --git a/fs/binfmt_em86.c b/fs/binfmt_em86.c
index f37b08cea1f7..490538536cb4 100644
--- a/fs/binfmt_em86.c
+++ b/fs/binfmt_em86.c
@@ -42,6 +42,10 @@ static int load_em86(struct linux_binprm *bprm)
 			return -ENOEXEC;
 	}
 
+	/* Need to be able to load the file after exec */
+	if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
+		return -ENOENT;
+
 	allow_write_access(bprm->file);
 	fput(bprm->file);
 	bprm->file = NULL;
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index fd8beb9657a2..85acb8c83a9a 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -142,6 +142,10 @@ static int load_misc_binary(struct linux_binprm *bprm)
 	if (!fmt)
 		goto _ret;
 
+	/* Need to be able to load the file after exec */
+	if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
+		return -ENOENT;
+
 	if (!(fmt->flags & MISC_FMT_PRESERVE_ARGV0)) {
 		retval = remove_arg_zero(bprm);
 		if (retval)
diff --git a/fs/binfmt_script.c b/fs/binfmt_script.c
index 5027a3e14922..afdf4e3cafc2 100644
--- a/fs/binfmt_script.c
+++ b/fs/binfmt_script.c
@@ -24,6 +24,16 @@ static int load_script(struct linux_binprm *bprm)
 
 	if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!'))
 		return -ENOEXEC;
+
+	/*
+	 * If the script filename will be inaccessible after exec, typically
+	 * because it is a "/dev/fd/<fd>/.." path against an O_CLOEXEC fd, give
+	 * up now (on the assumption that the interpreter will want to load
+	 * this file).
+	 */
+	if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
+		return -ENOENT;
+
 	/*
 	 * This section does the #! interpretation.
 	 * Sorta complicated, but hopefully it will work.  -TYT
diff --git a/fs/exec.c b/fs/exec.c
index 7302b75a9820..6ce5cc47a201 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -747,18 +747,25 @@ EXPORT_SYMBOL(setup_arg_pages);
 
 #endif /* CONFIG_MMU */
 
-static struct file *do_open_exec(struct filename *name)
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
 	int err;
-	static const struct open_flags open_exec_flags = {
+	struct open_flags open_exec_flags = {
 		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
 		.acc_mode = MAY_EXEC | MAY_OPEN,
 		.intent = LOOKUP_OPEN,
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
 
-	file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+		return ERR_PTR(-EINVAL);
+	if (flags & AT_SYMLINK_NOFOLLOW)
+		open_exec_flags.lookup_flags &= ~LOOKUP_FOLLOW;
+	if (flags & AT_EMPTY_PATH)
+		open_exec_flags.lookup_flags |= LOOKUP_EMPTY;
+
+	file = do_filp_open(fd, name, &open_exec_flags);
 	if (IS_ERR(file))
 		goto out;
 
@@ -769,12 +776,13 @@ static struct file *do_open_exec(struct filename *name)
 	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
 		goto exit;
 
-	fsnotify_open(file);
-
 	err = deny_write_access(file);
 	if (err)
 		goto exit;
 
+	if (name->name[0] != '\0')
+		fsnotify_open(file);
+
 out:
 	return file;
 
@@ -786,7 +794,7 @@ exit:
 struct file *open_exec(const char *name)
 {
 	struct filename tmp = { .name = name };
-	return do_open_exec(&tmp);
+	return do_open_execat(AT_FDCWD, &tmp, 0);
 }
 EXPORT_SYMBOL(open_exec);
 
@@ -1427,10 +1435,12 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execve_common(struct filename *filename,
-				struct user_arg_ptr argv,
-				struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+			      struct user_arg_ptr argv,
+			      struct user_arg_ptr envp,
+			      int flags)
 {
+	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
 	struct file *file;
 	struct files_struct *displaced;
@@ -1471,7 +1481,7 @@ static int do_execve_common(struct filename *filename,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_exec(filename);
+	file = do_open_execat(fd, filename, flags);
 	retval = PTR_ERR(file);
 	if (IS_ERR(file))
 		goto out_unmark;
@@ -1479,7 +1489,28 @@ static int do_execve_common(struct filename *filename,
 	sched_exec();
 
 	bprm->file = file;
-	bprm->filename = bprm->interp = filename->name;
+	if (fd == AT_FDCWD || filename->name[0] == '/') {
+		bprm->filename = filename->name;
+	} else {
+		if (filename->name[0] == '\0')
+			pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
+		else
+			pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
+					    fd, filename->name);
+		if (!pathbuf) {
+			retval = -ENOMEM;
+			goto out_unmark;
+		}
+		/*
+		 * Record that a name derived from an O_CLOEXEC fd will be
+		 * inaccessible after exec. Relies on having exclusive access to
+		 * current->files (due to unshare_files above).
+		 */
+		if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
+			bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
+		bprm->filename = pathbuf;
+	}
+	bprm->interp = bprm->filename;
 
 	retval = bprm_mm_init(bprm);
 	if (retval)
@@ -1520,6 +1551,7 @@ static int do_execve_common(struct filename *filename,
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
+	kfree(pathbuf);
 	putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
@@ -1537,6 +1569,7 @@ out_unmark:
 
 out_free:
 	free_bprm(bprm);
+	kfree(pathbuf);
 
 out_files:
 	if (displaced)
@@ -1552,7 +1585,18 @@ int do_execve(struct filename *filename,
 {
 	struct user_arg_ptr argv = { .ptr.native = __argv };
 	struct user_arg_ptr envp = { .ptr.native = __envp };
-	return do_execve_common(filename, argv, envp);
+	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+		const char __user *const __user *__argv,
+		const char __user *const __user *__envp,
+		int flags)
+{
+	struct user_arg_ptr argv = { .ptr.native = __argv };
+	struct user_arg_ptr envp = { .ptr.native = __envp };
+
+	return do_execveat_common(fd, filename, argv, envp, flags);
 }
 
 #ifdef CONFIG_COMPAT
@@ -1568,7 +1612,23 @@ static int compat_do_execve(struct filename *filename,
 		.is_compat = true,
 		.ptr.compat = __envp,
 	};
-	return do_execve_common(filename, argv, envp);
+	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+			      const compat_uptr_t __user *__argv,
+			      const compat_uptr_t __user *__envp,
+			      int flags)
+{
+	struct user_arg_ptr argv = {
+		.is_compat = true,
+		.ptr.compat = __argv,
+	};
+	struct user_arg_ptr envp = {
+		.is_compat = true,
+		.ptr.compat = __envp,
+	};
+	return do_execveat_common(fd, filename, argv, envp, flags);
 }
 #endif
 
@@ -1608,6 +1668,20 @@ SYSCALL_DEFINE3(execve,
 {
 	return do_execve(getname(filename), argv, envp);
 }
+
+SYSCALL_DEFINE5(execveat,
+		int, fd, const char __user *, filename,
+		const char __user *const __user *, argv,
+		const char __user *const __user *, envp,
+		int, flags)
+{
+	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+	return do_execveat(fd,
+			   getname_flags(filename, lookup_flags, NULL),
+			   argv, envp, flags);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
 	const compat_uptr_t __user *, argv,
@@ -1615,4 +1689,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
 {
 	return compat_do_execve(getname(filename), argv, envp);
 }
+
+COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
+		       const char __user *, filename,
+		       const compat_uptr_t __user *, argv,
+		       const compat_uptr_t __user *, envp,
+		       int,  flags)
+{
+	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+	return compat_do_execveat(fd,
+				  getname_flags(filename, lookup_flags, NULL),
+				  argv, envp, flags);
+}
 #endif
diff --git a/fs/namei.c b/fs/namei.c
index db5fe86319e6..ca814165d84c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -130,7 +130,7 @@ void final_putname(struct filename *name)
 
 #define EMBEDDED_NAME_MAX	(PATH_MAX - sizeof(struct filename))
 
-static struct filename *
+struct filename *
 getname_flags(const char __user *filename, int flags, int *empty)
 {
 	struct filename *result, *err;
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 61f29e5ea840..576e4639ca60 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -53,6 +53,10 @@ struct linux_binprm {
 #define BINPRM_FLAGS_EXECFD_BIT 1
 #define BINPRM_FLAGS_EXECFD (1 << BINPRM_FLAGS_EXECFD_BIT)
 
+/* filename of the binary will be inaccessible after exec */
+#define BINPRM_FLAGS_PATH_INACCESSIBLE_BIT 2
+#define BINPRM_FLAGS_PATH_INACCESSIBLE (1 << BINPRM_FLAGS_PATH_INACCESSIBLE_BIT)
+
 /* Function parameter for binfmt->coredump */
 struct coredump_params {
 	const siginfo_t *siginfo;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e6494261eaff..7450ca2ac1fc 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
 
 asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
 		     const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+		     const compat_uptr_t __user *argv,
+		     const compat_uptr_t __user *envp, int flags);
 
 asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 		compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e8a63c..133b60b1d4d0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2072,6 +2072,7 @@ extern int vfs_open(const struct path *, struct file *, const struct cred *);
 extern struct file * dentry_open(const struct path *, int, const struct cred *);
 extern int filp_close(struct file *, fl_owner_t id);
 
+extern struct filename *getname_flags(const char __user *, int, int *);
 extern struct filename *getname(const char __user *);
 extern struct filename *getname_kernel(const char *);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5e344bbe63ec..344163d09efb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2441,6 +2441,10 @@ extern void do_group_exit(int);
 extern int do_execve(struct filename *,
 		     const char __user * const __user *,
 		     const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+		       const char __user * const __user *,
+		       const char __user * const __user *,
+		       int);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81357cc..1ff5a4d09693 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,4 +877,9 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
 			      unsigned int flags);
 asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+			const char __user *const __user *argv,
+			const char __user *const __user *envp, int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 22749c134117..e016bd9b1a04 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -707,9 +707,11 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
 #define __NR_bpf 280
 __SYSCALL(__NR_bpf, sys_bpf)
+#define __NR_execveat 281
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 
 #undef __NR_syscalls
-#define __NR_syscalls 281
+#define __NR_syscalls 282
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 02aa4185b17e..832fba6e2eb1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -224,3 +224,6 @@ cond_syscall(sys_seccomp);
 
 /* access BPF programs and maps */
 cond_syscall(sys_bpf);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 1d726a22565b..b8fb5ee81e26 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
 	case __NR_socketcall:
 		return 4;
 #endif
+#ifdef __NR_execveat
+	case __NR_execveat:
+#endif
 	case __NR_execve:
 		return 5;
 	default:
-- 
2.1.0.rc2.206.gedb03e5
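
For readers following the fs/exec.c change: the name recorded in
bprm->filename depends on fd and on the first character of the path.
The following is a minimal userspace sketch of that rule, not kernel
code. sketch_exec_name() is a hypothetical helper introduced only for
illustration, and snprintf() stands in for kasprintf().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for AT_FDCWD under strict -std= modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Hypothetical userspace mirror of how do_execveat_common() picks
 * bprm->filename.  Returns the length snprintf() reports, i.e. the
 * length of the full name even if buf was too small to hold it.
 */
static int sketch_exec_name(char *buf, size_t len, int fd, const char *name)
{
	assert(buf != NULL && name != NULL);

	/* AT_FDCWD or an absolute path: the name is used unchanged. */
	if (fd == AT_FDCWD || name[0] == '/')
		return snprintf(buf, len, "%s", name);
	/* Empty path (AT_EMPTY_PATH): the fd itself names the file. */
	if (name[0] == '\0')
		return snprintf(buf, len, "/dev/fd/%d", fd);
	/* Otherwise the path is relative to the directory fd. */
	return snprintf(buf, len, "/dev/fd/%d/%s", fd, name);
}
```

With fd 5, an empty path yields "/dev/fd/5" and "script" yields
"/dev/fd/5/script".  A #! interpreter can only open such a name if fd 5
survives the exec, which is why load_script() gives up with -ENOENT when
the fd is O_CLOEXEC.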


* [PATCHv10 2/5] x86: Hook up execveat system call.
  2014-11-24 11:53 ` David Drysdale
  (?)
  (?)
@ 2014-11-24 11:53 ` David Drysdale
  2014-11-24 12:45     ` Thomas Gleixner
  2014-11-24 17:06     ` Dan Carpenter
  -1 siblings, 2 replies; 123+ messages in thread
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Hook up x86-64, i386 and x32 ABIs.

Signed-off-by: David Drysdale <drysdale@google.com>
---
 arch/x86/ia32/audit.c            |  1 +
 arch/x86/ia32/ia32entry.S        |  1 +
 arch/x86/kernel/audit_64.c       |  1 +
 arch/x86/kernel/entry_64.S       | 28 ++++++++++++++++++++++++++++
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  2 ++
 arch/x86/um/sys_call_table_64.c  |  1 +
 7 files changed, 35 insertions(+)

diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
index 5d7b381da692..2eccc8932ae6 100644
--- a/arch/x86/ia32/audit.c
+++ b/arch/x86/ia32/audit.c
@@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
 	case __NR_socketcall:
 		return 4;
 	case __NR_execve:
+	case __NR_execveat:
 		return 5;
 	default:
 		return 1;
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index ffe71228fc10..82e8a1d44658 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -480,6 +480,7 @@ GLOBAL(\label)
 	PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
 	PTREGSCALL stub32_sigreturn, sys32_sigreturn
 	PTREGSCALL stub32_execve, compat_sys_execve
+	PTREGSCALL stub32_execveat, compat_sys_execveat
 	PTREGSCALL stub32_fork, sys_fork
 	PTREGSCALL stub32_vfork, sys_vfork
 
diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..f3672508b249 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
 	case __NR_openat:
 		return 3;
 	case __NR_execve:
+	case __NR_execveat:
 		return 5;
 	default:
 		return 0;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb03fb3..40d893c60fcc 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -652,6 +652,20 @@ ENTRY(stub_execve)
 	CFI_ENDPROC
 END(stub_execve)
 
+ENTRY(stub_execveat)
+	CFI_STARTPROC
+	addq $8, %rsp
+	PARTIAL_FRAME 0
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	call sys_execveat
+	RESTORE_TOP_OF_STACK %r11
+	movq %rax,RAX(%rsp)
+	RESTORE_REST
+	jmp int_ret_from_sys_call
+	CFI_ENDPROC
+END(stub_execveat)
+
 /*
  * sigreturn is special because it needs to restore all registers on return.
  * This cannot be done with SYSRET, so use the IRET return path instead.
@@ -697,6 +711,20 @@ ENTRY(stub_x32_execve)
 	CFI_ENDPROC
 END(stub_x32_execve)
 
+ENTRY(stub_x32_execveat)
+	CFI_STARTPROC
+	addq $8, %rsp
+	PARTIAL_FRAME 0
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	call compat_sys_execveat
+	RESTORE_TOP_OF_STACK %r11
+	movq %rax,RAX(%rsp)
+	RESTORE_REST
+	jmp int_ret_from_sys_call
+	CFI_ENDPROC
+END(stub_x32_execveat)
+
 #endif
 
 /*
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 9fe1b5d002f0..b3560ece1c9f 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,4 @@
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
+358	i386	execveat		sys_execveat			stub32_execveat
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 281150b539a2..8d656fbb57aa 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,7 @@
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
+322	64	execveat		stub_execveat
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -366,3 +367,4 @@
 542	x32	getsockopt		compat_sys_getsockopt
 543	x32	io_setup		compat_sys_io_setup
 544	x32	io_submit		compat_sys_io_submit
+545	x32	execveat		stub_x32_execveat
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723070ca..20c3649d0691 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -31,6 +31,7 @@
 #define stub_fork sys_fork
 #define stub_vfork sys_vfork
 #define stub_execve sys_execve
+#define stub_execveat sys_execveat
 #define stub_rt_sigreturn sys_rt_sigreturn
 
 #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
-- 
2.1.0.rc2.206.gedb03e5
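
Until a C library provides a wrapper, userspace can reach the new call
through syscall(2), as the selftest in patch 3/5 does.  The sketch below
is an illustration under stated assumptions rather than a proposed libc
implementation: sketch_execveat() and sketch_fexecve() are hypothetical
names, the fallback numbers are the ones this patch adds to
syscall_64.tbl and syscall_32.tbl (x32 is left out), and the
AT_EMPTY_PATH fallback value is assumed to match the kernel's uapi
headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for syscall() and AT_EMPTY_PATH */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* assumed to match <linux/fcntl.h> */
#endif

/* Fallback numbers taken from the tables in this patch (x32 omitted). */
#ifndef __NR_execveat
# if defined(__x86_64__) && !defined(__ILP32__)
#  define __NR_execveat 322
# elif defined(__i386__)
#  define __NR_execveat 358
# endif
#endif

/* Hypothetical thin wrapper; returns -1 with errno set on failure. */
static int sketch_execveat(int fd, const char *path, char *const argv[],
			   char *const envp[], int flags)
{
#ifdef __NR_execveat
	return (int)syscall(__NR_execveat, fd, path, argv, envp, flags);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/* Hypothetical /proc-free fexecve(): execute the file behind fd. */
static int sketch_fexecve(int fd, char *const argv[], char *const envp[])
{
	return sketch_execveat(fd, "", argv, envp, AT_EMPTY_PATH);
}
```

A caller would typically open the binary with open(2) and hand the
descriptor to sketch_fexecve(); on success the call does not return.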


* [PATCHv10 3/5] syscalls: add selftest for execveat(2)
  2014-11-24 11:53 ` David Drysdale
@ 2014-11-24 11:53   ` David Drysdale
  -1 siblings, 0 replies; 123+ messages in thread
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 tools/testing/selftests/Makefile        |   1 +
 tools/testing/selftests/exec/.gitignore |   9 +
 tools/testing/selftests/exec/Makefile   |  25 ++
 tools/testing/selftests/exec/execveat.c | 397 ++++++++++++++++++++++++++++++++
 4 files changed, 432 insertions(+)
 create mode 100644 tools/testing/selftests/exec/.gitignore
 create mode 100644 tools/testing/selftests/exec/Makefile
 create mode 100644 tools/testing/selftests/exec/execveat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 45f145c6f843..c14893b501a9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -15,6 +15,7 @@ TARGETS += user
 TARGETS += sysctl
 TARGETS += firmware
 TARGETS += ftrace
+TARGETS += exec
 
 TARGETS_HOTPLUG = cpu-hotplug
 TARGETS_HOTPLUG += memory-hotplug
diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
new file mode 100644
index 000000000000..64073e050c6a
--- /dev/null
+++ b/tools/testing/selftests/exec/.gitignore
@@ -0,0 +1,9 @@
+subdir*
+script*
+execveat
+execveat.symlink
+execveat.moved
+execveat.path.ephemeral
+execveat.ephemeral
+execveat.denatured
+xxxxxxxx*
\ No newline at end of file
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
new file mode 100644
index 000000000000..66dfc2ce1788
--- /dev/null
+++ b/tools/testing/selftests/exec/Makefile
@@ -0,0 +1,25 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = execveat
+DEPS = execveat.symlink execveat.denatured script subdir
+all: $(BINARIES) $(DEPS)
+
+subdir:
+	mkdir -p $@
+script:
+	echo '#!/bin/sh' > $@
+	echo 'exit $$*' >> $@
+	chmod +x $@
+execveat.symlink: execveat
+	ln -s -f $< $@
+execveat.denatured: execveat
+	cp $< $@
+	chmod -x $@
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	./execveat
+
+clean:
+	rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved xxxxx*
diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
new file mode 100644
index 000000000000..33a5c06d95ca
--- /dev/null
+++ b/tools/testing/selftests/exec/execveat.c
@@ -0,0 +1,397 @@
+/*
+ * Copyright (c) 2014 Google, Inc.
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftests for execveat(2).
+ */
+
+#define _GNU_SOURCE  /* to get O_PATH, AT_EMPTY_PATH */
+#include <sys/sendfile.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static char longpath[2 * PATH_MAX] = "";
+static char *envp[] = { "IN_TEST=yes", NULL, NULL };
+static char *argv[] = { "execveat", "99", NULL };
+
+static int execveat_(int fd, const char *path, char **argv, char **envp,
+		     int flags)
+{
+#ifdef __NR_execveat
+	return syscall(__NR_execveat, fd, path, argv, envp, flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+#define check_execveat_fail(fd, path, flags, errno)	\
+	_check_execveat_fail(fd, path, flags, errno, #errno)
+static int _check_execveat_fail(int fd, const char *path, int flags,
+				int expected_errno, const char *errno_str)
+{
+	int rc;
+
+	errno = 0;
+	printf("Check failure of execveat(%d, '%s', %d) with %s... ",
+		fd, path?:"(null)", flags, errno_str);
+	rc = execveat_(fd, path, argv, envp, flags);
+
+	if (rc > 0) {
+		printf("[FAIL] (unexpected success from execveat(2))\n");
+		return 1;
+	}
+	if (errno != expected_errno) {
+		printf("[FAIL] (expected errno %d (%s) not %d (%s))\n",
+			expected_errno, strerror(expected_errno),
+			errno, strerror(errno));
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+static int check_execveat_invoked_rc(int fd, const char *path, int flags,
+				     int expected_rc)
+{
+	int status;
+	int rc;
+	pid_t child;
+	int pathlen = path ? strlen(path) : 0;
+
+	if (pathlen > 40)
+		printf("Check success of execveat(%d, '%.20s...%s', %d)... ",
+			fd, path, (path + pathlen - 20), flags);
+	else
+		printf("Check success of execveat(%d, '%s', %d)... ",
+			fd, path?:"(null)", flags);
+	child = fork();
+	if (child < 0) {
+		printf("[FAIL] (fork() failed)\n");
+		return 1;
+	}
+	if (child == 0) {
+		/* Child: do execveat(). */
+		rc = execveat_(fd, path, argv, envp, flags);
+		printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
+			rc, errno, strerror(errno));
+		exit(1);  /* should not reach here */
+	}
+	/* Parent: wait for & check child's exit status. */
+	rc = waitpid(child, &status, 0);
+	if (rc != child) {
+		printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
+		return 1;
+	}
+	if (!WIFEXITED(status)) {
+		printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
+			child, status);
+		return 1;
+	}
+	if (WEXITSTATUS(status) != expected_rc) {
+		printf("[FAIL] (child %d exited with %d not %d)\n",
+			child, WEXITSTATUS(status), expected_rc);
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+static int check_execveat(int fd, const char *path, int flags)
+{
+	return check_execveat_invoked_rc(fd, path, flags, 99);
+}
+
+static char *concat(const char *left, const char *right)
+{
+	char *result = malloc(strlen(left) + strlen(right) + 1);
+
+	strcpy(result, left);
+	strcat(result, right);
+	return result;
+}
+
+static int open_or_die(const char *filename, int flags)
+{
+	int fd = open(filename, flags);
+
+	if (fd < 0) {
+		printf("Failed to open '%s'; "
+			"check prerequisites are available\n", filename);
+		exit(1);
+	}
+	return fd;
+}
+
+static void exe_cp(const char *src, const char *dest)
+{
+	int in_fd = open_or_die(src, O_RDONLY);
+	int out_fd = open(dest, O_RDWR|O_CREAT|O_TRUNC, 0755);
+	struct stat info;
+
+	fstat(in_fd, &info);
+	sendfile(out_fd, in_fd, NULL, info.st_size);
+	close(in_fd);
+	close(out_fd);
+}
+
+#define XX_DIR_LEN 200
+static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+{
+	int fail = 0;
+	int ii, count, len;
+	char longname[XX_DIR_LEN + 1];
+	int fd;
+
+	if (*longpath == '\0') {
+		/* Create a filename close to PATH_MAX in length */
+		memset(longname, 'x', XX_DIR_LEN - 1);
+		longname[XX_DIR_LEN - 1] = '/';
+		longname[XX_DIR_LEN] = '\0';
+		count = (PATH_MAX - 3) / XX_DIR_LEN;
+		for (ii = 0; ii < count; ii++) {
+			strcat(longpath, longname);
+			mkdir(longpath, 0755);
+		}
+		len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+		if (len <= 0)
+			len = 1;
+		memset(longname, 'y', len);
+		longname[len] = '\0';
+		strcat(longpath, longname);
+	}
+	exe_cp(src, longpath);
+
+	/*
+	 * Execute as a pre-opened file descriptor, which works whether this is
+	 * a script or not (because the interpreter sees a filename like
+	 * "/dev/fd/20").
+	 */
+	fd = open(longpath, O_RDONLY);
+	if (fd > 0) {
+		printf("Invoke copy of '%s' via filename of length %lu:\n",
+			src, strlen(longpath));
+		fail += check_execveat(fd, "", AT_EMPTY_PATH);
+	} else {
+		printf("Failed to open length %lu filename, errno=%d (%s)\n",
+			strlen(longpath), errno, strerror(errno));
+		fail++;
+	}
+
+	/*
+	 * Execute as a long pathname relative to ".".  If this is a script,
+	 * the interpreter will launch but fail to open the script because its
+	 * name ("/dev/fd/5/xxx....") is bigger than PATH_MAX.
+	 */
+	if (is_script)
+		fail += check_execveat_invoked_rc(dot_dfd, longpath, 0, 127);
+	else
+		fail += check_execveat(dot_dfd, longpath, 0);
+
+	return fail;
+}
+
+static int run_tests(void)
+{
+	int fail = 0;
+	char *fullname = realpath("execveat", NULL);
+	char *fullname_script = realpath("script", NULL);
+	char *fullname_symlink = concat(fullname, ".symlink");
+	int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
+	int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
+					       O_DIRECTORY|O_RDONLY);
+	int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+	int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
+	int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
+	int fd = open_or_die("execveat", O_RDONLY);
+	int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
+	int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
+	int fd_denatured = open_or_die("execveat.denatured", O_RDONLY);
+	int fd_denatured_path = open_or_die("execveat.denatured",
+					    O_RDONLY|O_PATH);
+	int fd_script = open_or_die("script", O_RDONLY);
+	int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
+	int fd_ephemeral_path = open_or_die("execveat.path.ephemeral",
+					    O_RDONLY|O_PATH);
+	int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
+	int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
+	int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
+
+	/* Change file position to confirm it doesn't affect anything */
+	lseek(fd, 10, SEEK_SET);
+
+	/* Normal executable file: */
+	/*   dfd + path */
+	fail += check_execveat(subdir_dfd, "../execveat", 0);
+	fail += check_execveat(dot_dfd, "execveat", 0);
+	fail += check_execveat(dot_dfd_path, "execveat", 0);
+	/*   absolute path */
+	fail += check_execveat(AT_FDCWD, fullname, 0);
+	/*   absolute path with nonsense dfd */
+	fail += check_execveat(99, fullname, 0);
+	/*   fd + no path */
+	fail += check_execveat(fd, "", AT_EMPTY_PATH);
+	/*   O_CLOEXEC fd + no path */
+	fail += check_execveat(fd_cloexec, "", AT_EMPTY_PATH);
+	/*   O_PATH fd */
+	fail += check_execveat(fd_path, "", AT_EMPTY_PATH);
+
+	/* Mess with executable file that's already open: */
+	/*   fd + no path to a file that's been renamed */
+	rename("execveat.ephemeral", "execveat.moved");
+	fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+	/*   fd + no path to a file that's been deleted */
+	unlink("execveat.moved"); /* remove the file while fd open */
+	fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+
+	/* Mess with executable file that's already open with O_PATH */
+	/*   fd + no path to a file that's been deleted */
+	unlink("execveat.path.ephemeral");
+	fail += check_execveat(fd_ephemeral_path, "", AT_EMPTY_PATH);
+
+	/* Invalid argument failures */
+	fail += check_execveat_fail(fd, "", 0, ENOENT);
+	fail += check_execveat_fail(fd, NULL, AT_EMPTY_PATH, EFAULT);
+
+	/* Symlink to executable file: */
+	/*   dfd + path */
+	fail += check_execveat(dot_dfd, "execveat.symlink", 0);
+	fail += check_execveat(dot_dfd_path, "execveat.symlink", 0);
+	/*   absolute path */
+	fail += check_execveat(AT_FDCWD, fullname_symlink, 0);
+	/*   fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
+	fail += check_execveat(fd_symlink, "", AT_EMPTY_PATH);
+	fail += check_execveat(fd_symlink, "",
+			       AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+
+	/* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
+	/*   dfd + path */
+	fail += check_execveat_fail(dot_dfd, "execveat.symlink",
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+	fail += check_execveat_fail(dot_dfd_path, "execveat.symlink",
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+	/*   absolute path */
+	fail += check_execveat_fail(AT_FDCWD, fullname_symlink,
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+
+	/* Shell script wrapping executable file: */
+	/*   dfd + path */
+	fail += check_execveat(subdir_dfd, "../script", 0);
+	fail += check_execveat(dot_dfd, "script", 0);
+	fail += check_execveat(dot_dfd_path, "script", 0);
+	/*   absolute path */
+	fail += check_execveat(AT_FDCWD, fullname_script, 0);
+	/*   fd + no path */
+	fail += check_execveat(fd_script, "", AT_EMPTY_PATH);
+	fail += check_execveat(fd_script, "",
+			       AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+	/*   O_CLOEXEC fd fails for a script (as script file inaccessible) */
+	fail += check_execveat_fail(fd_script_cloexec, "", AT_EMPTY_PATH,
+				    ENOENT);
+	fail += check_execveat_fail(dot_dfd_cloexec, "script", 0, ENOENT);
+
+	/* Mess with script file that's already open: */
+	/*   fd + no path to a file that's been renamed */
+	rename("script.ephemeral", "script.moved");
+	fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+	/*   fd + no path to a file that's been deleted */
+	unlink("script.moved"); /* remove the file while fd open */
+	fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+
+	/* Rename a subdirectory in the path: */
+	rename("subdir.ephemeral", "subdir.moved");
+	fail += check_execveat(subdir_dfd_ephemeral, "../script", 0);
+	fail += check_execveat(subdir_dfd_ephemeral, "script", 0);
+	/* Remove the subdir and its contents */
+	unlink("subdir.moved/script");
+	unlink("subdir.moved");
+	/* Shell loads via deleted subdir OK because name starts with .. */
+	fail += check_execveat(subdir_dfd_ephemeral, "../script", 0);
+	fail += check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
+
+	/* Flag bits other than AT_SYMLINK_NOFOLLOW/AT_EMPTY_PATH => EINVAL */
+	fail += check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
+	/* Invalid path => ENOENT */
+	fail += check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
+	fail += check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
+	fail += check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
+	/* Attempt to execute directory => EACCES */
+	fail += check_execveat_fail(dot_dfd, "", AT_EMPTY_PATH, EACCES);
+	/* Attempt to execute non-executable => EACCES */
+	fail += check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
+	fail += check_execveat_fail(fd_denatured, "", AT_EMPTY_PATH, EACCES);
+	fail += check_execveat_fail(fd_denatured_path, "", AT_EMPTY_PATH,
+				    EACCES);
+	/* Attempt to execute nonsense FD => EBADF */
+	fail += check_execveat_fail(99, "", AT_EMPTY_PATH, EBADF);
+	fail += check_execveat_fail(99, "execveat", 0, EBADF);
+	/* Attempt to execute relative to non-directory => ENOTDIR */
+	fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
+
+	fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
+	fail += check_execveat_pathmax(dot_dfd, "script", 1);
+	return fail;
+}
+
+static void prerequisites(void)
+{
+	int fd;
+	const char *script = "#!/bin/sh\nexit $*\n";
+
+	/* Create ephemeral copies of files */
+	exe_cp("execveat", "execveat.ephemeral");
+	exe_cp("execveat", "execveat.path.ephemeral");
+	exe_cp("script", "script.ephemeral");
+	mkdir("subdir.ephemeral", 0755);
+
+	fd = open("subdir.ephemeral/script", O_RDWR|O_CREAT|O_TRUNC, 0755);
+	write(fd, script, strlen(script));
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	int ii;
+	int rc;
+	const char *verbose = getenv("VERBOSE");
+
+	if (argc >= 2) {
+		/* If we are invoked with an argument, don't run tests. */
+		const char *in_test = getenv("IN_TEST");
+
+		if (verbose) {
+			printf("  invoked with:");
+			for (ii = 0; ii < argc; ii++)
+				printf(" [%d]='%s'", ii, argv[ii]);
+			printf("\n");
+		}
+
+		/* Check expected environment transferred. */
+		if (!in_test || strcmp(in_test, "yes") != 0) {
+			printf("[FAIL] (no IN_TEST=yes in env)\n");
+			return 1;
+		}
+
+		/* Use the final argument as an exit code. */
+		rc = atoi(argv[argc - 1]);
+		fflush(stdout);
+	} else {
+		prerequisites();
+		if (verbose)
+			envp[1] = "VERBOSE=1";
+		rc = run_tests();
+		if (rc > 0)
+			printf("%d tests failed\n", rc);
+	}
+	return rc;
+}
-- 
2.1.0.rc2.206.gedb03e5
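
The path-length arithmetic in check_execveat_pathmax() is easy to
misread, so here is a small sketch of it in isolation.
sketch_longpath_len() is a hypothetical helper (not part of the
selftest) that takes the PATH_MAX value as a parameter and reproduces
the computation of the final name length without touching the
filesystem.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DIR_LEN 200	/* same value as XX_DIR_LEN in the selftest */

/*
 * Hypothetical helper: length of the name that check_execveat_pathmax()
 * builds for a given PATH_MAX, without creating any directories.
 */
static size_t sketch_longpath_len(size_t path_max)
{
	size_t target, count, tail;

	assert(path_max > 3);
	target = path_max - 3;
	/* Whole directory components, each SKETCH_DIR_LEN - 1 'x's plus '/'. */
	count = target / SKETCH_DIR_LEN;
	/* The final 'y' component fills the remainder, but is never empty. */
	tail = target - count * SKETCH_DIR_LEN;
	if (tail == 0)
		tail = 1;
	return count * SKETCH_DIR_LEN + tail;
}
```

For PATH_MAX = 4096 this gives 20 directory components of 200 characters
and a 93-character file name, 4093 characters in total.  That still fits
within PATH_MAX, but prefixing "/dev/fd/5/" (10 characters) gives 4103,
which does not; this is why the script case in the test expects exit
status 127 rather than 99.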


* [PATCHv10 3/5] syscalls: add selftest for execveat(2)
@ 2014-11-24 11:53   ` David Drysdale
  0 siblings, 0 replies; 123+ messages in thread
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 tools/testing/selftests/Makefile        |   1 +
 tools/testing/selftests/exec/.gitignore |   9 +
 tools/testing/selftests/exec/Makefile   |  25 ++
 tools/testing/selftests/exec/execveat.c | 397 ++++++++++++++++++++++++++++++++
 4 files changed, 432 insertions(+)
 create mode 100644 tools/testing/selftests/exec/.gitignore
 create mode 100644 tools/testing/selftests/exec/Makefile
 create mode 100644 tools/testing/selftests/exec/execveat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 45f145c6f843..c14893b501a9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -15,6 +15,7 @@ TARGETS += user
 TARGETS += sysctl
 TARGETS += firmware
 TARGETS += ftrace
+TARGETS += exec
 
 TARGETS_HOTPLUG = cpu-hotplug
 TARGETS_HOTPLUG += memory-hotplug
diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
new file mode 100644
index 000000000000..64073e050c6a
--- /dev/null
+++ b/tools/testing/selftests/exec/.gitignore
@@ -0,0 +1,9 @@
+subdir*
+script*
+execveat
+execveat.symlink
+execveat.moved
+execveat.path.ephemeral
+execveat.ephemeral
+execveat.denatured
+xxxxxxxx*
\ No newline at end of file
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
new file mode 100644
index 000000000000..66dfc2ce1788
--- /dev/null
+++ b/tools/testing/selftests/exec/Makefile
@@ -0,0 +1,25 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = execveat
+DEPS = execveat.symlink execveat.denatured script subdir
+all: $(BINARIES) $(DEPS)
+
+subdir:
+	mkdir -p $@
+script:
+	echo '#!/bin/sh' > $@
+	echo 'exit $$*' >> $@
+	chmod +x $@
+execveat.symlink: execveat
+	ln -s -f $< $@
+execveat.denatured: execveat
+	cp $< $@
+	chmod -x $@
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	./execveat
+
+clean:
+	rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved xxxxx*
diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
new file mode 100644
index 000000000000..33a5c06d95ca
--- /dev/null
+++ b/tools/testing/selftests/exec/execveat.c
@@ -0,0 +1,397 @@
+/*
+ * Copyright (c) 2014 Google, Inc.
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftests for execveat(2).
+ */
+
+#define _GNU_SOURCE  /* to get O_PATH, AT_EMPTY_PATH */
+#include <sys/sendfile.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static char longpath[2 * PATH_MAX] = "";
+static char *envp[] = { "IN_TEST=yes", NULL, NULL };
+static char *argv[] = { "execveat", "99", NULL };
+
+static int execveat_(int fd, const char *path, char **argv, char **envp,
+		     int flags)
+{
+#ifdef __NR_execveat
+	return syscall(__NR_execveat, fd, path, argv, envp, flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+#define check_execveat_fail(fd, path, flags, errno)	\
+	_check_execveat_fail(fd, path, flags, errno, #errno)
+static int _check_execveat_fail(int fd, const char *path, int flags,
+				int expected_errno, const char *errno_str)
+{
+	int rc;
+
+	errno = 0;
+	printf("Check failure of execveat(%d, '%s', %d) with %s... ",
+		fd, path?:"(null)", flags, errno_str);
+	rc = execveat_(fd, path, argv, envp, flags);
+
+	if (rc > 0) {
+		printf("[FAIL] (unexpected success from execveat(2))\n");
+		return 1;
+	}
+	if (errno != expected_errno) {
+		printf("[FAIL] (expected errno %d (%s) not %d (%s))\n",
+			expected_errno, strerror(expected_errno),
+			errno, strerror(errno));
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+static int check_execveat_invoked_rc(int fd, const char *path, int flags,
+				     int expected_rc)
+{
+	int status;
+	int rc;
+	pid_t child;
+	int pathlen = path ? strlen(path) : 0;
+
+	if (pathlen > 40)
+		printf("Check success of execveat(%d, '%.20s...%s', %d)... ",
+			fd, path, (path + pathlen - 20), flags);
+	else
+		printf("Check success of execveat(%d, '%s', %d)... ",
+			fd, path?:"(null)", flags);
+	child = fork();
+	if (child < 0) {
+		printf("[FAIL] (fork() failed)\n");
+		return 1;
+	}
+	if (child == 0) {
+		/* Child: do execveat(). */
+		rc = execveat_(fd, path, argv, envp, flags);
+		printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
+			rc, errno, strerror(errno));
+		exit(1);  /* should not reach here */
+	}
+	/* Parent: wait for & check child's exit status. */
+	rc = waitpid(child, &status, 0);
+	if (rc != child) {
+		printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
+		return 1;
+	}
+	if (!WIFEXITED(status)) {
+		printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
+			child, status);
+		return 1;
+	}
+	if (WEXITSTATUS(status) != expected_rc) {
+		printf("[FAIL] (child %d exited with %d not %d)\n",
+			child, WEXITSTATUS(status), expected_rc);
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+static int check_execveat(int fd, const char *path, int flags)
+{
+	return check_execveat_invoked_rc(fd, path, flags, 99);
+}
+
+static char *concat(const char *left, const char *right)
+{
+	char *result = malloc(strlen(left) + strlen(right) + 1);
+
+	strcpy(result, left);
+	strcat(result, right);
+	return result;
+}
+
+static int open_or_die(const char *filename, int flags)
+{
+	int fd = open(filename, flags);
+
+	if (fd < 0) {
+		printf("Failed to open '%s'; "
+			"check prerequisites are available\n", filename);
+		exit(1);
+	}
+	return fd;
+}
+
+static void exe_cp(const char *src, const char *dest)
+{
+	int in_fd = open_or_die(src, O_RDONLY);
+	int out_fd = open(dest, O_RDWR|O_CREAT|O_TRUNC, 0755);
+	struct stat info;
+
+	fstat(in_fd, &info);
+	sendfile(out_fd, in_fd, NULL, info.st_size);
+	close(in_fd);
+	close(out_fd);
+}
+
+#define XX_DIR_LEN 200
+static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+{
+	int fail = 0;
+	int ii, count, len;
+	char longname[XX_DIR_LEN + 1];
+	int fd;
+
+	if (*longpath == '\0') {
+		/* Create a filename close to PATH_MAX in length */
+		memset(longname, 'x', XX_DIR_LEN - 1);
+		longname[XX_DIR_LEN - 1] = '/';
+		longname[XX_DIR_LEN] = '\0';
+		count = (PATH_MAX - 3) / XX_DIR_LEN;
+		for (ii = 0; ii < count; ii++) {
+			strcat(longpath, longname);
+			mkdir(longpath, 0755);
+		}
+		len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+		if (len <= 0)
+			len = 1;
+		memset(longname, 'y', len);
+		longname[len] = '\0';
+		strcat(longpath, longname);
+	}
+	exe_cp(src, longpath);
+
+	/*
+	 * Execute as a pre-opened file descriptor, which works whether this is
+	 * a script or not (because the interpreter sees a filename like
+	 * "/dev/fd/20").
+	 */
+	fd = open(longpath, O_RDONLY);
+	if (fd > 0) {
+		printf("Invoke copy of '%s' via filename of length %lu:\n",
+			src, strlen(longpath));
+		fail += check_execveat(fd, "", AT_EMPTY_PATH);
+	} else {
+		printf("Failed to open length %lu filename, errno=%d (%s)\n",
+			strlen(longpath), errno, strerror(errno));
+		fail++;
+	}
+
+	/*
+	 * Execute as a long pathname relative to ".".  If this is a script,
+	 * the interpreter will launch but fail to open the script because its
+	 * name ("/dev/fd/5/xxx....") is bigger than PATH_MAX.
+	 */
+	if (is_script)
+		fail += check_execveat_invoked_rc(dot_dfd, longpath, 0, 127);
+	else
+		fail += check_execveat(dot_dfd, longpath, 0);
+
+	return fail;
+}
+
+static int run_tests(void)
+{
+	int fail = 0;
+	char *fullname = realpath("execveat", NULL);
+	char *fullname_script = realpath("script", NULL);
+	char *fullname_symlink = concat(fullname, ".symlink");
+	int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
+	int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
+					       O_DIRECTORY|O_RDONLY);
+	int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+	int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
+	int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
+	int fd = open_or_die("execveat", O_RDONLY);
+	int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
+	int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
+	int fd_denatured = open_or_die("execveat.denatured", O_RDONLY);
+	int fd_denatured_path = open_or_die("execveat.denatured",
+					    O_RDONLY|O_PATH);
+	int fd_script = open_or_die("script", O_RDONLY);
+	int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
+	int fd_ephemeral_path = open_or_die("execveat.path.ephemeral",
+					    O_RDONLY|O_PATH);
+	int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
+	int fd_cloexec = open_or_die("execveat", O_RDONLY|O_CLOEXEC);
+	int fd_script_cloexec = open_or_die("script", O_RDONLY|O_CLOEXEC);
+
+	/* Change file position to confirm it doesn't affect anything */
+	lseek(fd, 10, SEEK_SET);
+
+	/* Normal executable file: */
+	/*   dfd + path */
+	fail += check_execveat(subdir_dfd, "../execveat", 0);
+	fail += check_execveat(dot_dfd, "execveat", 0);
+	fail += check_execveat(dot_dfd_path, "execveat", 0);
+	/*   absolute path */
+	fail += check_execveat(AT_FDCWD, fullname, 0);
+	/*   absolute path with nonsense dfd */
+	fail += check_execveat(99, fullname, 0);
+	/*   fd + no path */
+	fail += check_execveat(fd, "", AT_EMPTY_PATH);
+	/*   O_CLOEXEC fd + no path */
+	fail += check_execveat(fd_cloexec, "", AT_EMPTY_PATH);
+	/*   O_PATH fd */
+	fail += check_execveat(fd_path, "", AT_EMPTY_PATH);
+
+	/* Mess with executable file that's already open: */
+	/*   fd + no path to a file that's been renamed */
+	rename("execveat.ephemeral", "execveat.moved");
+	fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+	/*   fd + no path to a file that's been deleted */
+	unlink("execveat.moved"); /* remove the file now fd open */
+	fail += check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+
+	/* Mess with executable file that's already open with O_PATH */
+	/*   fd + no path to a file that's been deleted */
+	unlink("execveat.path.ephemeral");
+	fail += check_execveat(fd_ephemeral_path, "", AT_EMPTY_PATH);
+
+	/* Invalid argument failures */
+	fail += check_execveat_fail(fd, "", 0, ENOENT);
+	fail += check_execveat_fail(fd, NULL, AT_EMPTY_PATH, EFAULT);
+
+	/* Symlink to executable file: */
+	/*   dfd + path */
+	fail += check_execveat(dot_dfd, "execveat.symlink", 0);
+	fail += check_execveat(dot_dfd_path, "execveat.symlink", 0);
+	/*   absolute path */
+	fail += check_execveat(AT_FDCWD, fullname_symlink, 0);
+	/*   fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
+	fail += check_execveat(fd_symlink, "", AT_EMPTY_PATH);
+	fail += check_execveat(fd_symlink, "",
+			       AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+
+	/* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
+	/*   dfd + path */
+	fail += check_execveat_fail(dot_dfd, "execveat.symlink",
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+	fail += check_execveat_fail(dot_dfd_path, "execveat.symlink",
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+	/*   absolute path */
+	fail += check_execveat_fail(AT_FDCWD, fullname_symlink,
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+
+	/* Shell script wrapping executable file: */
+	/*   dfd + path */
+	fail += check_execveat(subdir_dfd, "../script", 0);
+	fail += check_execveat(dot_dfd, "script", 0);
+	fail += check_execveat(dot_dfd_path, "script", 0);
+	/*   absolute path */
+	fail += check_execveat(AT_FDCWD, fullname_script, 0);
+	/*   fd + no path */
+	fail += check_execveat(fd_script, "", AT_EMPTY_PATH);
+	fail += check_execveat(fd_script, "",
+			       AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+	/*   O_CLOEXEC fd fails for a script (as script file inaccessible) */
+	fail += check_execveat_fail(fd_script_cloexec, "", AT_EMPTY_PATH,
+				    ENOENT);
+	fail += check_execveat_fail(dot_dfd_cloexec, "script", 0, ENOENT);
+
+	/* Mess with script file that's already open: */
+	/*   fd + no path to a file that's been renamed */
+	rename("script.ephemeral", "script.moved");
+	fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+	/*   fd + no path to a file that's been deleted */
+	unlink("script.moved"); /* remove the file while fd open */
+	fail += check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+
+	/* Rename a subdirectory in the path: */
+	rename("subdir.ephemeral", "subdir.moved");
+	fail += check_execveat(subdir_dfd_ephemeral, "../script", 0);
+	fail += check_execveat(subdir_dfd_ephemeral, "script", 0);
+	/* Remove the subdir and its contents */
+	unlink("subdir.moved/script");
+	unlink("subdir.moved");
+	/* Shell loads via deleted subdir OK because name starts with .. */
+	fail += check_execveat(subdir_dfd_ephemeral, "../script", 0);
+	fail += check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
+
+	/* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */
+	fail += check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
+	/* Invalid path => ENOENT */
+	fail += check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
+	fail += check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
+	fail += check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
+	/* Attempt to execute directory => EACCES */
+	fail += check_execveat_fail(dot_dfd, "", AT_EMPTY_PATH, EACCES);
+	/* Attempt to execute non-executable => EACCES */
+	fail += check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
+	fail += check_execveat_fail(fd_denatured, "", AT_EMPTY_PATH, EACCES);
+	fail += check_execveat_fail(fd_denatured_path, "", AT_EMPTY_PATH,
+				    EACCES);
+	/* Attempt to execute nonsense FD => EBADF */
+	fail += check_execveat_fail(99, "", AT_EMPTY_PATH, EBADF);
+	fail += check_execveat_fail(99, "execveat", 0, EBADF);
+	/* Attempt to execute relative to non-directory => ENOTDIR */
+	fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
+
+	fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
+	fail += check_execveat_pathmax(dot_dfd, "script", 1);
+	return fail;
+}
+
+static void prerequisites(void)
+{
+	int fd;
+	const char *script = "#!/bin/sh\nexit $*\n";
+
+	/* Create ephemeral copies of files */
+	exe_cp("execveat", "execveat.ephemeral");
+	exe_cp("execveat", "execveat.path.ephemeral");
+	exe_cp("script", "script.ephemeral");
+	mkdir("subdir.ephemeral", 0755);
+
+	fd = open("subdir.ephemeral/script", O_RDWR|O_CREAT|O_TRUNC, 0755);
+	write(fd, script, strlen(script));
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	int ii;
+	int rc;
+	const char *verbose = getenv("VERBOSE");
+
+	if (argc >= 2) {
+		/* If we are invoked with an argument, don't run tests. */
+		const char *in_test = getenv("IN_TEST");
+
+		if (verbose) {
+			printf("  invoked with:");
+			for (ii = 0; ii < argc; ii++)
+				printf(" [%d]='%s'", ii, argv[ii]);
+			printf("\n");
+		}
+
+		/* Check expected environment transferred. */
+		if (!in_test || strcmp(in_test, "yes") != 0) {
+			printf("[FAIL] (no IN_TEST=yes in env)\n");
+			return 1;
+		}
+
+		/* Use the final argument as an exit code. */
+		rc = atoi(argv[argc - 1]);
+		fflush(stdout);
+	} else {
+		prerequisites();
+		if (verbose)
+			envp[1] = "VERBOSE=1";
+		rc = run_tests();
+		if (rc > 0)
+			printf("%d tests failed\n", rc);
+	}
+	return rc;
+}
-- 
2.1.0.rc2.206.gedb03e5
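
The selftest above relies on a fork / exec / waitpid pattern: the child
replaces itself (or exits), and the parent only trusts the exit code once
WIFEXITED() confirms a normal exit.  A minimal standalone sketch of that
pattern follows; run_in_child() and exit_with_99() are hypothetical names
used for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in the patch): run fn() in a forked child and
 * return the child's exit code, or -1 if the fork or wait failed or the
 * child did not exit normally (for example, it was killed by a signal).
 */
static int run_in_child(int (*fn)(void))
{
	int status;
	pid_t child;

	assert(fn != NULL);
	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
		_exit(fn());	/* child: never returns to the caller */

	/* Parent: wait for the child and decode its status. */
	if (waitpid(child, &status, 0) != child)
		return -1;
	if (!WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}

/* Stand-in for the exec'd program, which exits with its last argument. */
static int exit_with_99(void)
{
	return 99;
}
```

In the selftest the child calls execveat() instead of a local function, and
the re-invoked binary returns its final argument ("99") as the exit code,
which is what check_execveat() compares against.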


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCHv10 4/5] sparc: Hook up execveat system call.
  2014-11-24 11:53 ` David Drysdale
                   ` (3 preceding siblings ...)
  (?)
@ 2014-11-24 11:53 ` David Drysdale
  2014-11-24 18:36     ` David Miller
  -1 siblings, 1 reply; 123+ messages in thread
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 arch/sparc/include/uapi/asm/unistd.h |  3 ++-
 arch/sparc/kernel/syscalls.S         | 10 ++++++++++
 arch/sparc/kernel/systbls_32.S       |  1 +
 arch/sparc/kernel/systbls_64.S       |  2 ++
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/include/uapi/asm/unistd.h b/arch/sparc/include/uapi/asm/unistd.h
index 46d83842eddc..6f35f4df17f2 100644
--- a/arch/sparc/include/uapi/asm/unistd.h
+++ b/arch/sparc/include/uapi/asm/unistd.h
@@ -415,8 +415,9 @@
 #define __NR_getrandom		347
 #define __NR_memfd_create	348
 #define __NR_bpf		349
+#define __NR_execveat		350
 
-#define NR_syscalls		350
+#define NR_syscalls		351
 
 /* Bitmask values returned from kern_features system call.  */
 #define KERN_FEATURE_MIXED_MODE_STACK	0x00000001
diff --git a/arch/sparc/kernel/syscalls.S b/arch/sparc/kernel/syscalls.S
index 33a17e7b3ccd..bb0008927598 100644
--- a/arch/sparc/kernel/syscalls.S
+++ b/arch/sparc/kernel/syscalls.S
@@ -6,6 +6,11 @@ sys64_execve:
 	jmpl	%g1, %g0
 	 flushw
 
+sys64_execveat:
+	set	sys_execveat, %g1
+	jmpl	%g1, %g0
+	 flushw
+
 #ifdef CONFIG_COMPAT
 sunos_execv:
 	mov	%g0, %o2
@@ -13,6 +18,11 @@ sys32_execve:
 	set	compat_sys_execve, %g1
 	jmpl	%g1, %g0
 	 flushw
+
+sys32_execveat:
+	set	compat_sys_execveat, %g1
+	jmpl	%g1, %g0
+	 flushw
 #endif
 
 	.align	32
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index ad0cdf497b78..e31a9056a303 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -87,3 +87,4 @@ sys_call_table:
 /*335*/	.long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
 /*340*/	.long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
 /*345*/	.long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*350*/	.long sys_execveat
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 580cde9370c9..d72f76ae70eb 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -88,6 +88,7 @@ sys_call_table32:
 	.word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
 /*340*/	.word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
 	.word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*350*/	.word sys32_execveat
 
 #endif /* CONFIG_COMPAT */
 
@@ -167,3 +168,4 @@ sys_call_table:
 	.word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
 /*340*/	.word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
 	.word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*350*/	.word sys64_execveat
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2014-11-24 11:53 ` David Drysdale
                   ` (4 preceding siblings ...)
  (?)
@ 2014-11-24 11:53 ` David Drysdale
  2015-01-09 15:47     ` Michael Kerrisk (man-pages)
  -1 siblings, 1 reply; 123+ messages in thread
From: David Drysdale @ 2014-11-24 11:53 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner
  Cc: Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux,
	David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 153 insertions(+)
 create mode 100644 man2/execveat.2

diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..937d79e4c4f0
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,153 @@
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execveat(int " fd ", const char *" pathname ","
+.br
+.BI "             char *const " argv "[],  char *const " envp "[],"
+.br
+.BI "             int " flags );
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is an empty string and the
+.BR AT_EMPTY_PATH
+flag is specified, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flags:
+.TP
+.BR AT_EMPTY_PATH
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR fd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag).
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR ELOOP .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B ENOENT
+The program identified by \fIfd\fP and \fIpathname\fP requires the
+use of an interpreter program (such as a script starting with
+"#!") but the file descriptor
+.I fd
+was opened with the
+.B O_CLOEXEC
+flag and so the program file is inaccessible to the launched interpreter.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
--
2.1.0.rc2.206.gedb03e5
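
As a rough illustration of the AT_EMPTY_PATH case described in the man page
above, the sketch below executes an already-open file descriptor through the
raw syscall, much as the selftest's execveat_() wrapper does.  The name
exec_by_fd() is hypothetical, and the fallback value for AT_EMPTY_PATH is an
assumption for systems whose headers do not define it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() and AT_EMPTY_PATH */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* assumed Linux value, for old headers */
#endif

/*
 * Hypothetical wrapper (not part of the patch set): execute the file
 * referred to by fd, using an empty pathname plus AT_EMPTY_PATH.
 * Returns only on failure, with -1 and errno set.
 */
static int exec_by_fd(int fd, char *const argv[], char *const envp[])
{
	assert(argv != NULL && envp != NULL);
#ifdef __NR_execveat
	return syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
#else
	errno = ENOSYS;		/* headers predate execveat(2) */
	return -1;
#endif
}
```

A caller would typically obtain fd from open(2) with O_RDONLY or O_PATH.  As
the ERRORS section notes, opening a script with O_CLOEXEC makes the call fail
with ENOENT, because the interpreter cannot reach the file afterwards.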

^ permalink raw reply related	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
@ 2014-11-24 12:45     ` Thomas Gleixner
  0 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2014-11-24 12:45 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Stephen Rothwell, Oleg Nesterov, Michael Kerrisk, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux

On Mon, 24 Nov 2014, David Drysdale wrote:

> Hook up x86-64, i386 and x32 ABIs.
> 
> Signed-off-by: David Drysdale <drysdale@google.com>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
  2014-11-24 11:53 ` [PATCHv10 2/5] x86: Hook up execveat " David Drysdale
  2014-11-24 12:45     ` Thomas Gleixner
@ 2014-11-24 17:06     ` Dan Carpenter
  1 sibling, 0 replies; 123+ messages in thread
From: Dan Carpenter @ 2014-11-24 17:06 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Michael Kerrisk, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Rich Felker, Christoph Hellwig, x86, linux-arch,
	linux-api, sparclinux

On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
> Hook up x86-64, i386 and x32 ABIs.
> 
> Signed-off-by: David Drysdale <drysdale@google.com>

This one has been breaking my linux-next build for the past week.  I'm
not sure what's going on.  I build with a script:

make allmodconfig

cat << EOF >> .config
CONFIG_DYNAMIC_DEBUG=n
CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=n
CONFIG_DYNAMIC_DEBUG=y
EOF

make oldconfig

Here are the errors:

  CHK     include/generated/compile.h
  CHECK   arch/x86/ia32/audit.c
  CC      arch/x86/ia32/audit.o
arch/x86/ia32/audit.c: In function ‘ia32_classify_syscall’:
arch/x86/ia32/audit.c:38:7: error: ‘__NR_execveat’ undeclared (first use in this function)
arch/x86/ia32/audit.c:38:7: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [arch/x86/ia32/audit.o] Error 1
make[2]: Target `__build' not remade because of errors.
make[1]: *** [arch/x86/ia32] Error 2
  CHECK   arch/x86/kernel/audit_64.c
  CHK     kernel/config_data.h
  CC      arch/x86/kernel/audit_64.o
arch/x86/kernel/audit_64.c: In function ‘audit_classify_syscall’:
arch/x86/kernel/audit_64.c:53:7: error: ‘__NR_execveat’ undeclared (first use in this function)
arch/x86/kernel/audit_64.c:53:7: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [arch/x86/kernel/audit_64.o] Error 1
make[2]: Target `__build' not remade because of errors.
make[1]: *** [arch/x86/kernel] Error 2
make[1]: Target `__build' not remade because of errors.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
  2014-11-24 17:06     ` Dan Carpenter
@ 2014-11-24 18:26       ` David Drysdale
  -1 siblings, 0 replies; 123+ messages in thread
From: David Drysdale @ 2014-11-24 18:26 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Michael Kerrisk, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Rich Felker, Christoph Hellwig, X86 ML,
	linux-arch, Linux API, sparclinux

On Mon, Nov 24, 2014 at 5:06 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
>> Hook up x86-64, i386 and x32 ABIs.
>>
>> Signed-off-by: David Drysdale <drysdale@google.com>
>
> This one has been breaking my linux-next build for the past week.  I'm
> not sure what's going on.

Hi Dan,

Sorry if this has been causing you problems -- I've not had any
errors from the kbuild robots or my local builds.

> I build with a script:
>
> make allmodconfig
>
> cat << EOF >> .config
> CONFIG_DYNAMIC_DEBUG=n
> CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=n
> CONFIG_DYNAMIC_DEBUG=y
> EOF
>
> make oldconfig
>
> Here are the errors:
>
>   CHK     include/generated/compile.h
>   CHECK   arch/x86/ia32/audit.c
>   CC      arch/x86/ia32/audit.o
> arch/x86/ia32/audit.c: In function ‘ia32_classify_syscall’:
> arch/x86/ia32/audit.c:38:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/ia32/audit.c:38:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/ia32/audit.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/ia32] Error 2
>   CHECK   arch/x86/kernel/audit_64.c
>   CHK     kernel/config_data.h
>   CC      arch/x86/kernel/audit_64.o
> arch/x86/kernel/audit_64.c: In function ‘audit_classify_syscall’:
> arch/x86/kernel/audit_64.c:53:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/kernel/audit_64.c:53:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/kernel/audit_64.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/kernel] Error 2
> make[1]: Target `__build' not remade because of errors.

That seems odd -- the generic definition of __NR_execveat is in the first
patch in the series, and the various x86-specific definitions should get
generated from the table entries in the second patch in the series (at
least since the v9 set I sent on 19 Nov, which split out the x86 wiring
from the general implementation).

Are the syscall table generation steps happening in your build?  And does
__NR_execveat appear in the various generated x86 unistd*.h headers?

As an aside, I've just built next-20141124 (a4cfa44aa26a) fine from
scratch with your config steps.   The build output included the header
generation steps:

  SYSTBL  arch/x86/syscalls/../include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/syscalls/../include/generated/asm/unistd_32_ia32.h
  SYSHDR  arch/x86/syscalls/../include/generated/asm/unistd_64_x32.h
  SYSTBL  arch/x86/syscalls/../include/generated/asm/syscalls_64.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_32.h
  HOSTCC  scripts/basic/bin2c
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/syscalls/../include/generated/uapi/asm/unistd_x32.h

and the resulting files did include the __NR_execveat constant:

  % grep execveat usr/include/asm*/*.h arch/x86/include/generated/uapi/asm/*.h
  usr/include/asm-generic/unistd.h:#define __NR_execveat 281
  usr/include/asm-generic/unistd.h:__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
  usr/include/asm/unistd_32.h:#define __NR_execveat 358
  usr/include/asm/unistd_64.h:#define __NR_execveat 322
  usr/include/asm/unistd_x32.h:#define __NR_execveat (__X32_SYSCALL_BIT + 545)
  arch/x86/include/generated/uapi/asm/unistd_32.h:#define __NR_execveat 358
  arch/x86/include/generated/uapi/asm/unistd_64.h:#define __NR_execveat 322
  arch/x86/include/generated/uapi/asm/unistd_x32.h:#define __NR_execveat (__X32_SYSCALL_BIT + 545)

So I can't (yet) reproduce your problem I'm afraid...
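
For anyone wanting to run a similar check from C rather than grep, here is
a rough sketch.  The helper name execveat_syscall_number() is hypothetical
(it is not part of the patch series), and it only reports what the
installed <sys/syscall.h> defines, which may differ from the headers
generated inside a kernel build tree.

```c
#include <sys/syscall.h>	/* pulls in the architecture's __NR_* constants */

/*
 * Hypothetical helper: return the execveat syscall number known to the
 * headers this file is compiled against, or -1 if those headers predate
 * execveat and do not define __NR_execveat at all.
 */
static long execveat_syscall_number(void)
{
#ifdef __NR_execveat
	return __NR_execveat;
#else
	return -1;
#endif
}
```

Going by the grep output above, headers from this tree would be expected
to give 322 for x86-64 and 358 for i386; a result of -1 would point at
stale or older headers being picked up.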

> regards,
> dan carpenter

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 4/5] sparc: Hook up execveat system call.
  2014-11-24 11:53 ` [PATCHv10 4/5] sparc: Hook up execveat system call David Drysdale
@ 2014-11-24 18:36     ` David Miller
  0 siblings, 0 replies; 123+ messages in thread
From: David Miller @ 2014-11-24 18:36 UTC (permalink / raw)
  To: drysdale
  Cc: ebiederm, luto, viro, meredydd, linux-kernel, akpm, tglx, sfr,
	oleg, mtk.manpages, mingo, hpa, keescook, arnd, dalias, hch, x86,
	linux-arch, linux-api, sparclinux

From: David Drysdale <drysdale@google.com>
Date: Mon, 24 Nov 2014 11:53:58 +0000

> Signed-off-by: David Drysdale <drysdale@google.com>

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
  2014-11-24 17:06     ` Dan Carpenter
@ 2014-11-24 18:53       ` Thomas Gleixner
  -1 siblings, 0 replies; 123+ messages in thread
From: Thomas Gleixner @ 2014-11-24 18:53 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Stephen Rothwell, Oleg Nesterov, Michael Kerrisk,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Rich Felker, Christoph Hellwig, x86, linux-arch, linux-api,
	sparclinux

On Mon, 24 Nov 2014, Dan Carpenter wrote:

> On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
> > Hook up x86-64, i386 and x32 ABIs.
> > 
> > Signed-off-by: David Drysdale <drysdale@google.com>
> 
> This one has been breaking my linux-next build for the past week.  I'm
> not sure what's going on.  I build with a script:
> 
> make allmodconfig
> 
> cat << EOF >> .config
> CONFIG_DYNAMIC_DEBUG=n
> CONFIG_DEBUG_STRICT_USER_COPY_CHECKS=n
> CONFIG_DYNAMIC_DEBUG=y
> EOF
> 
> make oldconfig
> 
> Here are the errors:
> 
>   CHK     include/generated/compile.h
>   CHECK   arch/x86/ia32/audit.c
>   CC      arch/x86/ia32/audit.o
> arch/x86/ia32/audit.c: In function ‘ia32_classify_syscall’:
> arch/x86/ia32/audit.c:38:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/ia32/audit.c:38:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/ia32/audit.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/ia32] Error 2
>   CHECK   arch/x86/kernel/audit_64.c
>   CHK     kernel/config_data.h
>   CC      arch/x86/kernel/audit_64.o
> arch/x86/kernel/audit_64.c: In function ‘audit_classify_syscall’:
> arch/x86/kernel/audit_64.c:53:7: error: ‘__NR_execveat’ undeclared (first use in this function)
> arch/x86/kernel/audit_64.c:53:7: note: each undeclared identifier is reported only once for each function it appears in
> make[2]: *** [arch/x86/kernel/audit_64.o] Error 1
> make[2]: Target `__build' not remade because of errors.
> make[1]: *** [arch/x86/kernel] Error 2
> make[1]: Target `__build' not remade because of errors.

I don't know what you're doing.

Tried the above and it rebuilds the relevant unistd*.h files and
compiles happily.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 2/5] x86: Hook up execveat system call.
@ 2014-11-25 12:16         ` Dan Carpenter
  0 siblings, 0 replies; 123+ messages in thread
From: Dan Carpenter @ 2014-11-25 12:16 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Michael Kerrisk, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Rich Felker, Christoph Hellwig, X86 ML,
	linux-arch, Linux API, sparclinux

On Mon, Nov 24, 2014 at 06:26:24PM +0000, David Drysdale wrote:
> On Mon, Nov 24, 2014 at 5:06 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> > On Mon, Nov 24, 2014 at 11:53:56AM +0000, David Drysdale wrote:
> >> Hook up x86-64, i386 and x32 ABIs.
> >>
> >> Signed-off-by: David Drysdale <drysdale@google.com>
> >
> > This one has been breaking my linux-next build for the past week.  I'm
> > not sure what's going on.
> 
> Hi Dan,
> 
> Sorry if this has been causing you problems -- I've not had any
> errors from the kbuild robots or my local builds.
> 

For some reason I had a stale copy of
arch/x86/include/generated/asm/unistd_32.h and the build was using that in
preference to the arch/x86/include/generated/uapi/asm/unistd_32.h file.
Once I ran:

	rm arch/x86/include/generated/ -rf

the build succeeded.

I'm not sure how the stale file got there, but it's fixed now.

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2014-11-24 11:53 ` [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2) David Drysdale
@ 2015-01-09 15:47     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-09 15:47 UTC (permalink / raw)
  To: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner
  Cc: mtk.manpages, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux

On 11/24/2014 12:53 PM, David Drysdale wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 man2/execveat.2

David,

Thanks for the very nicely prepared man page. I've done 
a few very light edits, and will release the version below 
with the next man-pages release.

I have one question. In the message accompanying
commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:

  The filename fed to the executed program as argv[0] (or the name of the
  script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
  (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
  reflecting how the executable was found.  This does however mean that
  execution of a script in a /proc-less environment won't work; also, script
  execution via an O_CLOEXEC file descriptor fails (as the file will not be
  accessible after exec).

How does one produce this situation where the execed program sees 
argv[0] as a /dev/fd path? (i.e., what would the execveat()
call look like?) I tried to produce this scenario, but could not.

Cheers,

Michael

.\" Copyright (c) 2014 Google, Inc.
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH EXECVEAT 2 2015-01-09 "Linux" "Linux Programmer's Manual"
.SH NAME
execveat \- execute program relative to a directory file descriptor
.SH SYNOPSIS
.B #include <unistd.h>
.sp
.BI "int execveat(int " dirfd ", const char *" pathname ","
.br
.BI "             char *const " argv "[], char *const " envp "[],"
.br
.BI "             int " flags );
.SH DESCRIPTION
.\" commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
The
.BR execveat ()
system call executes the program referred to by the combination of
.I dirfd
and
.IR pathname .
It operates in exactly the same way as
.BR execve (2),
except for the differences described in this manual page.

If the pathname given in
.I pathname
is relative, then it is interpreted relative to the directory
referred to by the file descriptor
.I dirfd
(rather than relative to the current working directory of
the calling process, as is done by
.BR execve (2)
for a relative pathname).

If
.I pathname
is relative and
.I dirfd
is the special value
.BR AT_FDCWD ,
then
.I pathname
is interpreted relative to the current working
directory of the calling process (like
.BR execve (2)).

If
.I pathname
is absolute, then
.I dirfd
is ignored.

If
.I pathname
is an empty string and the
.BR AT_EMPTY_PATH
flag is specified, then the file descriptor
.I dirfd
specifies the file to be executed (i.e.,
.IR dirfd
refers to an executable file, rather than a directory).

The
.I flags
argument is a bit mask that can include zero or more of the following flags:
.TP
.BR AT_EMPTY_PATH
If
.I pathname
is an empty string, operate on the file referred to by
.IR dirfd
(which may have been obtained using the
.BR open (2)
.B O_PATH
flag).
.TP
.B AT_SYMLINK_NOFOLLOW
If the file identified by
.I dirfd
and a non-NULL
.I pathname
is a symbolic link, then the call fails with the error
.BR EINVAL .
.SH "RETURN VALUE"
On success,
.BR execveat ()
does not return. On error, \-1 is returned, and
.I errno
is set appropriately.
.SH ERRORS
The same errors that occur for
.BR execve (2)
can also occur for
.BR execveat ().
The following additional errors can occur for
.BR execveat ():
.TP
.B EBADF
.I dirfd
is not a valid file descriptor.
.TP
.B EINVAL
.I flags
includes
.BR AT_SYMLINK_NOFOLLOW
and the file identified by
.I dirfd
and a non-NULL
.I pathname
is a symbolic link.
.TP
.B EINVAL
Invalid flag specified in
.IR flags .
.TP
.B ENOENT
The program identified by
.I dirfd
and
.I pathname
requires the use of an interpreter program
(such as a script starting with "#!"), but the file descriptor
.I dirfd
was opened with the
.B O_CLOEXEC
flag, with the result that
the program file is inaccessible to the launched interpreter.
.TP
.B ENOTDIR
.I pathname
is relative and
.I dirfd
is a file descriptor referring to a file other than a directory.
.SH VERSIONS
.BR execveat ()
was added to Linux in kernel 3.19.
GNU C library support is pending.
.\" FIXME . check for glibc support in a future release
.SH CONFORMING TO
The
.BR execveat ()
system call is Linux-specific.
.SH NOTES
In addition to the reasons explained in
.BR openat (2),
the
.BR execveat ()
system call is also needed to allow
.BR fexecve (3)
to be implemented on systems that do not have the
.I /proc
filesystem mounted.
.SH SEE ALSO
.BR execve (2),
.BR openat (2),
.BR fexecve (3)
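
A minimal sketch of the AT_EMPTY_PATH case described in the page above,
assuming a kernel that provides execveat() (3.19 or later) and headers
that define SYS_execveat.  Since GNU C library support is pending, the
sketch goes through syscall(2); execveat_raw() and run_by_fd() are
hypothetical helper names used only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for AT_EMPTY_PATH and syscall() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the kernel's uapi fcntl.h */
#endif

/* Hypothetical wrapper: invoke the raw system call, as glibc has no stub. */
static int execveat_raw(int dirfd, const char *pathname,
			char *const argv[], char *const envp[], int flags)
{
#ifdef SYS_execveat
	return (int)syscall(SYS_execveat, dirfd, pathname, argv, envp, flags);
#else
	errno = ENOSYS;		/* headers too old to know the syscall */
	return -1;
#endif
}

/*
 * Hypothetical helper: open "path", then execute it in a child process
 * purely by file descriptor (empty pathname plus AT_EMPTY_PATH).
 * Returns the child's exit status, or -1 if setup failed or the child
 * did not exit normally.  A child whose execveat() fails exits with 127.
 */
static int run_by_fd(const char *path, char *const argv[], char *const envp[])
{
	int status;
	pid_t pid;
	int fd = open(path, O_RDONLY);	/* deliberately without O_CLOEXEC */

	if (fd < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0) {
		/* Only returns on failure. */
		execveat_raw(fd, "", argv, envp, AT_EMPTY_PATH);
		_exit(127);
	}

	close(fd);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_by_fd("/bin/sh", argv, envp) with argv set to
{"sh", "-c", "exit 7", NULL} should return 7 on a kernel that has
execveat().  The descriptor is opened without O_CLOEXEC on purpose: as the
ENOENT entry above notes, a script executed this way would otherwise be
inaccessible to its interpreter.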

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 15:47     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-09 15:47 UTC (permalink / raw)
  To: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner
  Cc: mtk.manpages, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux

On 11/24/2014 12:53 PM, David Drysdale wrote:
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 man2/execveat.2

David,

Thanks for the very nicely prepared man page. I've done 
a few very light edits, and will release the version below 
with the next man-pages release.

I have one question. In the message accompanying
commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:

  The filename fed to the executed program as argv[0] (or the name of the
  script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
  (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
  reflecting how the executable was found.  This does however mean that
  execution of a script in a /proc-less environment won't work; also, script
  execution via an O_CLOEXEC file descriptor fails (as the file will not be
  accessible after exec).

How does one produce this situation where the execed program sees 
argv[0] as a /dev/fd path? (i.e., what would the execveat()
call look like?) I tried to produce this scenario, but could not.

Cheers,

Michael

.\" Copyright (c) 2014 Google, Inc.
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date.  The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein.  The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH EXECVEAT 2 2015-01-09 "Linux" "Linux Programmer's Manual"
.SH NAME
execveat \- execute program relative to a directory file descriptor
.SH SYNOPSIS
.B #include <unistd.h>
.sp
.BI "int execveat(int " dirfd ", const char *" pathname ","
.br
.BI "             char *const " argv "[], char *const " envp "[],"
.br
.BI "             int " flags );
.SH DESCRIPTION
.\" commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
The
.BR execveat ()
system call executes the program referred to by the combination of
.I dirfd
and
.IR pathname .
It operates in exactly the same way as
.BR execve (2),
except for the differences described in this manual page.

If the pathname given in
.I pathname
is relative, then it is interpreted relative to the directory
referred to by the file descriptor
.I dirfd
(rather than relative to the current working directory of
the calling process, as is done by
.BR execve (2)
for a relative pathname).

If
.I pathname
is relative and
.I dirfd
is the special value
.BR AT_FDCWD ,
then
.I pathname
is interpreted relative to the current working
directory of the calling process (like
.BR execve (2)).

If
.I pathname
is absolute, then
.I dirfd
is ignored.

If
.I pathname
is an empty string and the
.BR AT_EMPTY_PATH
flag is specified, then the file descriptor
.I dirfd
specifies the file to be executed (i.e.,
.IR dirfd
refers to an executable file, rather than a directory).

The
.I flags
argument is a bit mask that can include zero or more of the following flags:
.TP
.BR AT_EMPTY_PATH
If
.I pathname
is an empty string, operate on the file referred to by
.IR dirfd
(which may have been obtained using the
.BR open (2)
.B O_PATH
flag).
.TP
.B AT_SYMLINK_NOFOLLOW
If the file identified by
.I dirfd
and a non-NULL
.I pathname
is a symbolic link, then the call fails with the error
.BR EINVAL .
.SH "RETURN VALUE"
On success,
.BR execveat ()
does not return. On error \-1 is returned, and
.I errno
is set appropriately.
.SH ERRORS
The same errors that occur for
.BR execve (2)
can also occur for
.BR execveat ().
The following additional errors can occur for
.BR execveat ():
.TP
.B EBADF
.I dirfd
is not a valid file descriptor.
.TP
.B EINVAL
.I flags
includes
.BR AT_SYMLINK_NOFOLLOW
and the file identified by
.I dirfd
and a non-NULL
.I pathname
is a symbolic link.
.TP
.B EINVAL
Invalid flag specified in
.IR flags .
.TP
.B ENOENT
The program identified by
.I dirfd
and
.I pathname
requires the use of an interpreter program
(such as a script starting with "#!"), but the file descriptor
.I dirfd
was opened with the
.B O_CLOEXEC
flag, with the result that
the program file is inaccessible to the launched interpreter.
.TP
.B ENOTDIR
.I pathname
is relative and
.I dirfd
is a file descriptor referring to a file other than a directory.
.SH VERSIONS
.BR execveat ()
was added to Linux in kernel 3.19.
GNU C library support is pending.
.\" FIXME . check for glibc support in a future release
.SH CONFORMING TO
The
.BR execveat ()
system call is Linux-specific.
.SH NOTES
In addition to the reasons explained in
.BR openat (2),
the
.BR execveat ()
system call is also needed to allow
.BR fexecve (3)
to be implemented on systems that do not have the
.I /proc
filesystem mounted.
.SH SEE ALSO
.BR execve (2),
.BR openat (2),
.BR fexecve (3)
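
As a rough illustration of the NOTES section above, a /proc-free
fexecve(3) could be a thin wrapper around execveat() with AT_EMPTY_PATH.
This is only a sketch: fexecve_noproc() and run_via_fd() are hypothetical
names, not glibc functions, and the raw syscall(2) interface is used on
the assumption that the C library has no wrapper yet. It also assumes
kernel headers that define SYS_execveat.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* open(), O_RDONLY, O_CLOEXEC, AT_EMPTY_PATH */
#include <sys/syscall.h>  /* SYS_execveat */
#include <sys/wait.h>     /* waitpid(), WIFEXITED(), WEXITSTATUS() */
#include <unistd.h>       /* syscall(), fork(), _exit() */

/* Hypothetical helper: execute the file behind fd without using /proc. */
static int fexecve_noproc(int fd, char *const argv[], char *const envp[])
{
    /* Empty pathname + AT_EMPTY_PATH: fd itself names the executable. */
    return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}

/*
 * Hypothetical helper: run the binary at path in a child process via its
 * file descriptor. Returns the child's exit status, or -1 on failure.
 */
static int run_via_fd(const char *path, char *const argv[])
{
    char *const envp[] = { NULL };   /* empty environment for the child */
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int fd = open(path, O_RDONLY | O_CLOEXEC);

        if (fd >= 0)
            fexecve_noproc(fd, argv, envp);
        _exit(127);   /* reached only if open() or execveat() failed */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, run_via_fd("/bin/sh", (char *[]){ "sh", "-c", "exit 7", NULL })
should return 7 on a kernel that supports execveat(). Opening with
O_CLOEXEC is fine here because the target is a binary, not a script.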

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 15:47     ` Michael Kerrisk (man-pages)
@ 2015-01-09 16:13       ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 16:13 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux

On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
> On 11/24/2014 12:53 PM, David Drysdale wrote:
> > Signed-off-by: David Drysdale <drysdale@google.com>
> > ---
> >  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 153 insertions(+)
> >  create mode 100644 man2/execveat.2
> 
> David,
> 
> Thanks for the very nicely prepared man page. I've done 
> a few very light edits, and will release the version below 
> with the next man-pages release.
> 
> I have one question. In the message accompanying
> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
> 
>   The filename fed to the executed program as argv[0] (or the name of the
>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>   reflecting how the executable was found.  This does however mean that
>   execution of a script in a /proc-less environment won't work; also, script
>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>   accessible after exec).
> 
> How does one produce this situation where the execed program sees 
> argv[0] as a /dev/fd path? (i.e., what would the execveat()
> call look like?) I tried to produce this scenario, but could not.

I think this is wrong. argv[0] is an arbitrary string provided by the
caller and would never be derived from the fd passed. It's AT_EXECFN,
/proc/self/exe, and filenames shown elsewhere in /proc that may be
derived in odd ways.

I would also move the text about O_CLOEXEC to a BUGS or NOTES section
rather than the main description. The long-term intent should be that
script execution this way should work. IIRC this was discussed earlier
in the thread.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 17:46         ` David Drysdale
  0 siblings, 0 replies; 123+ messages in thread
From: David Drysdale @ 2015-01-09 17:46 UTC (permalink / raw)
  To: Rich Felker
  Cc: Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Christoph Hellwig,
	X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>> > Signed-off-by: David Drysdale <drysdale@google.com>
>> > ---
>> >  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >  1 file changed, 153 insertions(+)
>> >  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done
>> a few very light edits, and will release the version below
>> with the next man-pages release.
>>
>> I have one question. In the message accompanying
>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>
>>   The filename fed to the executed program as argv[0] (or the name of the
>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>   reflecting how the executable was found.  This does however mean that
>>   execution of a script in a /proc-less environment won't work; also, script
>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>   accessible after exec).
>>
>> How does one produce this situation where the execed program sees
>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>> call look like?) I tried to produce this scenario, but could not.
>
> I think this is wrong. argv[0] is an arbitrary string provided by the
> caller and would never be derived from the fd passed.

Yeah, I think I just wrote that wrong, it's only relevant for scripts.
As Rich says, for normal binaries argv[0] is just the argv[0] that
was passed into the execve[at] call.  For a script, the code in
fs/binfmt_script.c will remove the original argv[0] and put the
interpreter name and the script filename (e.g. "/bin/sh",
"/dev/fd/6/script") in as 2 arguments in its place.

[As an aside, IIRC the filename does get put into the new
process's memory, up above the environment strings -- but
that copy isn't visible via argv nor envp.]
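
To make the script case concrete, here is a small userspace model of the
rewrite described above. It is not the kernel code from
fs/binfmt_script.c: script_name() and script_argv() are hypothetical
helpers, the optional interpreter argument from the "#!" line is ignored,
and dirfd is assumed to be a real descriptor (not AT_FDCWD) with a
relative or empty pathname.

```c
#include <stdio.h>    /* snprintf() */
#include <stdlib.h>   /* malloc(), calloc(), free() */
#include <string.h>   /* strlen() */

/* Build "/dev/fd/<fd>" (empty pathname) or "/dev/fd/<fd>/<pathname>". */
static char *script_name(int fd, const char *pathname)
{
    /* 32 bytes cover "/dev/fd/", the digits of fd, one '/' and the NUL. */
    size_t size = strlen(pathname) + 32;
    char *name = malloc(size);

    if (name == NULL)
        return NULL;
    if (pathname[0] == '\0')
        snprintf(name, size, "/dev/fd/%d", fd);
    else
        snprintf(name, size, "/dev/fd/%d/%s", fd, pathname);
    return name;
}

/*
 * Model of the argv seen by the interpreter: the caller's argv[0] is
 * dropped and replaced by two entries, the interpreter name and the
 * script name; the remaining arguments follow unchanged.
 * The caller frees out[1] and then out.
 */
static char **script_argv(const char *interp, int fd, const char *pathname,
                          int argc, char *const argv[])
{
    int rest = argc > 1 ? argc - 1 : 0;   /* arguments after argv[0] */
    char **out = calloc((size_t)rest + 3, sizeof(*out));

    if (out == NULL)
        return NULL;
    out[0] = (char *)interp;
    out[1] = script_name(fd, pathname);
    if (out[1] == NULL) {
        free(out);
        return NULL;
    }
    for (int i = 0; i < rest; i++)
        out[2 + i] = argv[1 + i];
    return out;   /* calloc() already left the final slot NULL */
}
```

With interp "/bin/sh", fd 6, pathname "script" and an original argv of
{ "myscript", "arg1" }, the model yields { "/bin/sh", "/dev/fd/6/script",
"arg1" }, which is the shape given in the example above.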

> It's AT_EXECFN,
> /proc/self/exe, and filenames shown elsewhere in /proc that may be
> derived in odd ways.
>
> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> rather than the main description. The long-term intent should be that
> script execution this way should work. IIRC this was discussed earlier
> in the thread.

I may be misremembering, but I thought we hoped to be able to fix
execveat of a script without /proc in future, but didn't expect to fix
execveat of a script via an O_CLOEXEC fd (because in the latter
case the fd gets closed before the script interpreter runs, so even
if the interpreter (or a special filesystem) does clever things for names
starting with "/dev/fd/..." the file descriptor is already gone).
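
The O_CLOEXEC failure can be reproduced with something like the
following. This is a sketch under assumptions: exec_script_via_fd() is a
hypothetical helper, /tmp must be writable and not mounted noexec, and
/bin/sh must exist. The child reports the errno from a failed execveat()
as its exit status.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* open(), O_RDONLY, AT_EMPTY_PATH */
#include <stdlib.h>       /* mkstemp() */
#include <sys/stat.h>     /* fchmod() */
#include <sys/syscall.h>  /* SYS_execveat */
#include <sys/wait.h>     /* waitpid() */
#include <unistd.h>

/*
 * Hypothetical demo: try to run a "#!" script through a descriptor opened
 * with O_RDONLY | extra_open_flags. Returns the errno seen by the child
 * when open() or execveat() fails, the script's exit status (0) if it
 * ran, or -1 if the setup itself failed.
 */
static int exec_script_via_fd(int extra_open_flags)
{
    static const char body[] = "#!/bin/sh\nexit 0\n";
    char path[] = "/tmp/execveat_demoXXXXXX";
    char *const argv[] = { "demo", NULL };
    char *const envp[] = { NULL };
    int status, result = -1;
    int wfd = mkstemp(path);
    pid_t pid;

    if (wfd < 0)
        return -1;
    if (write(wfd, body, sizeof(body) - 1) != (ssize_t)(sizeof(body) - 1) ||
        fchmod(wfd, 0700) != 0) {
        close(wfd);
        unlink(path);
        return -1;
    }
    close(wfd);   /* a file still open for writing cannot be executed */

    pid = fork();
    if (pid == 0) {
        int fd = open(path, O_RDONLY | extra_open_flags);

        if (fd >= 0)
            syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
        _exit(errno);   /* reached only if open() or execveat() failed */
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);
    unlink(path);
    return result;
}
```

On a 3.19 kernel, exec_script_via_fd(O_CLOEXEC) is expected to return
ENOENT, matching the ERRORS entry in the draft page. Calling it with 0
instead additionally depends on /dev/fd being available to the
interpreter.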

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 15:47     ` Michael Kerrisk (man-pages)
@ 2015-01-09 18:02       ` David Drysdale
  -1 siblings, 0 replies; 123+ messages in thread
From: David Drysdale @ 2015-01-09 18:02 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 9, 2015 at 3:47 PM, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
> On 11/24/2014 12:53 PM, David Drysdale wrote:
>> Signed-off-by: David Drysdale <drysdale@google.com>
>> ---
>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 man2/execveat.2
>
> David,
>
> Thanks for the very nicely prepared man page. I've done
> a few very light edits, and will release the version below
> with the next man-pages release.

Many thanks, one error (of mine) in 2 places pointed out below.


> .TH EXECVEAT 2 2015-01-09 "Linux" "Linux Programmer's Manual"
> .SH NAME
> execveat \- execute program relative to a directory file descriptor
> .SH SYNOPSIS
> .B #include <unistd.h>
> .sp
> .BI "int execveat(int " dirfd ", const char *" pathname ","
> .br
> .BI "             char *const " argv "[], char *const " envp "[],"
> .br
> .BI "             int " flags );
> .SH DESCRIPTION
> .\" commit 51f39a1f0cea1cacf8c787f652f26dfee9611874
> The
> .BR execveat ()
> system call executes the program referred to by the combination of
> .I dirfd
> and
> .IR pathname .
> It operates in exactly the same way as
> .BR execve (2),
> except for the differences described in this manual page.
>
> If the pathname given in
> .I pathname
> is relative, then it is interpreted relative to the directory
> referred to by the file descriptor
> .I dirfd
> (rather than relative to the current working directory of
> the calling process, as is done by
> .BR execve (2)
> for a relative pathname).
>
> If
> .I pathname
> is relative and
> .I dirfd
> is the special value
> .BR AT_FDCWD ,
> then
> .I pathname
> is interpreted relative to the current working
> directory of the calling process (like
> .BR execve (2)).
>
> If
> .I pathname
> is absolute, then
> .I dirfd
> is ignored.
>
> If
> .I pathname
> is an empty string and the
> .BR AT_EMPTY_PATH
> flag is specified, then the file descriptor
> .I dirfd
> specifies the file to be executed (i.e.,
> .IR dirfd
> refers to an executable file, rather than a directory).
>
> The
> .I flags
> argument is a bit mask that can include zero or more of the following flags:
> .TP
> .BR AT_EMPTY_PATH
> If
> .I pathname
> is an empty string, operate on the file referred to by
> .IR dirfd
> (which may have been obtained using the
> .BR open (2)
> .B O_PATH
> flag).
> .TP
> .B AT_SYMLINK_NOFOLLOW
> If the file identified by
> .I dirfd
> and a non-NULL
> .I pathname
> is a symbolic link, then the call fails with the error
> .BR EINVAL .

Apologies, I think this should be ELOOP.

> .SH "RETURN VALUE"
> On success,
> .BR execveat ()
> does not return. On error \-1 is returned, and
> .I errno
> is set appropriately.
> .SH ERRORS
> The same errors that occur for
> .BR execve (2)
> can also occur for
> .BR execveat ().
> The following additional errors can occur for
> .BR execveat ():
> .TP
> .B EBADF
> .I dirfd
> is not a valid file descriptor.
> .TP
> .B EINVAL

ELOOP here too.

> .I flags
> includes
> .BR AT_SYMLINK_NOFOLLOW
> and the file identified by
> .I dirfd
> and a non-NULL
> .I pathname
> is a symbolic link.
> .TP
> .B EINVAL
> Invalid flag specified in
> .IR flags .
> .TP
> .B ENOENT
> The program identified by
> .I dirfd
> and
> .I pathname
> requires the use of an interpreter program
> (such as a script starting with "#!"), but the file descriptor
> .I dirfd
> was opened with the
> .B O_CLOEXEC
> flag, with the result that
> the program file is inaccessible to the launched interpreter.
> .TP
> .B ENOTDIR
> .I pathname
> is relative and
> .I dirfd
> is a file descriptor referring to a file other than a directory.
> .SH VERSIONS
> .BR execveat ()
> was added to Linux in kernel 3.19.
> GNU C library support is pending.
> .\" FIXME . check for glibc support in a future release
> .SH CONFORMING TO
> The
> .BR execveat ()
> system call is Linux-specific.
> .SH NOTES
> In addition to the reasons explained in
> .BR openat (2),
> the
> .BR execveat ()
> system call is also needed to allow
> .BR fexecve (3)
> to be implemented on systems that do not have the
> .I /proc
> filesystem mounted.
> .SH SEE ALSO
> .BR execve (2),
> .BR openat (2),
> .BR fexecve (3)
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 20:48           ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 20:48 UTC (permalink / raw)
  To: David Drysdale
  Cc: Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Christoph Hellwig,
	X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 09, 2015 at 05:46:28PM +0000, David Drysdale wrote:
> > It's AT_EXECFN,
> > /proc/self/exe, and filenames shown elsewhere in /proc that may be
> > derived in odd ways.
> >
> > I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> > rather than the main description. The long-term intent should be that
> > script execution this way should work. IIRC this was discussed earlier
> > in the thread.
> 
> I may be misremembering, but I thought we hoped to be able to fix
> execveat of a script without /proc in future, but didn't expect to fix
> execveat of a script via an O_CLOEXEC fd (because in the latter
> case the fd gets closed before the script interpreter runs, so even
> if the interpreter (or a special filesystem) does clever things for names
> starting with "/dev/fd/..." the file descriptor is already gone).

I think this is a case that needs to be fixed, though it's hard. The
normal correct usage for fexecve is to always pass an O_CLOEXEC file
descriptor, and the caller can't really be expected to know whether
the file is a script or not. We discussed workarounds before and one
idea I proposed was having fexecve provide a "one open only" magic
symlink in /proc/self/ to pass to the interpreter. It would behave
like an O_PATH file descriptor magic symlink in /proc/self/fd, but
would automatically cease to exist on the first open (at which point
the interpreter would have a real O_RDONLY file descriptor for the
underlying file).

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 20:48           ` Rich Felker
@ 2015-01-09 20:56             ` Al Viro
  -1 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-09 20:56 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
> I think this is a case that needs to be fixed, though it's hard. The
> normal correct usage for fexecve is to always pass an O_CLOEXEC file
> descriptor, and the caller can't really be expected to know whether
> the file is a script or not. We discussed workarounds before and one
> idea I proposed was having fexecve provide a "one open only" magic
> symlink in /proc/self/ to pass to the interpreter. It would behave
> like an O_PATH file descriptor magic symlink in /proc/self/fd, but
> would automatically cease to exist on the first open (at which point
> the interpreter would have a real O_RDONLY file descriptor for the
> underlying file).

For fsck sake, folks, if you have bloody /proc, you don't need that shite
at all!  Just do execve on /proc/self/fd/n, and be done with that.
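
A minimal sketch of that suggestion, assuming /proc is mounted.  The
helper name spawn_via_proc_fd is hypothetical (it is not from the thread
or from any libc); it runs the file behind an already-open descriptor by
naming it through procfs and reports the child's exit status.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), for callers that obtain the fd */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): run the file behind fd in a
 * child process by naming it through procfs, then return its exit status.
 * Returns -1 if the child could not be created or did not exit normally. */
static int spawn_via_proc_fd(int fd, char *const argv[], char *const envp[])
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    assert(n > 0 && (size_t)n < sizeof path);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* "self" is resolved by the child, which inherited fd under the
         * same number, so the path names the same open file. */
        execve(path, argv, envp);
        _exit(127); /* reached only if execve() failed, e.g. no /proc */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would open the file, verify it as needed, and then pass the
descriptor along with the usual argv and envp arrays.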

The sole excuse for merging that thing in the first place had been
"would anybody think of children^Wsclerotic^Whardened environments
where they have no /proc at all".

Sheesh...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 20:59               ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 20:59 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 08:56:26PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
> > I think this is a case that needs to be fixed, though it's hard. The
> > normal correct usage for fexecve is to always pass an O_CLOEXEC file
> > descriptor, and the caller can't really be expected to know whether
> > the file is a script or not. We discussed workarounds before and one
> > idea I proposed was having fexecve provide a "one open only" magic
> > symlink in /proc/self/ to pass to the interpreter. It would behave
> > like an O_PATH file descriptor magic symlink in /proc/self/fd, but
> > would automatically cease to exist on the first open (at which point
> > the interpreter would have a real O_RDONLY file descriptor for the
> > underlying file).
> 
> For fsck sake, folks, if you have bloody /proc, you don't need that shite
> at all!  Just do execve on /proc/self/fd/n, and be done with that.
> 
> The sole excuse for merging that thing in the first place had been
> "would anybody think of children^Wsclerotic^Whardened environments
> where they have no /proc at all".

That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
the time the interpreter runs, whether you're using execveat or
execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
problem. This breaks the intended idiom for fexecve.
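
A rough sketch of that failure mode, assuming /bin/sh exists and /tmp is
writable and not mounted noexec.  The helper run_script_via_fexecve is
hypothetical; it writes a trivial "#!" script, opens it with the flags
given, and runs it through fexecve() in a child.  With O_CLOEXEC the
result is expected to be non-zero, because the script never gets to run;
with 0 it should normally be 0 where /dev/fd is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical demo helper: run a trivial "#!/bin/sh" script via fexecve()
 * in a child, opening it with extra_open_flags (0 or O_CLOEXEC).
 * Returns the child's exit status, or -1 on setup failure. */
static int run_script_via_fexecve(int extra_open_flags)
{
    static const char script[] = "#!/bin/sh\nexit 0\n";
    char tmpl[] = "/tmp/fexecve-demo-XXXXXX";

    assert(extra_open_flags == 0 || extra_open_flags == O_CLOEXEC);

    int wfd = mkstemp(tmpl);
    if (wfd < 0)
        return -1;
    if (write(wfd, script, sizeof script - 1) != (ssize_t)(sizeof script - 1) ||
        fchmod(wfd, 0700) != 0) {
        close(wfd);
        unlink(tmpl);
        return -1;
    }
    close(wfd); /* no writer may remain, or exec fails with ETXTBSY */

    int fd = open(tmpl, O_RDONLY | extra_open_flags);
    if (fd < 0) {
        unlink(tmpl);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        char *argv[] = { "demo-script", NULL };
        fexecve(fd, argv, environ);
        _exit(127); /* reached only if fexecve() itself failed */
    }

    int status = 0;
    int ok = pid > 0 && waitpid(pid, &status, 0) == pid;
    close(fd);
    unlink(tmpl);
    return (ok && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Comparing run_script_via_fexecve(0) with run_script_via_fexecve(O_CLOEXEC)
on a given system shows whether the close-on-exec flag alone is what
prevents the script from running.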

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 21:09                 ` Al Viro
  0 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-09 21:09 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 03:59:26PM -0500, Rich Felker wrote:

> > For fsck sake, folks, if you have bloody /proc, you don't need that shite
> > at all!  Just do execve on /proc/self/fd/n, and be done with that.
> > 
> > The sole excuse for merging that thing in the first place had been
> > "would anybody think of children^Wsclerotic^Whardened environments
> > where they have no /proc at all".
> 
> That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> the time the interpreter runs, whether you're using execveat or
> execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> problem. This breaks the intended idiom for fexecve.

Just what will your magical symlink do in the case where the file is opened,
unlinked and marked O_CLOEXEC?  When should actual freeing of disk blocks,
etc. happen?  And no, you can't assume that interpreter will open the
damn thing even once - there's nothing to oblige it to do so.

Al, more and more tempted to ask for reverting the whole thing - this hardcoded
/dev/fd/... (in fs/exec.c, no less) is disgraceful enough, but threats of
even more revolting kludges in the name of "intended idiom for fexecve"...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 20:59               ` Rich Felker
  (?)
@ 2015-01-09 21:20                 ` Eric W. Biederman
  -1 siblings, 0 replies; 123+ messages in thread
From: Eric W. Biederman @ 2015-01-09 21:20 UTC (permalink / raw)
  To: Rich Felker
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

Rich Felker <dalias@aerifal.cx> writes:

> On Fri, Jan 09, 2015 at 08:56:26PM +0000, Al Viro wrote:
>> On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
>> > I think this is a case that needs to be fixed, though it's hard. The
>> > normal correct usage for fexecve is to always pass an O_CLOEXEC file
>> > descriptor, and the caller can't really be expected to know whether
>> > the file is a script or not. We discussed workarounds before and one
>> > idea I proposed was having fexecve provide a "one open only" magic
>> > symlink in /proc/self/ to pass to the interpreter. It would behave
>> > like an O_PATH file descriptor magic symlink in /proc/self/fd, but
>> > would automatically cease to exist on the first open (at which point
>> > the interpreter would have a real O_RDONLY file descriptor for the
>> > underlying file).
>> 
>> For fsck sake, folks, if you have bloody /proc, you don't need that shite
>> at all!  Just do execve on /proc/self/fd/n, and be done with that.
>> 
>> The sole excuse for merging that thing in the first place had been
>> "would anybody think of children^Wsclerotic^Whardened environments
>> where they have no /proc at all".
>
> That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> the time the interpreter runs, whether you're using execveat or
> execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> problem. This breaks the intended idiom for fexecve.

O_CLOEXEC with a #! interpreter cannot work.  If the file descriptor is
closed, a #! interpreter cannot open it.  So I don't know why or how
you want that to work but it is nonsense.

This certainly does not break the intended usage for execveat.

Eric


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 21:09                 ` Al Viro
@ 2015-01-09 21:28                   ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 21:28 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 03:59:26PM -0500, Rich Felker wrote:
> 
> > > For fsck sake, folks, if you have bloody /proc, you don't need that shite
> > > at all!  Just do execve on /proc/self/fd/n, and be done with that.
> > > 
> > > The sole excuse for merging that thing in the first place had been
> > > "would anybody think of children^Wsclerotic^Whardened environments
> > > where they have no /proc at all".
> > 
> > That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> > the time the interpreter runs, whether you're using execveat or
> > execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> > problem. This breaks the intended idiom for fexecve.
> 
> Just what will your magical symlink do in the case where the file is opened,
> unlinked and marked O_CLOEXEC?  When should actual freeing of disk blocks,
> etc. happen?  And no, you can't assume that interpreter will open the
> damn thing even once - there's nothing to oblige it to do so.

Unlinking is not relevant. Magical symlinks refer to open file
descriptions (either real ones or O_PATH inode-reference-only ones),
not files. There is no new complexity proposed for freeing disk blocks
here. Semantics are identical to existing O_PATH inode references.

> Al, more and more tempted to ask for reverting the whole thing - this hardcoded
> /dev/fd/... (in fs/exec.c, no less) is disgraceful enough, but threats of
> even more revolting kludges in the name of "intended idiom for fexecve"...

If you have a multithreaded process that's executing an external
program via fexecve, then unless it has specialized knowledge about
what other parts of the program/libraries are doing, it needs to be
using O_CLOEXEC for the file descriptor. Otherwise, the file
descriptor could be leaked to child processes started by other
threads. This is what I mean by the "intended idiom". Note that it's
easier to use pathnames instead of fexecve, but doing so may not be an
option if the program needs to verify the file before exec'ing it.

This issue can be avoided if you're going to fork-and-fexecve rather
than replacing the calling process, since after forking it's safe to
remove the close-on-exec flag. But then you still have the issue that
the child process, after exec, keeps a spurious file descriptor to its
own process image (executable file) open which it can never close
(because it doesn't know the number). This could eventually lead to fd
exhaustion after many generations.
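
The fork-and-fexecve variant might look roughly like this.  It is only a
sketch: the helper name spawn_via_fexecve is hypothetical, and the
caller is assumed to have opened fd with O_CLOEXEC.  The flag is cleared
only in the child, which is single-threaded after fork(), so no other
thread can leak the descriptor in the window before exec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, drop close-on-exec in the child only, then
 * fexecve() the file behind fd.  Returns the child's exit status, or -1
 * if the child could not be created or did not exit normally. */
static int spawn_via_fexecve(int fd, char *const argv[], char *const envp[])
{
    assert(fd >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Only this child sees the cleared flag; the parent's copy of
         * the descriptor keeps FD_CLOEXEC set. */
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0)
            _exit(127);
        fexecve(fd, argv, envp);
        _exit(127); /* reached only if fexecve() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The new program still inherits fd without knowing its number, which is
the spurious descriptor described above.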

The "open-once magic symlink" approach is really the cleanest
solution I can find. In the case where the interpreter does not open
the script, nothing terribly bad happens; the magic symlink just
sticks around until _exit or exec. In the case where the interpreter
opens it more than once, you get a failure, but as far as I know
existing interpreters don't do this, and it's arguably bad design. In
any case it's a caught error.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 21:31                   ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 21:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 09, 2015 at 03:20:04PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > On Fri, Jan 09, 2015 at 08:56:26PM +0000, Al Viro wrote:
> >> On Fri, Jan 09, 2015 at 03:48:15PM -0500, Rich Felker wrote:
> >> > I think this is a case that needs to be fixed, though it's hard. The
> >> > normal correct usage for fexecve is to always pass an O_CLOEXEC file
> >> > descriptor, and the caller can't really be expected to know whether
> >> > the file is a script or not. We discussed workarounds before and one
> >> > idea I proposed was having fexecve provide a "one open only" magic
> >> > symlink in /proc/self/ to pass to the interpreter. It would behave
> >> > like an O_PATH file descriptor magic symlink in /proc/self/fd, but
> >> > would automatically cease to exist on the first open (at which point
> >> > the interpreter would have a real O_RDONLY file descriptor for the
> >> > underlying file).
> >> 
> >> For fsck sake, folks, if you have bloody /proc, you don't need that shite
> >> at all!  Just do execve on /proc/self/fd/n, and be done with that.
> >> 
> >> The sole excuse for merging that thing in the first place had been
> >> "would anybody think of children^Wsclerotic^Whardened environments
> >> where they have no /proc at all".
> >
> > That doesn't work. With O_CLOEXEC, /proc/self/fd/n is already gone at
> > the time the interpreter runs, whether you're using execveat or
> > execve with "/proc/self/fd/n" to implement POSIX fexecve(). That's the
> > problem. This breaks the intended idiom for fexecve.
> 
> O_CLOEXEC with a #! interpreter can not work.  If the file descriptor is
> closed a #! interpreter can not open it.   So I don't know why or how
> you want that to work but it is nonsense.

The why is simple: fexecve always expects a close-on-exec file
descriptor. Otherwise the program being executed would need to take a
special option telling it to close the spurious fd it inherits. Most
programs don't have such an option, and there's no way to do it
without application-specific knowledge.

The how is difficult, but it can be done.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 21:28                   ` Rich Felker
@ 2015-01-09 21:50                     ` Al Viro
  -1 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-09 21:50 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 04:28:52PM -0500, Rich Felker wrote:

> The "magic open-once magic symlink" approach is really the cleanest
> solution I can find. In the case where the interpreter does not open
> the script, nothing terribly bad happens; the magic symlink just
> sticks around until _exit or exec. In the case where the interpreter
> opens it more than once, you get a failure, but as far as I know
> existing interpreters don't do this, and it's arguably bad design. In
> any case it's a caught error.

You know what's cleaner than that?  git revert 27d6ec7ad
It has just been merged; until 3.19 it's fair game for removal.

And yes, I should've NAKed the damn thing loud and clear, rather than
asking questions back then, getting no answers and letting it slip.
Mea culpa.

Back then the procfs-free environments had been pushed as a serious argument
in favour of merging the damn thing.  Now you guys turn around and say that
we not only need procfs mounted, we need a yet-to-be-added kludge in there
to cope with the actual intended uses.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 22:13                     ` Eric W. Biederman
  0 siblings, 0 replies; 123+ messages in thread
From: Eric W. Biederman @ 2015-01-09 22:13 UTC (permalink / raw)
  To: Rich Felker
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

Rich Felker <dalias@aerifal.cx> writes:

> On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:

> The "magic open-once magic symlink" approach is really the cleanest
> solution I can find. In the case where the interpreter does not open
> the script, nothing terribly bad happens; the magic symlink just
> sticks around until _exit or exec. In the case where the interpreter
> opens it more than once, you get a failure, but as far as I know
> existing interpreters don't do this, and it's arguably bad design. In
> any case it's a caught error.

And it doesn't work without introducing security vulnerabilities into
the kernel, because it breaks close-on-exec semantics.

All you have to do is pick a file descriptor, good candidates are 0 and
255 and make it a convention that that file descriptor is used for
fexecve.  At least when you want to support scripts.  Otherwise you can
set close-on-exec.

That results in no accumulation of file descriptors because everyone
always uses the same file descriptor.

Regardless you don't have a patch and you aren't proposing code and the
code isn't actually broken so please go away.

Eric

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 21:50                     ` Al Viro
@ 2015-01-09 22:17                       ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 22:17 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 09:50:42PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 04:28:52PM -0500, Rich Felker wrote:
> 
> > The "magic open-once magic symlink" approach is really the cleanest
> > solution I can find. In the case where the interpreter does not open
> > the script, nothing terribly bad happens; the magic symlink just
> > sticks around until _exit or exec. In the case where the interpreter
> > opens it more than once, you get a failure, but as far as I know
> > existing interpreters don't do this, and it's arguably bad design. In
> > any case it's a caught error.
> 
> You know what's cleaner than that?  git revert 27d6ec7ad
> It has just been merged; until 3.19 it's fair game for removal.
> 
> And yes, I should've NAKed the damn thing loud and clear, rather than
> asking questions back then, getting no answers and letting it slip.
> Mea culpa.
> 
> Back then the procfs-free environments had been pushed as a serious argument
> in favour of merging the damn thing.  Now you guys turn around and say that
> we not only need procfs mounted, we need a yet-to-be-added kludge in there
> to cope with the actual intended uses.

Reverting does not fix the problem. There is no way to make fexecve
work for scripts without kernel support, and the needed kernel support
without fexecve would be even nastier, since handling of /proc/self/fd
magic-symlinks would need to be special-cased. The added execveat
syscall supports fully /proc-less operation for non-scripts.
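
For illustration, a /proc-free fexecve built on the new syscall could look
roughly like the sketch below. This is an assumption-laden example, not the
implementation of any particular libc: fexecve_sketch and run_shell_via_fd
are made-up names, the raw syscall() interface is used so that no libc
wrapper is assumed, and the /proc/self/fd fallback is only reached on
kernels that report ENOSYS for execveat.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000   /* Linux value, in case the header hides it */
#endif

/* Executes the file referred to by fd.  Only returns on failure. */
static int fexecve_sketch(int fd, char *const argv[], char *const envp[])
{
    char path[64];

    /* Preferred route: empty pathname plus AT_EMPTY_PATH, so no path
     * lookup happens and /proc does not need to be mounted. */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
    if (errno != ENOSYS)
        return -1;

    /* Fallback for kernels without execveat: requires /proc. */
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    execve(path, argv, envp);
    return -1;
}

/* Demo helper: runs "/bin/sh -c <script>" through a close-on-exec fd in a
 * child process and returns the child's exit status, or -1 on error. */
static int run_shell_via_fd(const char *script)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        char *const argv[] = { "sh", "-c", (char *)script, NULL };
        char *const envp[] = { NULL };
        int fd = open("/bin/sh", O_RDONLY | O_CLOEXEC);

        if (fd >= 0)
            fexecve_sketch(fd, argv, envp);
        _exit(127);   /* open or exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Here the descriptor refers to a binary rather than a script, so the
O_CLOEXEC flag is harmless: nothing needs to reopen the file after the
exec.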

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 22:17                       ` Rich Felker
@ 2015-01-09 22:33                         ` Al Viro
  -1 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-09 22:33 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 05:17:28PM -0500, Rich Felker wrote:
> > Back then the procfs-free environments had been pushed as a serious argument
> > in favour of merging the damn thing.  Now you guys turn around and say that
> > we not only need procfs mounted, we need a yet-to-be-added kludge in there
> > to cope with the actual intended uses.
> 
> Reverting does not fix the problem. There is no way to make fexecve
> work for scripts without kernel support, and the needed kernel support
> without fexecve would be even nastier, since handling of /proc/self/fd
> magic-symlinks would need to be special-cased. The added execveat
> syscall supports fully /proc-less operation for non-scripts.

Oh, yes it does.  It's not *our* problem if it's out of tree and not
a part of ABI.  That way if you need it, *you* get to come up with clean
implementation.  If it's in-tree you get leverage to push ugly kludges
further in.  And frankly, I don't trust you to abstain from using that
leverage in rather nasty ways.

Out of curiosity, how would you expect that "open only once" to work?
All reliable variants I see are beyond sick...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 22:13                     ` Eric W. Biederman
@ 2015-01-09 22:38                       ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 22:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 09, 2015 at 04:13:27PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:
> 
> > The "magic open-once magic symlink" approach is really the cleanest
> > solution I can find. In the case where the interpreter does not open
> > the script, nothing terribly bad happens; the magic symlink just
> > sticks around until _exit or exec. In the case where the interpreter
> > opens it more than once, you get a failure, but as far as I know
> > existing interpreters don't do this, and it's arguably bad design. In
> > any case it's a caught error.
> 
> And it doesn't work without introducing security vulnerabilities into
> the kernel, because it breaks close-on-exec semantics.

I'm curious what those security vulnerabilities would be. The standard
issue with close-on-exec failure (e.g. races) is the leaking of
arbitrary file descriptors (typically, ones opened by other threads or
other unrelated portions of the program) to resources the new process
should not have. "Leaking" of an inode-reference-only (no permissions)
O_PATH fd or pseudo-fd to the script that's to be run does not seem
like a vulnerability to me, and it would only be "leaked" if the
interpreter does something unexpected.

> All you have to do is pick a file descriptor, good candidates are 0 and
> 255 and make it a convention that that file descriptor is used for
> fexecve.  At least when you want to support scripts.  Otherwise you can
> set close-on-exec.

0 is obviously not a candidate; it's stdin. 255 is also not a
candidate though. Consider for example something like irssi's /upgrade
that's going to have the child inheriting an arbitrary set of file
descriptors that need to keep their original numbers, possibly
including 255. Imposing a script in between should not cause arbitrary
file descriptors to be lost.

> That results in no accumulation of file descriptors because everyone
> always uses the same file descriptor.
> 
> Regardless you don't have a patch and you aren't proposing code and the
> code isn't actually broken so please go away.

I'm not proposing code because I'm a libc developer not a kernel
developer. I know what's needed for userspace to provide a conforming
fexecve to applications, not how to implement that on the kernel side,
although I'm trying to provide constructive ideas. The hostility is
really not necessary.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 22:33                         ` Al Viro
@ 2015-01-09 22:42                           ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 22:42 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 10:33:00PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 05:17:28PM -0500, Rich Felker wrote:
> > > Back then the procfs-free environments had been pushed as a serious argument
> > > in favour of merging the damn thing.  Now you guys turn around and say that
> > > we not only need procfs mounted, we need a yet-to-be-added kludge in there
> > > to cope with the actual intended uses.
> > 
> > Reverting does not fix the problem. There is no way to make fexecve
> > work for scripts without kernel support, and the needed kernel support
> > without fexecve would be even nastier, since handling of /proc/self/fd
> > magic-symlinks would need to be special-cased. The added execveat
> > syscall supports fully /proc-less operation for non-scripts.
> 
> Oh, yes it does.  It's not *our* problem if it's out of tree and not
> a part of ABI.  That way if you need it, *you* get to come up with clean
> implementation.  If it's in-tree you get leverage to push ugly kludges
> further in.  And frankly, I don't trust you to abstain from using that
> leverage in rather nasty ways.
> 
> Out of curiosity, how would you expect that "open only once" to work?
> All reliable variants I see are beyond sick...

Here's a very simple way it could work -- it could put the O_PATH fd
on a previously-unused fd number, and put a special flag on the fd,
like FD_CLOEXEC, but that causes the kernel to close it whenever it's
opened. The pathname passed could then simply be /dev/fd/%d or
/proc/self/fd/%d, and although this is presently dependent on /proc
being mounted, virtual /dev/fd/* could someday be something completely
independent of procfs. The kernel keeps all the freedom to choose how
to pass the name to the interpreter. I'm not proposing any kernel
API/ABI lock-in and I'm with you in opposing such lock-in.
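
For readers following the thread, the userspace half of this idea (naming
an already-open fd through procfs and exec'ing that name) can be sketched
roughly as below. This is only an illustration under stated assumptions,
not code from any libc: fd_proc_path() and fexecve_via_proc() are made-up
names, and the "close when opened" flag described above does not exist.

```c
#include <stdio.h>
#include <unistd.h>

/* Format the procfs name of an open descriptor, e.g. "/proc/self/fd/7".
 * Returns 0 on success, -1 if the buffer is too small. */
int fd_proc_path(int fd, char *buf, size_t size)
{
    int n = snprintf(buf, size, "/proc/self/fd/%d", fd);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: exec the file behind fd by its procfs name.
 * Like execve(2), it returns only on failure. It needs /proc mounted,
 * which is exactly the dependency being argued about in this thread. */
int fexecve_via_proc(int fd, char *const argv[], char *const envp[])
{
    char path[64];

    if (fd_proc_path(fd, path, sizeof path) != 0)
        return -1;
    return execve(path, argv, envp);
}
```

For a script, the interpreter would then be handed that same name, so the
fd has to stay open across the exec for the name to resolve.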

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-09 22:57                             ` Al Viro
  0 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-09 22:57 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:

> Here's a very simple way it could work -- it could put the O_PATH fd
> on a previously-unused fd number, and put a special flag on the fd,
> like FD_CLOEXEC, but that causes the kernel to close it whenever it's
> opened. The pathname passed could then simply be /dev/fd/%d or
> /proc/self/fd/%d, and although this is presently dependent on /proc
> being mounted, virtual /dev/fd/* could someday be something completely
> independent of procfs. The kernel keeps all the freedom to choose how
> to pass the name to the interpreter. I'm not proposing any kernel
> API/ABI lock-in and I'm with you in opposing such lock-in.

Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
traversed and after that point there is no information whatsoever how we
got to that vfsmount/dentry pair.  I can imagine several kludges that would
work, but they are unspeakably ugly, and do_last() is already far too
convoluted as it is.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 22:57                             ` Al Viro
@ 2015-01-09 23:12                               ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 23:12 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 10:57:43PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:
> 
> > Here's a very simple way it could work -- it could put the O_PATH fd
> > on a previously-unused fd number, and put a special flag on the fd,
> > like FD_CLOEXEC, but that causes the kernel to close it whenever it's
> > opened. The pathname passed could then simply be /dev/fd/%d or
> > /proc/self/fd/%d, and although this is presently dependent on /proc
> > being mounted, virtual /dev/fd/* could someday be something completely
> > independent of procfs. The kernel keeps all the freedom to choose how
> > to pass the name to the interpreter. I'm not proposing any kernel
> > API/ABI lock-in and I'm with you in opposing such lock-in.
> 
> Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
> traversed and after that point there is no information whatsoever how we
> got to that vfsmount/dentry pair.  I can imagine several kludges that would
> work, but they are unspeakably ugly, and do_last() is already far too
> convoluted as it is.

I'm not sure where you're disagreeing with me. open of procfs symlinks
does not resolve the symlink and open the resulting pathname. They are
"magic symlinks" which are bound to the inode of the open file. I
don't see why this action, which is already special for magic
symlinks, can't check a flag on the magic symlink and possibly close
the corresponding file descriptor as part of its action.

In any case, whether/how fexecve works with interpreters is something
the kernel can change without breaking userspace expectations. My goal
is to avoid creating any new API/ABI requirement here.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 23:12                               ` Rich Felker
@ 2015-01-09 23:24                                 ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2015-01-09 23:24 UTC (permalink / raw)
  To: Rich Felker
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 9, 2015 at 3:12 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Fri, Jan 09, 2015 at 10:57:43PM +0000, Al Viro wrote:
>> On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:
>>
>> > Here's a very simple way it could work -- it could put the O_PATH fd
>> > on a previously-unused fd number, and put a special flag on the fd,
>> > like FD_CLOEXEC, but that causes the kernel to close it whenever it's
>> > opened. The pathname passed could then simply be /dev/fd/%d or
>> > /proc/self/fd/%d, and although this is presently dependent on /proc
>> > being mounted, virtual /dev/fd/* could someday be something completely
>> > independent of procfs. The kernel keeps all the freedom to choose how
>> > to pass the name to the interpreter. I'm not proposing any kernel
>> > API/ABI lock-in and I'm with you in opposing such lock-in.
>>
>> Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
>> traversed and after that point there is no information whatsoever how we
>> got to that vfsmount/dentry pair.  I can imagine several kludges that would
>> work, but they are unspeakably ugly, and do_last() is already far too
>> convoluted as it is.
>
> I'm not sure where you're disagreeing with me. open of procfs symlinks
> does not resolve the symlink and open the resulting pathname. They are
> "magic symlinks" which are bound to the inode of the open file. I
> don't see why this action, which is already special for magic
> symlinks, can't check a flag on the magic symlink and possibly close
> the corresponding file descriptor as part of its action.
>
> In any case, whether/how fexecve works with interpreters is something
> the kernel can change without breaking userspace expectations. My goal
> is to avoid creating any new API/ABI requirement here.
>

I think that, if we really want to support clean fexecve on O_CLOEXEC
scripts some day, the right way to do it is to fix the script
interface for real.  Have a special flag in the headers of script
interpreters that support a new interface that says "when I'm a script
interpreter, I expect an auxv entry AT_SCRIPT_FD with an open fd with
CLOEXEC set".  Then we can directly exec scripts by fd, even with
O_CLOEXEC set, without any races.
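
As a rough illustration of the interpreter side of such an interface, the
lookup might resemble the sketch below. AT_SCRIPT_FD is only a proposal
in this thread, so the constant's value and the helper name are
assumptions invented for the example; getauxval(3) itself is real.

```c
#include <errno.h>
#include <sys/auxv.h>

/* Hypothetical: no such auxv type exists. The value is arbitrary and
 * chosen only so that it does not collide with real AT_* constants. */
#define AT_SCRIPT_FD_HYPOTHETICAL 0x53434644UL

/* Return the script fd advertised by the kernel, or -1 if the entry is
 * absent. fd 0 is a valid descriptor, so a zero return from getauxval()
 * is treated as "absent" only when errno is ENOENT (glibc 2.19 and
 * later set it; older versions leave errno untouched). */
int script_fd_from_auxv(void)
{
    unsigned long v;

    errno = 0;
    v = getauxval(AT_SCRIPT_FD_HYPOTHETICAL);
    if (v == 0 && errno == ENOENT)
        return -1;
    return (int)v;
}
```

An interpreter that finds no such entry would fall back to opening the
pathname it was given in argv[], as it does today.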

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 23:12                               ` Rich Felker
@ 2015-01-09 23:36                                 ` Al Viro
  -1 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-09 23:36 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:

> I'm not sure where you're disagreeing with me. open of procfs symlinks
> does not resolve the symlink and open the resulting pathname. They are
> "magic symlinks" which are bound to the inode of the open file. I
> don't see why this action, which is already special for magic
> symlinks, can't check a flag on the magic symlink and possibly close
> the corresponding file descriptor as part of its action.

_What_ action?  ->follow_link()?  As in "the same thing that e.g.
stat(2) would trigger"?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 23:24                                 ` Andy Lutomirski
@ 2015-01-09 23:37                                   ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-09 23:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 09, 2015 at 03:24:12PM -0800, Andy Lutomirski wrote:
> On Fri, Jan 9, 2015 at 3:12 PM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Fri, Jan 09, 2015 at 10:57:43PM +0000, Al Viro wrote:
> >> On Fri, Jan 09, 2015 at 05:42:52PM -0500, Rich Felker wrote:
> >>
> >> > Here's a very simple way it could work -- it could put the O_PATH fd
> >> > on a previously-unused fd number, and put a special flag on the fd,
> >> > like FD_CLOEXEC, but that causes the kernel to close it whenever it's
> >> > opened. The pathname passed could then simply be /dev/fd/%d or
> >> > /proc/self/fd/%d, and although this is presently dependent on /proc
> >> > being mounted, virtual /dev/fd/* could someday be something completely
> >> > independent of procfs. The kernel keeps all the freedom to choose how
> >> > to pass the name to the interpreter. I'm not proposing any kernel
> >> > API/ABI lock-in and I'm with you in opposing such lock-in.
> >>
> >> Huh?  open() on procfs symlinks does *NOT* work that way - the symlink is
> >> traversed and after that point there is no information whatsoever how we
> >> got to that vfsmount/dentry pair.  I can imagine several kludges that would
> >> work, but they are unspeakably ugly, and do_last() is already far too
> >> convoluted as it is.
> >
> > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > does not resolve the symlink and open the resulting pathname. They are
> > "magic symlinks" which are bound to the inode of the open file. I
> > don't see why this action, which is already special for magic
> > symlinks, can't check a flag on the magic symlink and possibly close
> > the corresponding file descriptor as part of its action.
> >
> > In any case, whether/how fexecve works with interpreters is something
> > the kernel can change without breaking userspace expectations. My goal
> > is to avoid creating any new API/ABI requirement here.
> 
> I think that, if we really want to support clean fexecve on O_CLOEXEC
> scripts some day, the right way to do it is to fix the script
> interface for real.  Have a special flag in the headers of script
> interpreters that support a new interface that says "when I'm a script
> interpreter, I expect an auxv entry AT_SCRIPT_FD with an open fd with
> CLOEXEC set".  Then we can directly exec scripts by fd, even with
> O_CLOEXEC set, without any races.

This is also acceptable, but I don't think you'd really need a special
header flag. Just pass it, and also pass /dev/fd/%d or
/proc/self/fd/%d in argv[]. If the interpreter supports it, everything
works fine. If not, it still works as long as /proc is mounted, but
with a partial fd leak. (Note: the leak is not so bad since the
interpreter would inherit a close-on-exec fd and thus would not leak
it further.)

Aside from setting up the new auxv entry, the main trick the kernel
would have to do is bypassing FD_CLOEXEC at exec time while keeping
the FD_CLOEXEC flag present on the fd after exec.
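
For reference, this is roughly how the flag looks from userspace today (a
minimal sketch; set_cloexec() and has_cloexec() are illustrative names,
while fcntl(2) with F_GETFD/F_SETFD is the real interface).

```c
#include <fcntl.h>

/* Set FD_CLOEXEC on fd while preserving any other descriptor flags.
 * Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if clear, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

Nothing in this interface expresses "survive this one exec but keep the
flag", which appears to be why that part would have to be done inside the
kernel's exec path.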

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 23:24                                 ` Andy Lutomirski
  (?)
  (?)
@ 2015-01-10  0:01                                 ` Al Viro
  -1 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-10  0:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rich Felker, David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 09, 2015 at 03:24:12PM -0800, Andy Lutomirski wrote:

> I think that, if we really want to support clean fexecve on O_CLOEXEC
> scripts some day, the right way to do it is to fix the script
> interface for real.  Have a special flag in the headers of script
> interpreters that support a new interface that says "when I'm a script
> interpreter, I expect an auxv entry AT_SCRIPT_FD with an open fd with
> CLOEXEC set".  Then we can directly exec scripts by fd, even with
> O_CLOEXEC set, without any races.

Amazing.  Let me see if I got it straight - you want a magical Linux-only
flag to mark the binaries that might be used as interpreters.  _Plus_ the
Linux-only logic in their source to go with that.  With corresponding kludges
to parsing the command line (you know, like #!/usr/bin/make -f as the first
line in a script - somehow it should recognize the deep magic of the oh
so fucking superior interface and suppress the normal behaviour).  Maintained
by hell knows whom.  Onna stick.  Inna bun.  CMOT Dibbler would be proud...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  1:17                         ` Eric W. Biederman
  0 siblings, 0 replies; 123+ messages in thread
From: Eric W. Biederman @ 2015-01-10  1:17 UTC (permalink / raw)
  To: Rich Felker
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

Rich Felker <dalias@aerifal.cx> writes:

> I'm not proposing code because I'm a libc developer not a kernel
> developer. I know what's needed for userspace to provide a conforming
> fexecve to applications, not how to implement that on the kernel side,
> although I'm trying to provide constructive ideas. The hostility is
> really not necessary.

Conforming to what?

The Open Group specification of fexecve says nothing about requiring a file
descriptor passed to fexecve to have O_CLOEXEC.

Further, looking at the Open Group specification of exec, it seems to indicate
that the preferred way to handle this is for the kernel to return ENOEXEC
and then libc gets to figure out how to run the shell script.  Is that
the kind of ``conforming'' implementation you are looking for?

Eric

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  1:33                           ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-10  1:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Fri, Jan 09, 2015 at 07:17:41PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > I'm not proposing code because I'm a libc developer not a kernel
> > developer. I know what's needed for userspace to provide a conforming
> > fexecve to applications, not how to implement that on the kernel side,
> > although I'm trying to provide constructive ideas. The hostility is
> > really not necessary.
> 
> Conforming to what?
> 
> The Open Group specification of fexecve says nothing about requiring a file
> descriptor passed to fexecve to have O_CLOEXEC.

It doesn't require it but it allows it, and in multithreaded programs
that might run child processes (or library code that might be used in
such situations), O_CLOEXEC is mandatory everywhere to avoid fd leaks.
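
To illustrate the fd-leak point, here is a minimal sketch, assuming Linux
and glibc.  The helper names open_cloexec() and fd_is_cloexec() are made up
for this example.  Setting the flag in the open() call itself is atomic; a
separate fcntl(F_SETFD) afterwards leaves a window in which another thread
can fork and exec with the fd still inheritable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open read-only with close-on-exec set atomically. */
int open_cloexec(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

An fd obtained this way is exactly the kind that a library would want to
hand to fexecve(), which is why the O_CLOEXEC case matters here.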

> Further, looking at the Open Group specification of exec, it seems to indicate
> that the preferred way to handle this is for the kernel to return ENOEXEC
> and then libc gets to figure out how to run the shell script.  Is that
> the kind of ``conforming'' implementation you are looking for?

This is a complex issue, and does not apply to native #! support
(which is a supported executable format and thus not ENOEXEC) but
rather standard POSIX shell scripts (which don't have a #! line at
all). In this case the behavior of fexecve is perhaps under-specified.
However, in cases where execve would succeed (without causing
ENOEXEC), I think it's at least undesirable, if not non-conforming,
for fexecve to fail.
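
For reference, the ENOEXEC case can be observed from userspace with a rough
sketch like the one below.  It assumes Linux 3.19 or later (for execveat)
and that anonymous memfd files are executable on the system, which is the
default as far as I know.  The helper name exec_fd_errno() is invented for
the example, and the raw syscall() interface is used so that no libc
wrapper is assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: store `text` in an anonymous in-memory file, try to
 * execute it by fd in a child, and return the errno the child saw.  The
 * result is only meaningful when the exec fails; -1 means setup failed.
 */
int exec_fd_errno(const char *text)
{
    char *const argv[] = { "demo", NULL };
    char *const envp[] = { NULL };
    int fd, status;
    pid_t pid;
    size_t len;

    assert(text != NULL);
    len = strlen(text);

    fd = (int)syscall(SYS_memfd_create, "demo", 0U);
    if (fd < 0)
        return -1;
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Execute the file the fd refers to; the pathname is empty. */
        syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
        _exit(errno);   /* reached only if the exec failed */
    }

    close(fd);
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Passing a body with no #! line, such as "echo hello\n", should make the
child report ENOEXEC.  That is the point at which execvp()-style library
code falls back to running the file with sh.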

Should we request clarification from the Austin Group?

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  3:03                                   ` Al Viro
  0 siblings, 0 replies; 123+ messages in thread
From: Al Viro @ 2015-01-10  3:03 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 11:36:44PM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:
> 
> > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > does not resolve the symlink and open the resulting pathname. They are
> > "magic symlinks" which are bound to the inode of the open file. I
> > don't see why this action, which is already special for magic
> > symlinks, can't check a flag on the magic symlink and possibly close
> > the corresponding file descriptor as part of its action.
> 
> _What_ action?  ->follow_link()?  As in "the same thing that e.g.
> stat(2) would trigger"?

To elaborate a bit: the fundamental method for symlink traversal is
->follow_link().  It gets the dentry of the object itself plus an opaque
context.  Usually it just obtains some string (== symlink contents) and calls
nd_set_link(context, string).  In that case the string will be interpreted
by its callers in the usual way.  Another possibility is to call
nd_jump_link(context, location), which will reset the current position
(the directory in which the symlink has been found and relative to which it
would be interpreted) to the given location in the tree.  It might actually do
both - then the string will be interpreted relative to the new location.
Once the pathname resolution is done with the string stored by nd_set_link(),
it calls another method - ->put_link().  That one releases the object
that contains this string; it gets an opaque pointer returned by
->follow_link().  Returning ERR_PTR(-Esomething) indicates an error, so does
nd_set_link(context, ERR_PTR(-Esomething)).
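
The two flavours can be pictured with a toy userspace model.  This is only
a sketch: every name in it (toy_ctx, toy_symlink, toy_set_link(),
toy_jump_link(), toy_traverse()) is invented for illustration and none of
it is the real fs/namei.c code.  Locations are plain strings here, and
->put_link() and error handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the opaque context handed to ->follow_link(). */
struct toy_ctx {
    const char *link_body;  /* string left for the caller to interpret */
    const char *position;   /* current location of the walk */
};

/* Toy stand-in for "an inode with a non-NULL ->follow_link()". */
struct toy_symlink {
    void (*follow_link)(const struct toy_symlink *self, struct toy_ctx *ctx);
    const char *data;
};

static void toy_set_link(struct toy_ctx *ctx, const char *body)
{
    ctx->link_body = body;      /* models nd_set_link() */
}

static void toy_jump_link(struct toy_ctx *ctx, const char *location)
{
    ctx->position = location;   /* models nd_jump_link() */
}

/* Ordinary symlink: hand the stored string back to the caller. */
void ordinary_follow(const struct toy_symlink *self, struct toy_ctx *ctx)
{
    toy_set_link(ctx, self->data);
}

/* procfs-style symlink: move the walk directly, leave no string behind. */
void magic_follow(const struct toy_symlink *self, struct toy_ctx *ctx)
{
    toy_jump_link(ctx, self->data);
}

/*
 * The caller runs the same code for every symlink and cannot tell the two
 * kinds apart.  Returns 1 if a string was left to interpret, 0 otherwise.
 */
int toy_traverse(const struct toy_symlink *link, struct toy_ctx *ctx)
{
    assert(link != NULL && link->follow_link != NULL);
    ctx->link_body = NULL;
    link->follow_link(link, ctx);
    return ctx->link_body != NULL;
}
```

The point of the model is that toy_traverse() has no idea which kind of
link it was given, and no idea which syscall asked for the traversal.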

readlink(2) is using a different method (->readlink()) and any object whose
->follow_link() only uses nd_set_link() can use generic_readlink as its
->readlink instance - that will call ->follow_link(), copy the string
stored by nd_set_link() to userland buffer and use ->put_link() to release
whatever needs to be released.  Most of the symlinks are doing just that.

procfs "magical" symlinks have ->follow_link() that uses nd_jump_link();
they obviously can't use generic_readlink() (there is no string left
by ->follow_link() for caller to traverse), so they have non-standard
->readlink() instances - ones that use d_path() to generate a plausible
pathname of the would-be destination of their ->follow_link().  Or something
like pipe:[696969], etc.

Note, however, that ->readlink() is used only by readlink(2) syscall; as far
as pathname resolution is concerned it is completely irrelevant.  What matters
is ->follow_link().
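
The difference is visible from userspace.  A small sketch, assuming Linux
with /proc mounted (the helper names fd_link_text() and reopen_via_proc()
are made up): for a pipe, readlink(2) on /proc/self/fd/<n> yields text
that is not a usable pathname at all, while open(2) through the very same
symlink still reaches the object.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: what ->readlink() reports for /proc/self/fd/<fd>. */
ssize_t fd_link_text(int fd, char *buf, size_t size)
{
    char path[64];
    ssize_t n;

    assert(buf != NULL && size > 0);
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    n = readlink(path, buf, size - 1);
    if (n >= 0)
        buf[n] = '\0';  /* readlink() does not NUL-terminate */
    return n;
}

/* Hypothetical helper: open the same object by traversing the symlink. */
int reopen_via_proc(int fd, int flags)
{
    char path[64];

    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    return open(path, flags);
}
```

For the read end of a pipe, fd_link_text() gives something of the form
pipe:[<inode>], which names nothing in the filesystem, yet
reopen_via_proc() with O_RDONLY | O_NONBLOCK returns a working descriptor.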

Now, the callers do not know (and do not care) what a particular symlink _is_.
A symlink is just a dentry with inode that has non-NULL ->follow_link()
method.  That's it.  Moreover, _any_ pathname resolution is using the
same method for symlink traversal, be it open(2), stat(2), whatever.  If
a symlink is to be traversed, that's it - the only choice VFS has is whether
to traverse it at all or not (think of stat(2) vs lstat(2) difference, or
O_NOFOLLOW, etc.)
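
That one choice, traverse or don't, is all that userspace gets to make as
well.  A short sketch, assuming Linux with /proc mounted and using
/proc/self/exe as a convenient magic symlink; the helper names are
invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* lstat(2): do not traverse.  1 if path itself is a symlink, 0 if not. */
int is_symlink_itself(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}

/* stat(2): traverse.  Reports on whatever the traversal ended up at. */
int is_symlink_after_follow(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}

/* open(2) with O_NOFOLLOW: returns 0 on success, else the errno seen. */
int open_nofollow_errno(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_NOFOLLOW);
    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}
```

On /proc/self/exe, lstat() sees a symlink, stat() sees the executable it
leads to, and open() with O_NOFOLLOW refuses with ELOOP.  None of the
three calls has any way to ask for a different kind of traversal.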

_After_ the traversal it's too late to do this sort of thing - after all,
how do you tell if your current position had been set by the traversal of
your symlink or that of any normal /proc/self/fd/<n>?

And doing that _during_ the traversal would really suck - stray ls -lR /proc
could race with that open() done by script interpreter.

It might be possible to work around that, but trying that rapidly gets into
very ugly territory, *especially* since the handling of the final component
of open(2) (fs/namei.c:do_last()) is already far too convoluted.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10  3:03                                   ` Al Viro
@ 2015-01-10  3:41                                     ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-10  3:41 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Sat, Jan 10, 2015 at 03:03:00AM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 11:36:44PM +0000, Al Viro wrote:
> > On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:
> > 
> > > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > > does not resolve the symlink and open the resulting pathname. They are
> > > "magic symlinks" which are bound to the inode of the open file. I
> > > don't see why this action, which is already special for magic
> > > symlinks, can't check a flag on the magic symlink and possibly close
> > > the corresponding file descriptor as part of its action.
> > 
> > _What_ action?  ->follow_link()?  As in "the same thing that e.g.
> > stat(2) would trigger"?
> 
> To elaborate a bit: the fundamental method for symlink traversal is
> ->follow_link().  It gets the dentry of the object itself plus an opaque
> context.  Usually it just obtains some string (== symlink contents) and calls
> nd_set_link(context, string).  In that case the string will be interpreted
> by its callers in the usual way.  Another possibility is to call
> nd_jump_link(context, location), which will reset the current position
> (the directory in which the symlink has been found and relative to which it
> would be interpreted) to the given location in the tree.  It might actually do
> both - then the string will be interpreted relative to the new location.
> Once the pathname resolution is done with the string stored by nd_set_link(),
> it calls another method - ->put_link().  That one releases the object
> that contains this string; it gets an opaque pointer returned by
> ->follow_link().  Returning ERR_PTR(-Esomething) indicates an error, so does
> nd_set_link(context, ERR_PTR(-Esomething)).
> 
> readlink(2) is using a different method (->readlink()) and any object whose
> ->follow_link() only uses nd_set_link() can use generic_readlink as its
> ->readlink instance - that will call ->follow_link(), copy the string
> stored by nd_set_link() to userland buffer and use ->put_link() to release
> whatever needs to be released.  Most of the symlinks are doing just that.
> 
> procfs "magical" symlinks have ->follow_link() that uses nd_jump_link();
> they obviously can't use generic_readlink() (there is no string left
> by ->follow_link() for caller to traverse), so they have non-standard
> ->readlink() instances - ones that use d_path() to generate a plausible
> pathname of the would-be destination of their ->follow_link().  Or something
> like pipe:[696969], etc.
> 
> Note, however, that ->readlink() is used only by readlink(2) syscall; as far
> as pathname resolution is concerned it is completely irrelevant.  What matters
> is ->follow_link().
> 
> Now, the callers do not know (and do not care) what a particular symlink _is_.
> A symlink is just a dentry with inode that has non-NULL ->follow_link()
> method.  That's it.  Moreover, _any_ pathname resolution is using the
> same method for symlink traversal, be it open(2), stat(2), whatever.  If
> a symlink is to be traversed, that's it - the only choice VFS has is whether
> to traverse it at all or not (think of stat(2) vs lstat(2) difference, or
> O_NOFOLLOW, etc.)
> 
> _After_ the traversal it's too late to do this sort of thing - after all,
> how do you tell if your current position had been set by the traversal of
> your symlink or that of any normal /proc/self/fd/<n>?

Thanks for clarifying how this all works in the kernel. It makes it
easier to understand what the costs (especially complexity costs) of
different implementation options might be for the kernel.

> And doing that _during_ the traversal would really suck - stray ls -lR /proc
> could race with that open() done by script interpreter.

IMO this one issue is easily solvable by limiting the special action
to calls by the owning pid.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  3:41                                     ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-10  3:41 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Sat, Jan 10, 2015 at 03:03:00AM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 11:36:44PM +0000, Al Viro wrote:
> > On Fri, Jan 09, 2015 at 06:12:48PM -0500, Rich Felker wrote:
> > 
> > > I'm not sure where you're disagreeing with me. open of procfs symlinks
> > > does not resolve the symlink and open the resulting pathname. They are
> > > "magic symlinks" which are bound to the inode of the open file. I
> > > don't see why this action, which is already special for magic
> > > symlinks, can't check a flag on the magic symlink and possibly close
> > > the corresponding file descriptor as part of its action.
> > 
> > _What_ action?  ->follow_link()?  As in "the same thing that e.g.
> > stat(2) would trigger"?
> 
> To elaborate a bit: the fundamental method for symlink traversal is
> ->follow_link().  It gets dentry of the object itself + opaque context.
> Usually it just obtains some string (= symlink contents) and calls
> nd_set_link(context, string).  In that case the string will be interpreted
> by its callers in usual way.  Another possibility is to call
> nd_jump_link(context, location), which will reset the current position
> (directory in which the symlink has been found and relative to which it
> would be interpreted) to given location in tree.  It might actually do
> both - then the string will be interpreted relative to the new location.
> Once the pathname resolution is done with the string stored by nd_set_link(),
> it calls another method - ->put_link().  That one releases the object
> that contains this string; it gets an opaque pointer returned by
> ->follow_link().  Returning ERR_PTR(-Esomething) indicates an error, so does
> nd_set_link(context, ERR_PTR(-Esomething)).
> 
> readlink(2) is using a different method (->readlink()) and any object whose
> ->follow_link() only uses nd_set_link() can use generic_readlink as its
> ->readlink instance - that will call ->follow_link(), copy the string
> stored by nd_set_link() to userland buffer and use ->put_link() to release
> whatever needs to be released.  Most of the symlinks are doing just that.
> 
> procfs "magical" symlinks have ->follow_link() that uses nd_jump_link();
> they obviously can't use generic_readlink() (there is no string left
> by ->follow_link() for caller to traverse), so they have non-standard
> ->readlink() instances - ones that use d_path() to generate a plausible
> pathname of the would-be destination of their ->follow_link().  Or something
> like pipe:[696969], etc.
> 
> Note, however, that ->readlink() is used only by readlink(2) syscall; as far
> as pathname resolution is concerned it is completely irrelevant.  What matters
> is ->follow_link().
> 
> Now, the callers do not know (and do not care) what a particular symlink _is_.
> A symlink is just a dentry with inode that has non-NULL ->follow_link()
> method.  That's it.  Moreover, _any_ pathname resolution is using the
> same method for symlink traversal, be it open(2), stat(2), whatever.  If
> a symlink is to be traversed, that's it - the only choice VFS has is whether
> to traverse it at all or not (think of stat(2) vs lstat(2) difference, or
> O_NOFOLLOW, etc.)
> 
> _After_ the traversal it's too late to do this sort of thing - after all,
> how do you tell if your current position had been set by the traversal of
> your symlink or that of any normal /proc/self/fd/<n>?
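
The ->readlink() versus ->follow_link() split described above can be seen from user space. The following is a minimal sketch, assuming /proc is mounted; the helper names proc_link_text() and reopen_via_proc() are made up for illustration. For a file that has been unlinked, readlink(2) on /proc/self/fd/<n> only yields a descriptive string, yet open(2) through the same symlink still reaches the file, because traversal does not go through that string.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy the text that readlink(2) reports for
 * /proc/self/fd/<fd> into buf, NUL-terminated.  Returns the length,
 * or -1 on error.  This exercises the symlink's ->readlink() side. */
static ssize_t proc_link_text(int fd, char *buf, size_t len)
{
    char link[64];
    ssize_t n;

    if (len == 0)
        return -1;
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, buf, len - 1);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}

/* Hypothetical helper: open the same object again by traversing the
 * magic symlink.  Pathname resolution follows the link directly to
 * the file, so it works even when the reported text names no
 * existing path.  Returns a new descriptor, or -1 on error. */
static int reopen_via_proc(int fd)
{
    char link[64];

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    return open(link, O_RDONLY | O_CLOEXEC);
}
```

Calling proc_link_text() on a descriptor whose file was unlinked typically reports the old pathname with " (deleted)" appended, while reopen_via_proc() on the same descriptor still succeeds.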

Thanks for clarifying how this all works in the kernel. It makes it
easier to understand what the costs (especially complexity costs) of
different implementation options might be for the kernel.

> And doing that _during_ the traversal would really suck - stray ls -lR /proc
> could race with that open() done by script interpreter.

IMO this one issue is easily solvable by limiting the special action
to calls by the owning pid.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10  3:41                                     ` Rich Felker
  (?)
@ 2015-01-10  4:14                                     ` Al Viro
  2015-01-10  5:57                                         ` Rich Felker
  -1 siblings, 1 reply; 123+ messages in thread
From: Al Viro @ 2015-01-10  4:14 UTC (permalink / raw)
  To: Rich Felker
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 09, 2015 at 10:41:44PM -0500, Rich Felker wrote:
> > _After_ the traversal it's too late to do this sort of thing - after all,
> > how do you tell if your current position had been set by the traversal of
> > your symlink or that of any normal /proc/self/fd/<n>?
> 
> Thanks for clarifying how this all works in the kernel. It makes it
> easier to understand what the costs (especially complexity costs) of
> different implementation options might be for the kernel.
> 
> > And doing that _during_ the traversal would really suck - stray ls -lR /proc
> > could race with that open() done by script interpreter.
> 
> IMO this one issue is easily solvable by limiting the special action
> to calls by the owning pid.

Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
etc.) before bothering with open(2), you'll get screwed.  Moreover, if it
does so only in case when you have something specific in environment,
you'll have the devil of the time trying to figure out how to reproduce
such a bug report...

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10  4:14                                     ` Al Viro
@ 2015-01-10  5:57                                         ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-10  5:57 UTC (permalink / raw)
  To: Al Viro
  Cc: David Drysdale, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:
> On Fri, Jan 09, 2015 at 10:41:44PM -0500, Rich Felker wrote:
> > > _After_ the traversal it's too late to do this sort of thing - after all,
> > > how do you tell if your current position had been set by the traversal of
> > > your symlink or that of any normal /proc/self/fd/<n>?
> > 
> > Thanks for clarifying how this all works in the kernel. It makes it
> > easier to understand what the costs (especially complexity costs) of
> > different implementation options might be for the kernel.
> > 
> > > And doing that _during_ the traversal would really suck - stray ls -lR /proc
> > > could race with that open() done by script interpreter.
> > 
> > IMO this one issue is easily solvable by limiting the special action
> > to calls by the owning pid.
> 
> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
> etc.) before bothering with open(2), you'll get screwed.

Yes, but I think that would be very bad interpreter design.
stat/getxattr/access/whatever followed by open is always a TOCTOU
race. The correct sequence of actions is always open followed by
fstat/fgetxattr/...
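
As a minimal sketch of that ordering (the helper name open_regular_file() is hypothetical, and the "must be a regular file" check is only an example of the kind of test an interpreter might want):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open 'path' first, then inspect the object
 * through the descriptor.  Returns an open descriptor referring to a
 * regular file, or -1 on failure. */
static int open_regular_file(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY | O_CLOEXEC);

    if (fd == -1)
        return -1;

    /* fstat() describes the file that was actually opened, so there
     * is no window in which the name could be re-pointed at a
     * different object between the check and the use. */
    if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With stat() followed by open(), the two calls each resolve the pathname separately, and nothing guarantees they reach the same file.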

> Moreover, if it
> does so only in case when you have something specific in environment,
> you'll have the devil of the time trying to figure out how to reproduce
> such a bug report...

Yes, this is a more serious concern. For example, if a shell processes
$HISTFILE or something before opening the script. I'm starting to
prefer the idea of just refusing to honor the close-on-exec flag for
the fd passed to fexecve but preserving it, and letting the
interpreter close the file itself if it wants to. This could be done
with or without the new auxv entry stuff.
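
To picture the interpreter side of that idea, here is a rough sketch. It assumes the script name reaches the interpreter in the form /dev/fd/<n> and that descriptor n has been left open across the exec; both helper names are hypothetical, and nothing here reflects an existing interpreter.

```c
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: if script_path is exactly "/dev/fd/<n>",
 * return n; otherwise return -1.  A path such as "/dev/fd/3/name"
 * is rejected, since there the descriptor is a directory. */
static int fd_from_script_path(const char *script_path)
{
    static const char prefix[] = "/dev/fd/";
    const char *digits;
    char *end;
    long n;

    if (strncmp(script_path, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    digits = script_path + sizeof(prefix) - 1;
    if (*digits < '0' || *digits > '9')
        return -1;
    errno = 0;
    n = strtol(digits, &end, 10);
    if (errno != 0 || *end != '\0' || n > INT_MAX)
        return -1;
    return (int)n;
}

/* Hypothetical helper: take over the inherited descriptor named by
 * script_path and mark it close-on-exec, so that it is not handed on
 * to programs the script runs.  Returns the descriptor, or -1 if the
 * path has a different form or the descriptor is not open. */
static int adopt_script_fd(const char *script_path)
{
    int fd = fd_from_script_path(script_path);
    int fdflags;

    if (fd == -1)
        return -1;
    fdflags = fcntl(fd, F_GETFD);
    if (fdflags == -1)
        return -1;
    if (fcntl(fd, F_SETFD, fdflags | FD_CLOEXEC) == -1)
        return -1;
    return fd;
}
```

An interpreter taking this route would read the script from the returned descriptor and close it when done, without needing /proc to reopen it by name.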

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 22:13                     ` Eric W. Biederman
@ 2015-01-10  7:13                       ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-10  7:13 UTC (permalink / raw)
  To: Eric W. Biederman, Rich Felker
  Cc: mtk.manpages, Al Viro, David Drysdale, Andy Lutomirski,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Christoph Hellwig,
	X86 ML, linux-arch, Linux API, sparclinux

On 01/09/2015 11:13 PM, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
>> On Fri, Jan 09, 2015 at 09:09:41PM +0000, Al Viro wrote:
> 
>> The "magic open-once magic symlink" approach is really the cleanest
>> solution I can find. In the case where the interpreter does not open
>> the script, nothing terribly bad happens; the magic symlink just
>> sticks around until _exit or exec. In the case where the interpreter
>> opens it more than once, you get a failure, but as far as I know
>> existing interpreters don't do this, and it's arguably bad design. In
>> any case it's a caught error.
> 
> And it doesn't work without introducing security vulnerabilities into
> the kernel, because it breaks close-on-exec semantics.
> 
> All you have to do is pick a file descriptor, good candidates are 0 and
> 255 and make it a convention that that file descriptor is used for
> fexecve.  At least when you want to support scripts.  Otherwise you can
> set close-on-exec.
> 
> That results in no accumulation of file descriptors because everyone
> always uses the same file descriptor.
> 
> Regardless you don't have a patch and you aren't proposing code and the
> code isn't actually broken so please go away.

Eric,

This style of response isn't helpful. Suggesting that people must have
a patch in hand in order to have a conversation about kernel development
means a lot of clever people are going to be excluded from important
conversations. Those clever people are some user-space developers
who develop the software that the kernel interacts with--you know, the
user-space that is the kernel's raison-d'être.

Rich, as far as I've seen, is one of those clever people--he implemented
and maintains a (pretty much complete?) standard C library, so when he
comes to a conversation like this, I think it's best to start with
the assumption that he's thought long and hard about the problem.
Seemingly hostile responses such as the ones you (and Al) make above
don't do much to advance the conversation toward a solution.

And there is a problem [*] and nothing I've seen so far in this
conversation seems to provide a solution within the current 
kernel implementation (but, maybe I am not clever enough to see it).

==

[*] A summary of the problem for bystanders:

[0.a] Some people want a solution to implementing fexecve() 
      (http://man7.org/linux/man-pages/man3/fexecve.3.html )
      in the absence of /proc (which is currently used for 
      the implementation). The new execveat() is a stepping
      stone to that solution.

[0.b] POSIX permits, but does not require, the FD_CLOEXEC
      (close-on-exec) file descriptor flag to be set on the
      file descriptor passed to fexecve().

[1]   The sequence:
          * Open a script file, to get a descriptor, 'fd'
          * Set the close-on-exec flag on 'fd'
          * execveat(fd, "", argv, envp, AT_EMPTY_PATH)

      fails in the execveat() because, by the time the script
      interpreter has been loaded, 'fd' has been closed because
      of the close-on-exec flag.

[2]   Omitting the use of close-on-exec on the FD given to
      fexecve()/execveat() means that the execed script
      receives a superfluous file descriptor that refers to the
      script file. The script cannot determine that there is such
      an FD or which FD it is without some messy special-case
      hacking to inspect its environment (and that hacking must be
      based on /proc, AFAICT!)

[3]   Scripts won't do the check in [2], with the result that
      there'll be descriptor leaks in some cases where
      fexecve()/execveat() is used repeatedly.

[4]   (As Rich points out in a reply to the parent message, the
      solution suggested above of using a fixed file descriptor 
      for fexecve() does not solve the problem either.)
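
The sequence in [1] can be sketched as a small helper. This is an illustration under stated assumptions, not the program used for the test output in this message: exec_by_fd() is a hypothetical name, the raw syscall(2) interface is used in case the C library has no wrapper, and the pathname is passed as an empty string, which is what AT_EMPTY_PATH expects.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Fallback for older headers; 322 is the x86-64 number only. */
#if !defined(SYS_execveat) && defined(__x86_64__)
#define SYS_execveat 322
#endif

/* Hypothetical helper: open 'path', optionally with the close-on-exec
 * flag set, and execute it by descriptor.  Returns only on failure,
 * with -1 and errno describing the failed step. */
static int exec_by_fd(const char *path, int cloexec,
                      char *const argv[], char *const envp[])
{
    int saved_errno;
    int fd = open(path, O_RDONLY | (cloexec ? O_CLOEXEC : 0));

    if (fd == -1)
        return -1;

    /* On success this call does not return. */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);

    /* Reached only if the exec failed; do not leak the descriptor. */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return -1;
}
```

Calling exec_by_fd() on a "#!" script with cloexec set to 1 is the failing case from [1]; with cloexec set to 0 the script runs but inherits the descriptor, which is the situation in [2].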

For an example of the leak, consider the following simple program 
and script. The program is just a simple command-line interface to 
exercise execveat():

=====
/* t_execveat.c
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define __NR_execveat 322 /* x86-64 */

static int execveat(int dirfd, const char *pathname, char *const argv[],
                    char *const envp[], int flags)
{
    return syscall(__NR_execveat, dirfd, pathname, argv, envp, flags);
}

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

extern char **environ;

int
main(int argc, char *argv[])
{
    int flags, dirfd;
    char *path;

    flags = 0;

    if (argc < 4) {
        fprintf(stderr, "%s dirfd-path path argv0 [argvN...]\n", argv[0]);
        fprintf(stderr, "\tSpecify 'dirfd' as '-' to get AT_FDCWD\n");
        fprintf(stderr, "\tSpecify 'path' as an empty string to get "
                "AT_EMPTY_PATH\n");
        exit(EXIT_FAILURE);
    }

    if (argv[1][0] == '-')
        dirfd = AT_FDCWD;
    else {
        dirfd = open(argv[1], O_RDONLY);
        if (dirfd == -1)
            errExit("open");
    }

    path = argv[2];
    if (strlen(path) == 0)
        flags = AT_EMPTY_PATH;

    execveat(dirfd, path, &argv[3], environ, flags);
    errExit("execveat");

    exit(EXIT_SUCCESS);
}
=====

And then a simple script (necho.sh) that recursively invokes itself using
the above program demonstrates the problem.

=====
#!/bin/sh
echo 
echo '$0 =' $0
ls -l /proc/$$/fd
./t_execveat ./necho.sh "" arg1 # $arg
=====

When we run this script, we see:

=====

# chmod +x necho.sh
# ./t_execveat ./necho.sh "" arg1

$0 = /dev/fd/3
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh

$0 = /dev/fd/4
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh

$0 = /dev/fd/5
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 5 -> /home/mtk/necho.sh

$0 = /dev/fd/6
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 5 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 6 -> /home/mtk/necho.sh

$0 = /dev/fd/7
total 0
lrwx------. 1 root root 64 Jan 10 07:59 0 -> /dev/pts/0
lrwx------. 1 root root 64 Jan 10 07:59 1 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 199 -> /home/mtk/necho.sh
lrwx------. 1 root root 64 Jan 10 07:59 2 -> /dev/pts/0
lr-x------. 1 root root 64 Jan 10 07:59 3 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 4 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 5 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 6 -> /home/mtk/necho.sh
lr-x------. 1 root root 64 Jan 10 07:59 7 -> /home/mtk/necho.sh


[and so on until we run out of file descriptors]
=====

(I think the FD 199 in the above output is some bash(1) artifact, unrelated
to the conversation at hand.)

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 16:13       ` Rich Felker
  (?)
@ 2015-01-10  7:38         ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-10  7:38 UTC (permalink / raw)
  To: Rich Felker
  Cc: mtk.manpages, David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux

On 01/09/2015 05:13 PM, Rich Felker wrote:
> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>> ---
>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 153 insertions(+)
>>>  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done 
>> a few very light edits, and will release the version below 
>> with the next man-pages release.
>>
>> I have one question. In the message accompanying
>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>
>>   The filename fed to the executed program as argv[0] (or the name of the
>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>   reflecting how the executable was found.  This does however mean that
>>   execution of a script in a /proc-less environment won't work; also, script
>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>   accessible after exec).
>>
>> How does one produce this situation where the execed program sees 
>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>> call look like?) I tried to produce this scenario, but could not.
> 
> I think this is wrong. argv[0] is an arbitrary string provided by the
> caller and would never be derived from the fd passed. It's AT_EXECFN,
> /proc/self/exe, and filenames shown elsewhere in /proc that may be
> derived in odd ways.
> 
> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> rather than the main description. The long-term intent should be that
> script execution this way should work. IIRC this was discussed earlier
> in the thread.

I agree that something needs to be said. What I did instead was add
"See BUGS" to the ENOENT error, and then this text:

   BUGS
       The ENOENT error described above means that it is not possible
       to set the close-on-exec flag on the file descriptor given to a
       call of the form:

           execveat(fd, "", argv, envp, AT_EMPTY_PATH);

       However, the inability to set the close-on-exec flag means that a
       file descriptor referring to the script leaks through to the
       script itself.  As well as wasting a file descriptor, this
       leakage can lead to file-descriptor exhaustion in scenarios where
       scripts recursively employ execveat() (or a future fexecve(3)
       implementation that might be based on execveat()).

Okay?

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  7:38         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-10  7:38 UTC (permalink / raw)
  To: Rich Felker
  Cc: mtk.manpages, David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, x86, linux-arch, linux-api, sparclinux

On 01/09/2015 05:13 PM, Rich Felker wrote:
> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>> ---
>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 153 insertions(+)
>>>  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done 
>> a few very light edits, and will release the version below 
>> with the next man-pages release.
>>
>> I have one question. In the message accompanying
>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>
>>   The filename fed to the executed program as argv[0] (or the name of the
>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>   reflecting how the executable was found.  This does however mean that
>>   execution of a script in a /proc-less environment won't work; also, script
>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>   accessible after exec).
>>
>> How does one produce this situation where the execed program sees 
>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>> call look like?) I tried to produce this scenario, but could not.
> 
> I think this is wrong. argv[0] is an arbitrary string provided by the
> caller and would never be derived from the fd passed. It's AT_EXECFN,
> /proc/self/exe, and filenames shown elsewhere in /proc that may be
> derived in odd ways.
> 
> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
> rather than the main description. The long-term intent should be that
> script execution this way should work. IIRC this was discussed earlier
> in the thread.

I agree, that something needs to be said. What I instead did was 
added "See BUGS" to the ENOEXEC error, and then this text:

   BUGS
       The ENOENT error described above means that it is not possible
       to set the close-on-exec flag on the file descriptor given to a
       call of the form:

           execveat(fd, "", argv, envp, AT_EMPTY_PATH);

       However, the inability to set the close-on-exec flag means that a
       file descriptor referring to the script leaks through to the
       script itself.  As well as wasting a file descriptor, this
       leakage can lead to file-descriptor exhaustion in scenarios where
       scripts recursively employ execveat() (or a future fexecve(3)
       implementation that might be based on execveat()).

Okay?

Thanks,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  7:43           ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-10  7:43 UTC (permalink / raw)
  To: David Drysdale, Rich Felker
  Cc: mtk.manpages, Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Christoph Hellwig,
	X86 ML, linux-arch, Linux API, sparclinux

On 01/09/2015 06:46 PM, David Drysdale wrote:
> On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
>> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>>> ---
>>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 153 insertions(+)
>>>>  create mode 100644 man2/execveat.2
>>>
>>> David,
>>>
>>> Thanks for the very nicely prepared man page. I've done
>>> a few very light edits, and will release the version below
>>> with the next man-pages release.
>>>
>>> I have one question. In the message accompanying
>>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>>
>>>   The filename fed to the executed program as argv[0] (or the name of the
>>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>>   reflecting how the executable was found.  This does however mean that
>>>   execution of a script in a /proc-less environment won't work; also, script
>>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>>   accessible after exec).
>>>
>>> How does one produce this situation where the execed program sees
>>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>>> call look like?) I tried to produce this scenario, but could not.
>>
>> I think this is wrong. argv[0] is an arbitrary string provided by the
>> caller and would never be derived from the fd passed.
> 
> Yeah, I think I just wrote that wrong, it's only relevant for scripts.
> As Rich says, for normal binaries argv[0] is just the argv[0] that
> was passed into the execve[at] call.  For a script, the code in
> fs/binfmt_script.c will remove the original argv[0] and put the
> interpreter name and the script filename (e.g. "/bin/sh",
> "/dev/fd/6/script") in as 2 arguments in its place.

Yep, got it now.
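
The argument rewrite David describes above can be modelled in userspace
roughly as follows. This is only a sketch of the described behaviour,
not the code in fs/binfmt_script.c; the function name
rewrite_script_argv() is made up for illustration, and the optional
argument on the "#!" line is ignored:

```c
#include <stdlib.h>

/*
 * Model of the rewrite: the original argv[0] is dropped, and the
 * interpreter name and the script name are put in its place.  The
 * remaining arguments follow unchanged.  The returned array must be
 * freed by the caller; the strings themselves are not copied.
 */
char **rewrite_script_argv(const char *interp, const char *script,
                           char *const argv[])
{
    size_t argc = 0, i;
    char **out;

    while (argv[argc] != NULL)
        argc++;

    /* interp + script + argv[1..argc-1] + terminating NULL */
    out = malloc((argc ? argc + 2 : 3) * sizeof *out);
    if (out == NULL)
        return NULL;

    out[0] = (char *)interp;
    out[1] = (char *)script;
    for (i = 1; i < argc; i++)
        out[i + 1] = argv[i];
    out[argc ? argc + 1 : 2] = NULL;

    return out;
}
```

With David's example values, an original argv of { "x", "-v", NULL }
would become { "/bin/sh", "/dev/fd/6/script", "-v", NULL }.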

> [As an aside, IIRC the filename does get put into the new
> process's memory, up above the environment strings -- but
> that copy isn't visible via argv nor envp.]
> 
>> It's AT_EXECFN,
>> /proc/self/exe, and filenames shown elsewhere in /proc that may be
>> derived in odd ways.
>>
>> I would also move the text about O_CLOEXEC to a BUGS or NOTES section
>> rather than the main description. The long-term intent should be that
>> script execution this way should work. IIRC this was discussed earlier
>> in the thread.
> 
> I may be misremembering, but I thought we hoped to be able to fix
> execveat of a script without /proc in future, but didn't expect to fix
> execveat of a script via an O_CLOEXEC fd (because in the latter
> case the fd gets closed before the script interpreter runs, so even
> if the interpreter (or a special filesystem) does clever things for names
> starting with "/dev/fd/..." the file descriptor is already gone).

See my other replies (and of course, Rich's). It does seem there is 
a real problem to be solved here.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-10  7:56         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-10  7:56 UTC (permalink / raw)
  To: David Drysdale
  Cc: mtk.manpages, Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Rich Felker,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On 01/09/2015 07:02 PM, David Drysdale wrote:
> On Fri, Jan 9, 2015 at 3:47 PM, Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>> ---
>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 153 insertions(+)
>>>  create mode 100644 man2/execveat.2
>>
>> David,
>>
>> Thanks for the very nicely prepared man page. I've done
>> a few very light edits, and will release the version below
>> with the next man-pages release.
> 
> Many thanks, one error (of mine) in 2 places pointed out below.

Well, the first error was yours. The second error was mine,
when I replicated your info about AT_SYMLINK_NOFOLLOW
into the ERRORS without verifying it. (Sorry about that!)

Both cases fixed now.

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 17:46         ` David Drysdale
@ 2015-01-10  8:27           ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 123+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-01-10  8:27 UTC (permalink / raw)
  To: David Drysdale, Rich Felker
  Cc: mtk.manpages, Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel, Andrew Morton, David Miller,
	Thomas Gleixner, Stephen Rothwell, Oleg Nesterov, Ingo Molnar,
	H. Peter Anvin, Kees Cook, Arnd Bergmann, Christoph Hellwig,
	X86 ML, linux-arch, Linux API, sparclinux

On 01/09/2015 06:46 PM, David Drysdale wrote:
> On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
>> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
>>> On 11/24/2014 12:53 PM, David Drysdale wrote:
>>>> Signed-off-by: David Drysdale <drysdale@google.com>
>>>> ---
>>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 153 insertions(+)
>>>>  create mode 100644 man2/execveat.2
>>>
>>> David,
>>>
>>> Thanks for the very nicely prepared man page. I've done
>>> a few very light edits, and will release the version below
>>> with the next man-pages release.
>>>
>>> I have one question. In the message accompanying
>>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
>>>
>>>   The filename fed to the executed program as argv[0] (or the name of the
>>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>>>   reflecting how the executable was found.  This does however mean that
>>>   execution of a script in a /proc-less environment won't work; also, script
>>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
>>>   accessible after exec).
>>>
>>> How does one produce this situation where the execed program sees
>>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
>>> call look like?) I tried to produce this scenario, but could not.
>>
>> I think this is wrong. argv[0] is an arbitrary string provided by the
>> caller and would never be derived from the fd passed.
> 
> Yeah, I think I just wrote that wrong, it's only relevant for scripts.
> As Rich says, for normal binaries argv[0] is just the argv[0] that
> was passed into the execve[at] call.  For a script, the code in
> fs/binfmt_script.c will remove the original argv[0] and put the
> interpreter name and the script filename (e.g. "/bin/sh",
> "/dev/fd/6/script") in as 2 arguments in its place.

So, on reflection, I think it's worth saying something about this, and 
I added the following text to the man page:

   NOTES
       When asked to execute a script file, the argv[0] that  is  passed
       to  the  script  interpreter is a string of the form /dev/fd/N or
       /dev/fd/N/P, where N is the number of the file descriptor  passed
       via  the  dirfd argument.  A string of the first form occurs when
       AT_EMPTY_PATH is employed.  A string of the  second  form  occurs
       when the script is specified via both dirfd and pathname; in this
       case, P is the value given in pathname.
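
As an illustration of the two forms in the NOTES text above, here is a
small userspace sketch that builds the same strings. It is not kernel
code, script_name_for() is a hypothetical name, and it covers only the
two cases the text describes (it does not model AT_FDCWD or absolute
pathnames):

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Build "/dev/fd/N" when pathname is empty (the AT_EMPTY_PATH case),
 * otherwise "/dev/fd/N/P".  Returns 0 on success, or -1 if the result
 * does not fit in buf.
 */
int script_name_for(int dirfd, const char *pathname, char *buf, size_t len)
{
    int n;

    if (pathname == NULL || pathname[0] == '\0')
        n = snprintf(buf, len, "/dev/fd/%d", dirfd);
    else
        n = snprintf(buf, len, "/dev/fd/%d/%s", dirfd, pathname);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For dirfd 6 this gives "/dev/fd/6" for an empty pathname and
"/dev/fd/6/script" for the pathname "script".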

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10  8:27           ` Michael Kerrisk (man-pages)
@ 2015-01-10 13:31             ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-10 13:31 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: David Drysdale, Eric W. Biederman, Andy Lutomirski,
	Alexander Viro, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Sat, Jan 10, 2015 at 09:27:46AM +0100, Michael Kerrisk (man-pages) wrote:
> On 01/09/2015 06:46 PM, David Drysdale wrote:
> > On Fri, Jan 9, 2015 at 4:13 PM, Rich Felker <dalias@aerifal.cx> wrote:
> >> On Fri, Jan 09, 2015 at 04:47:31PM +0100, Michael Kerrisk (man-pages) wrote:
> >>> On 11/24/2014 12:53 PM, David Drysdale wrote:
> >>>> Signed-off-by: David Drysdale <drysdale@google.com>
> >>>> ---
> >>>>  man2/execveat.2 | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 153 insertions(+)
> >>>>  create mode 100644 man2/execveat.2
> >>>
> >>> David,
> >>>
> >>> Thanks for the very nicely prepared man page. I've done
> >>> a few very light edits, and will release the version below
> >>> with the next man-pages release.
> >>>
> >>> I have one question. In the message accompanying
> >>> commit 51f39a1f0cea1cacf8c787f652f26dfee9611874 you wrote:
> >>>
> >>>   The filename fed to the executed program as argv[0] (or the name of the
> >>>   script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
> >>>   (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
> >>>   reflecting how the executable was found.  This does however mean that
> >>>   execution of a script in a /proc-less environment won't work; also, script
> >>>   execution via an O_CLOEXEC file descriptor fails (as the file will not be
> >>>   accessible after exec).
> >>>
> >>> How does one produce this situation where the execed program sees
> >>> argv[0] as a /dev/fd path? (i.e., what would the execveat()
> >>> call look like?) I tried to produce this scenario, but could not.
> >>
> >> I think this is wrong. argv[0] is an arbitrary string provided by the
> >> caller and would never be derived from the fd passed.
> > 
> > Yeah, I think I just wrote that wrong, it's only relevant for scripts.
> > As Rich says, for normal binaries argv[0] is just the argv[0] that
> > was passed into the execve[at] call.  For a script, the code in
> > fs/binfmt_script.c will remove the original argv[0] and put the
> > interpreter name and the script filename (e.g. "/bin/sh",
> > "/dev/fd/6/script") in as 2 arguments in its place.
> 
> So, on reflection, I think it's worth saying something about this, and 
> I added the following text to the man page:
> 
>    NOTES
>        When asked to execute a script file, the argv[0] that  is  passed
>        to  the  script  interpreter is a string of the form /dev/fd/N or
>        /dev/fd/N/P, where N is the number of the file descriptor  passed
>        via  the  dirfd argument.  A string of the first form occurs when
>        AT_EMPTY_PATH is employed.  A string of the  second  form  occurs
>        when the script is specified via both dirfd and pathname; in this
>        case, P is the value given in pathname.

While I'm aware that you're simply documenting, it seems unnecessary
to me (and unnecessarily complicating of the cloexec issue) to have
the /dev/fd/N/P form. This could always be resolved by the kernel to a
single temp fd for the new process to use, and in fact it's probably
preferable to always get a "temp fd" in case the fd passed to fexecve
is NOT a throwaway one (e.g. if the original fd was stdin or
something); the program being executed should not have to use ugly and
error-prone heuristics to decide if it should close the exec fd.

On the other hand, this resolution could be done by userspace (open
with O_PATH|O_CLOEXEC prior to making the execveat syscall, and
always passing AT_EMPTY_PATH to the kernel) if desirable, so maybe it
doesn't make sense to have the kernel do it. In this sense the whole
"at" part of execveat becomes vestigial, though.

Any thoughts?

Rich
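
The userspace resolution suggested above (open with O_PATH|O_CLOEXEC,
then always pass AT_EMPTY_PATH) might look roughly like the sketch
below. It is an illustration only, not code from this thread: the
helper name exec_via_temp_fd is hypothetical, and the raw syscall() is
used on the assumption that the libc in use may not provide an
execveat() wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: resolve (dirfd, path) to a single close-on-exec
 * O_PATH descriptor in userspace, then exec it with an empty pathname.
 * Does not return on success; returns -1 with errno set on failure.
 */
int exec_via_temp_fd(int dirfd, const char *path,
                     char *const argv[], char *const envp[])
{
    assert(path != NULL);

    /* O_CLOEXEC is set atomically, so the fd never leaks to other execs. */
    int fd = openat(dirfd, path, O_PATH | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* With AT_EMPTY_PATH the kernel executes the file fd refers to. */
    syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);

    /* Only reached if the exec failed: clean up, preserving errno. */
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```

As discussed in this thread, a close-on-exec descriptor like this is
fine for ordinary binaries, but a script's interpreter would be handed
a /dev/fd/N name that no longer exists after the exec.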

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10  5:57                                         ` Rich Felker
  (?)
@ 2015-01-10 22:27                                           ` Eric W. Biederman
  -1 siblings, 0 replies; 123+ messages in thread
From: Eric W. Biederman @ 2015-01-10 22:27 UTC (permalink / raw)
  To: Rich Felker
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

Rich Felker <dalias@aerifal.cx> writes:

> On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:

>> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
>> etc.) before bothering with open(2), you'll get screwed.
>
> Yes, but I think that would be very bad interpreter design.
> stat/getxattr/access/whatever followed by open is always a TOCTOU
> race. The correct sequence of actions is always open followed by
> fstat/fgetxattr/...

Sigh.  I think everyone who has looked at this has been blind.

If userspace is reasonable all we have to do is fix /proc/self/exe
for shell scripts to point at the actual script,
and then pass /proc/self/exe on the shell script's command line.

At a practical level we have to worry about backwards compatibility and
chroot jails.  But the existence of a clean implementation with
/proc/self/exe serves as a proof of concept that it would not be too
difficult.  When someone cares enough to implement it.

Eric
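
The open-then-fstat ordering from the quoted text above can be sketched
as follows. This is a minimal illustration, not code from any
interpreter mentioned in the thread; the helper name open_regular_file
and the choice to accept only regular files are assumptions made for
the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open first, then inspect the object that was
 * actually opened.  Returns an fd for a regular file, or -1.
 */
int open_regular_file(const char *path, struct stat *st)
{
    assert(path != NULL && st != NULL);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* fstat() describes the open file, not whatever the path names now,
     * so there is no window between the check and the use. */
    if (fstat(fd, st) < 0 || !S_ISREG(st->st_mode)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A stat() on the pathname followed by open() would check one object and
possibly open another if the path is replaced in between.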

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10 22:27                                           ` Eric W. Biederman
@ 2015-01-11  1:15                                             ` Rich Felker
  -1 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-11  1:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Sat, Jan 10, 2015 at 04:27:23PM -0600, Eric W. Biederman wrote:
> Rich Felker <dalias@aerifal.cx> writes:
> 
> > On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:
> 
> >> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
> >> etc.) before bothering with open(2), you'll get screwed.
> >
> > Yes, but I think that would be very bad interpreter design.
> > stat/getxattr/access/whatever followed by open is always a TOCTOU
> > race. The correct sequence of actions is always open followed by
> > fstat/fgetxattr/...
> 
> Sigh.  I think everyone who has looked at this has been blind.
> 
> If userspace is reasonable all we have to do is fix /proc/self/exe
> for shell scripts to point at the actual script,
> and then pass /proc/self/exe on the shell script's command line.
> 
> At a practical level we have to worry about backwards compatibility and
> chroot jails.  But the existence of a clean implementation with
> /proc/self/exe serves as a proof of concept that it would not be too
> difficult.  When someone cares enough to implement it.

Is /proc/self/exe a "magic symlink" that's bound to the inode, or just
a regular symlink? In the latter case it defeats the whole purpose of
using O_EXEC fds and fexecve rather than pathnames.

Rich

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-11  1:15                                             ` Rich Felker
  (?)
@ 2015-01-11  2:09                                               ` Eric W. Biederman
  -1 siblings, 0 replies; 123+ messages in thread
From: Eric W. Biederman @ 2015-01-11  2:09 UTC (permalink / raw)
  To: Rich Felker
  Cc: Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

Rich Felker <dalias@aerifal.cx> writes:

> On Sat, Jan 10, 2015 at 04:27:23PM -0600, Eric W. Biederman wrote:
>> Rich Felker <dalias@aerifal.cx> writes:
>> 
>> > On Sat, Jan 10, 2015 at 04:14:57AM +0000, Al Viro wrote:
>> 
>> >> Except that if your interpreter does stat(2) (or access(2), or getxattr(2),
>> >> etc.) before bothering with open(2), you'll get screwed.
>> >
>> > Yes, but I think that would be very bad interpreter design.
>> > stat/getxattr/access/whatever followed by open is always a TOCTOU
>> > race. The correct sequence of actions is always open followed by
>> > fstat/fgetxattr/...
>> 
>> Sigh.  I think everyone who has looked at this has been blind.
>> 
>> If userspace is reasonable all we have to do is fix /proc/self/exe
>> for shell scripts to point at the actual script,
>> and then pass /proc/self/exe on the shell script's command line.
>> 
>> At a practical level we have to worry about backwards compatibility and
>> chroot jails.  But the existence of a clean implementation with
>> /proc/self/exe serves as a proof of concept that it would not be too
>> difficult.  When someone cares enough to implement it.
>
> Is /proc/self/exe a "magic symlink" that's bound to the inode, or just
> a regular symlink? In the latter case it defeats the whole purpose of
> using O_EXEC fds and fexecve rather than pathnames.

In implementation /proc/self/exe is a named rather than a numbered file
descriptor.  Essentially when loading an elf executable the file
descriptor is duped to the name /proc/self/exe.  The implementation
otherwise is the same as /proc/self/fd/N.

The downside of course is that I expect if we were actually to change
/proc/self/exe to point at the script instead of the shell some
piece of software somewhere would come melting down.  I am totally not
ready to consider that kind of minefield today.

Eric
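
To see the similarity between /proc/self/exe and /proc/self/fd/N
described above from userspace, one can read both links the same way.
The sketch below is only an illustration and assumes /proc is mounted;
the helper names read_proc_link and show_links are hypothetical. The
text returned by readlink() is just a name for display, which is
separate from what opening the link resolves to.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the target name of a /proc link into buf
 * (NUL-terminated).  Returns the length, or -1 on error.
 */
ssize_t read_proc_link(const char *link, char *buf, size_t len)
{
    assert(link != NULL && buf != NULL && len > 0);

    ssize_t n = readlink(link, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}

/* Print what /proc/self/exe and /proc/self/fd/<fd> currently name. */
void show_links(int fd)
{
    char link[64], target[4096];

    if (read_proc_link("/proc/self/exe", target, sizeof target) >= 0)
        printf("exe -> %s\n", target);

    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    if (read_proc_link(link, target, sizeof target) >= 0)
        printf("fd %d -> %s\n", fd, target);
}
```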


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
@ 2015-01-11 11:02                                                 ` Christoph Hellwig
  0 siblings, 0 replies; 123+ messages in thread
From: Christoph Hellwig @ 2015-01-11 11:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rich Felker, Al Viro, David Drysdale, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Sat, Jan 10, 2015 at 08:09:10PM -0600, Eric W. Biederman wrote:
> In implementation /proc/self/exe is a named rather than a numbered file
> descriptor.  Essentially when loading an elf executable the file
> descriptor is duped to the name /proc/self/exe.  The implementation
> otherwise is the same as /proc/self/fd/N.
> 
> The downside of course is that I expect if we were actually to change
> /proc/self/exe to point at the script instead of the shell some
> piece of software somewhere would come melting down.  I am totally not
> ready to consider that kind of minefield today.

We could add a /proc/self/script that points to the script, and either
is not available or still points to the executable if we are not running
a script.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-10  1:33                           ` Rich Felker
  (?)
  (?)
@ 2015-01-12 11:33                           ` David Drysdale
  2015-01-12 16:07                               ` Rich Felker
  -1 siblings, 1 reply; 123+ messages in thread
From: David Drysdale @ 2015-01-12 11:33 UTC (permalink / raw)
  To: Rich Felker
  Cc: Eric W. Biederman, Al Viro, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Sat, Jan 10, 2015 at 1:33 AM, Rich Felker <dalias@aerifal.cx> wrote:
> On Fri, Jan 09, 2015 at 07:17:41PM -0600, Eric W. Biederman wrote:
>> Rich Felker <dalias@aerifal.cx> writes:
>>
>> > I'm not proposing code because I'm a libc developer not a kernel
>> > developer. I know what's needed for userspace to provide a conforming
>> > fexecve to applications, not how to implement that on the kernel side,
>> > although I'm trying to provide constructive ideas. The hostility is
>> > really not necessary.
>>
>> Conforming to what?
>>
>> The open group fexecve says nothing about requiring a file descriptor
>> passed to fexecve to have O_CLOEXEC.
>
> It doesn't require it but it allows it, and in multithreaded programs
> that might run child processes (or library code that might be used in
> such situations), O_CLOEXEC is mandatory everywhere to avoid fd leaks.

As a naive idea related to Andy's suggestion elsewhere, could you
just have an environment convention for fexecve-ing scripts?  That
would reduce FD leaks without any need for kernel involvement/changes.

For example, set _FEXECVED_VIA_FD=4 but don't set
O_CLOEXEC before fexecve, and the interpreter reads then
closes that FD.  Or just get the interpreter to spot scripts named
"/dev/fd/%d" and read-then-close the FD that way, cf. Eric's suggestion
at https://lkml.org/lkml/2014/10/22/652.

By the way, FreeBSD has a fexecve(2) syscall that behaves
in the same way as the current Linux code for an O_CLOEXEC
script -- the interpreter fails to open "/dev/fd/6" as it's gone.
Do you know if there are any other OSes that already do
something more sophisticated for this case?
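
The second option above, having the interpreter spot scripts named
"/dev/fd/%d", could be sketched on the interpreter side as below. This
is an assumption-laden illustration rather than code from any existing
interpreter: the helper name fd_from_dev_fd_path is hypothetical, and
it deliberately rejects the longer /dev/fd/N/P form.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if path is exactly "/dev/fd/<decimal>", return
 * that descriptor number; otherwise return -1.
 */
int fd_from_dev_fd_path(const char *path)
{
    static const char prefix[] = "/dev/fd/";
    const size_t plen = sizeof prefix - 1;

    assert(path != NULL);

    if (strncmp(path, prefix, plen) != 0)
        return -1;

    /* Require a digit so that signs and whitespace are rejected. */
    const char *digits = path + plen;
    if (!isdigit((unsigned char)digits[0]))
        return -1;

    char *end;
    errno = 0;
    long val = strtol(digits, &end, 10);

    /* Trailing text (such as "/script") or overflow means no match. */
    if (errno != 0 || *end != '\0' || val > INT_MAX)
        return -1;
    return (int)val;
}
```

An interpreter using such a check would read the script from the
returned descriptor and then close it, instead of opening the path.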

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-09 21:50                     ` Al Viro
  (?)
  (?)
@ 2015-01-12 14:18                     ` David Drysdale
  -1 siblings, 0 replies; 123+ messages in thread
From: David Drysdale @ 2015-01-12 14:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Rich Felker, Michael Kerrisk (man-pages),
	Eric W. Biederman, Andy Lutomirski, Meredydd Luff, linux-kernel,
	Andrew Morton, David Miller, Thomas Gleixner, Stephen Rothwell,
	Oleg Nesterov, Ingo Molnar, H. Peter Anvin, Kees Cook,
	Arnd Bergmann, Christoph Hellwig, X86 ML, linux-arch, Linux API,
	sparclinux

On Fri, Jan 9, 2015 at 9:50 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jan 09, 2015 at 04:28:52PM -0500, Rich Felker wrote:
>
>> The "magic open-once magic symlink" approach is really the cleanest
>> solution I can find. In the case where the interpreter does not open
>> the script, nothing terribly bad happens; the magic symlink just
>> sticks around until _exit or exec. In the case where the interpreter
>> opens it more than once, you get a failure, but as far as I know
>> existing interpreters don't do this, and it's arguably bad design. In
>> any case it's a caught error.
>
> You know what's cleaner than that?  git revert 27d6ec7ad
> It has just been merged; until 3.19 it's fair game for removal.
>
> And yes, I should've NAKed the damn thing loud and clear, rather than
> asking questions back then, getting no answers and letting it slip.
> Mea culpa.

Al, I'm sorry if I missed a question or concern of yours back in
October -- I certainly didn't intend to (that would be foolish indeed!).

[I thought the main open question was whether a dupfs
implementation would help with /dev/fd/ and /proc/ semantics, but I
had the (possibly incorrect) understanding that that was somewhat
orthogonal to the execveat implementation.]

Are there any changes/fixes/refactorings that I could do (especially
within the 3.19 timeframe) that would help mollify at all?

> Back then the procfs-free environments had been pushed as a serious argument
> in favour of merging the damn thing.  Now you guys turn around and say that
> we not only need procfs mounted, we need a yet-to-be-added kludge in there
> to cope with the actual intended uses.

Not me!

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2)
  2015-01-12 11:33                           ` David Drysdale
@ 2015-01-12 16:07                               ` Rich Felker
  0 siblings, 0 replies; 123+ messages in thread
From: Rich Felker @ 2015-01-12 16:07 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Al Viro, Michael Kerrisk (man-pages),
	Andy Lutomirski, Meredydd Luff, linux-kernel, Andrew Morton,
	David Miller, Thomas Gleixner, Stephen Rothwell, Oleg Nesterov,
	Ingo Molnar, H. Peter Anvin, Kees Cook, Arnd Bergmann,
	Christoph Hellwig, X86 ML, linux-arch, Linux API, sparclinux

On Mon, Jan 12, 2015 at 11:33:49AM +0000, David Drysdale wrote:
> On Sat, Jan 10, 2015 at 1:33 AM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Fri, Jan 09, 2015 at 07:17:41PM -0600, Eric W. Biederman wrote:
> >> Rich Felker <dalias@aerifal.cx> writes:
> >>
> >> > I'm not proposing code because I'm a libc developer not a kernel
> >> > developer. I know what's needed for userspace to provide a conforming
> >> > fexecve to applications, not how to implement that on the kernel side,
> >> > although I'm trying to provide constructive ideas. The hostility is
> >> > really not necessary.
> >>
> >> Conforming to what?
> >>
> >> The Open Group's fexecve specification says nothing about requiring a
> >> file descriptor passed to fexecve to have O_CLOEXEC.
> >
> > It doesn't require it but it allows it, and in multithreaded programs
> > that might run child processes (or library code that might be used in
> > such situations), O_CLOEXEC is mandatory everywhere to avoid fd leaks.
> 
> As a naive idea related to Andy's suggestion elsewhere, could you
> just have an environment convention for fexecve-ing scripts?  That
> would reduce FD leaks without any need for kernel involvement/changes.
> 
> For example, set _FEXECVED_VIA_FD=4 but don't set
> O_CLOEXEC before fexecve, and the interpreter reads then
> closes that FD.  Or just get the interpreter to spot scripts named
> "/dev/fd/%d" and read-then-close the FD that way, cf. Eric's suggestion
> at https://lkml.org/lkml/2014/10/22/652.

No. Any omission of O_CLOEXEC even momentarily is a potentially
dangerous fd leak. This is the case whenever the process is
multithreaded and it's possible that other threads might fork and
exec. Think of the case of a privileged daemon re-execing itself (e.g.
to switch to an updated version) while there are potentially other
threads spawning non-privileged processes.

Rich
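
A minimal sketch of the window described above, assuming a POSIX.1-2008
system. The helper names (open_then_set_cloexec, open_cloexec,
has_cloexec) are hypothetical and only illustrate the point; they are
not taken from the patch set or from any libc.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Racy variant: between open() and fcntl() the descriptor is
 * inheritable, so a fork()+exec() performed by another thread in that
 * window leaks it into the child.
 */
int open_then_set_cloexec(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* window: another thread may fork and exec right here */
    if (fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Race-free variant: the flag is set atomically as part of open(). */
int open_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Both helpers end up with FD_CLOEXEC set; the difference is only whether
the descriptor ever exists without it. Clearing the flag before
fexecve(), as the environment-variable convention would require,
reopens the same window.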


Thread overview: 123+ messages
2014-11-24 11:53 [PATCHv10 0/5] syscalls,x86,sparc: Add execveat() system call David Drysdale
2014-11-24 11:53 ` David Drysdale
2014-11-24 11:53 ` [PATCHv10 1/5] syscalls: implement " David Drysdale
2014-11-24 11:53   ` David Drysdale
2014-11-24 11:53 ` [PATCHv10 2/5] x86: Hook up execveat " David Drysdale
2014-11-24 12:45   ` Thomas Gleixner
2014-11-24 12:45     ` Thomas Gleixner
2014-11-24 12:45     ` Thomas Gleixner
2014-11-24 17:06   ` Dan Carpenter
2014-11-24 17:06     ` Dan Carpenter
2014-11-24 17:06     ` Dan Carpenter
2014-11-24 18:26     ` David Drysdale
2014-11-24 18:26       ` David Drysdale
2014-11-25 12:16       ` Dan Carpenter
2014-11-25 12:16         ` Dan Carpenter
2014-11-25 12:16         ` Dan Carpenter
2014-11-24 18:53     ` Thomas Gleixner
2014-11-24 18:53       ` Thomas Gleixner
2014-11-24 11:53 ` [PATCHv10 3/5] syscalls: add selftest for execveat(2) David Drysdale
2014-11-24 11:53   ` David Drysdale
2014-11-24 11:53 ` [PATCHv10 4/5] sparc: Hook up execveat system call David Drysdale
2014-11-24 18:36   ` David Miller
2014-11-24 18:36     ` David Miller
2014-11-24 11:53 ` [PATCHv10 man-pages 5/5] execveat.2: initial man page for execveat(2) David Drysdale
2015-01-09 15:47   ` Michael Kerrisk (man-pages)
2015-01-09 15:47     ` Michael Kerrisk (man-pages)
2015-01-09 16:13     ` Rich Felker
2015-01-09 16:13       ` Rich Felker
2015-01-09 17:46       ` David Drysdale
2015-01-09 17:46         ` David Drysdale
2015-01-09 17:46         ` David Drysdale
2015-01-09 20:48         ` Rich Felker
2015-01-09 20:48           ` Rich Felker
2015-01-09 20:48           ` Rich Felker
2015-01-09 20:56           ` Al Viro
2015-01-09 20:56             ` Al Viro
2015-01-09 20:59             ` Rich Felker
2015-01-09 20:59               ` Rich Felker
2015-01-09 20:59               ` Rich Felker
2015-01-09 21:09               ` Al Viro
2015-01-09 21:09                 ` Al Viro
2015-01-09 21:09                 ` Al Viro
2015-01-09 21:28                 ` Rich Felker
2015-01-09 21:28                   ` Rich Felker
2015-01-09 21:50                   ` Al Viro
2015-01-09 21:50                     ` Al Viro
2015-01-09 22:17                     ` Rich Felker
2015-01-09 22:17                       ` Rich Felker
2015-01-09 22:33                       ` Al Viro
2015-01-09 22:33                         ` Al Viro
2015-01-09 22:42                         ` Rich Felker
2015-01-09 22:42                           ` Rich Felker
2015-01-09 22:57                           ` Al Viro
2015-01-09 22:57                             ` Al Viro
2015-01-09 22:57                             ` Al Viro
2015-01-09 23:12                             ` Rich Felker
2015-01-09 23:12                               ` Rich Felker
2015-01-09 23:24                               ` Andy Lutomirski
2015-01-09 23:24                                 ` Andy Lutomirski
2015-01-09 23:37                                 ` Rich Felker
2015-01-09 23:37                                   ` Rich Felker
2015-01-10  0:01                                 ` Al Viro
2015-01-09 23:36                               ` Al Viro
2015-01-09 23:36                                 ` Al Viro
2015-01-10  3:03                                 ` Al Viro
2015-01-10  3:03                                   ` Al Viro
2015-01-10  3:03                                   ` Al Viro
2015-01-10  3:41                                   ` Rich Felker
2015-01-10  3:41                                     ` Rich Felker
2015-01-10  4:14                                     ` Al Viro
2015-01-10  5:57                                       ` Rich Felker
2015-01-10  5:57                                         ` Rich Felker
2015-01-10 22:27                                         ` Eric W. Biederman
2015-01-10 22:27                                           ` Eric W. Biederman
2015-01-10 22:27                                           ` Eric W. Biederman
2015-01-11  1:15                                           ` Rich Felker
2015-01-11  1:15                                             ` Rich Felker
2015-01-11  2:09                                             ` Eric W. Biederman
2015-01-11  2:09                                               ` Eric W. Biederman
2015-01-11  2:09                                               ` Eric W. Biederman
2015-01-11 11:02                                               ` Christoph Hellwig
2015-01-11 11:02                                                 ` Christoph Hellwig
2015-01-11 11:02                                                 ` Christoph Hellwig
2015-01-12 14:18                     ` David Drysdale
2015-01-09 22:13                   ` Eric W. Biederman
2015-01-09 22:13                     ` Eric W. Biederman
2015-01-09 22:13                     ` Eric W. Biederman
2015-01-09 22:13                     ` Eric W. Biederman
2015-01-09 22:38                     ` Rich Felker
2015-01-09 22:38                       ` Rich Felker
2015-01-10  1:17                       ` Eric W. Biederman
2015-01-10  1:17                         ` Eric W. Biederman
2015-01-10  1:17                         ` Eric W. Biederman
2015-01-10  1:17                         ` Eric W. Biederman
2015-01-10  1:33                         ` Rich Felker
2015-01-10  1:33                           ` Rich Felker
2015-01-10  1:33                           ` Rich Felker
2015-01-12 11:33                           ` David Drysdale
2015-01-12 16:07                             ` Rich Felker
2015-01-12 16:07                               ` Rich Felker
2015-01-10  7:13                     ` Michael Kerrisk (man-pages)
2015-01-10  7:13                       ` Michael Kerrisk (man-pages)
2015-01-09 21:20               ` Eric W. Biederman
2015-01-09 21:20                 ` Eric W. Biederman
2015-01-09 21:20                 ` Eric W. Biederman
2015-01-09 21:31                 ` Rich Felker
2015-01-09 21:31                   ` Rich Felker
2015-01-09 21:31                   ` Rich Felker
2015-01-10  7:43         ` Michael Kerrisk (man-pages)
2015-01-10  7:43           ` Michael Kerrisk (man-pages)
2015-01-10  7:43           ` Michael Kerrisk (man-pages)
2015-01-10  8:27         ` Michael Kerrisk (man-pages)
2015-01-10  8:27           ` Michael Kerrisk (man-pages)
2015-01-10 13:31           ` Rich Felker
2015-01-10 13:31             ` Rich Felker
2015-01-10  7:38       ` Michael Kerrisk (man-pages)
2015-01-10  7:38         ` Michael Kerrisk (man-pages)
2015-01-10  7:38         ` Michael Kerrisk (man-pages)
2015-01-09 18:02     ` David Drysdale
2015-01-09 18:02       ` David Drysdale
2015-01-10  7:56       ` Michael Kerrisk (man-pages)
2015-01-10  7:56         ` Michael Kerrisk (man-pages)
2015-01-10  7:56         ` Michael Kerrisk (man-pages)