linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHv5 0/3] syscalls,x86: Add execveat() system call
@ 2014-10-22 11:44 David Drysdale
  2014-10-22 11:44 ` [PATCHv5 1/3] syscalls,x86: implement " David Drysdale
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: David Drysdale @ 2014-10-22 11:44 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api,
	David Drysdale

This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem.  The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.

Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.

Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns.  The current implementation just
defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other
flags could be added in future -- for example, flags for new namespaces
(as suggested at https://lkml.org/lkml/2006/7/11/474).

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
   it's not possible to fexecve() a file descriptor for a script with
   close-on-exec set (which is possible with the implementation here).
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.


Changes since v4, suggested by Eric W. Biederman:
 - Use empty filename with AT_EMPTY_PATH flag rather than NULL
   pathname to request fexecve-like behaviour.
 - Build pathname as "/dev/fd/<fd>/<filename>" (or "/dev/fd/<fd>")
   rather than using d_path().
 - Patch against v3.17 (bfe01a5ba249)

Changes since Meredydd's v3 patch:
 - Added a selftest.
 - Added a man page.
 - Left open_exec() signature untouched to reduce patch impact
   elsewhere (as suggested by Al Viro).
 - Filled in bprm->filename with d_path() into a buffer, to avoid use
   of potentially-ephemeral dentry->d_name.
 - Patch against v3.14 (455c6fdbd21916).


David Drysdale (2):
  syscalls,x86: implement execveat() system call
  syscalls,x86: add selftest for execveat(2)

 arch/x86/ia32/audit.c                   |   1 +
 arch/x86/ia32/ia32entry.S               |   1 +
 arch/x86/kernel/audit_64.c              |   1 +
 arch/x86/kernel/entry_64.S              |  28 +++
 arch/x86/syscalls/syscall_32.tbl        |   1 +
 arch/x86/syscalls/syscall_64.tbl        |   2 +
 arch/x86/um/sys_call_table_64.c         |   1 +
 fs/exec.c                               | 130 ++++++++++++--
 fs/namei.c                              |   2 +-
 include/linux/compat.h                  |   3 +
 include/linux/fs.h                      |   1 +
 include/linux/sched.h                   |   4 +
 include/linux/syscalls.h                |   4 +
 include/uapi/asm-generic/unistd.h       |   4 +-
 kernel/sys_ni.c                         |   3 +
 lib/audit.c                             |   3 +
 tools/testing/selftests/Makefile        |   1 +
 tools/testing/selftests/exec/.gitignore |   6 +
 tools/testing/selftests/exec/Makefile   |  22 +++
 tools/testing/selftests/exec/execveat.c | 295 ++++++++++++++++++++++++++++++++
 20 files changed, 497 insertions(+), 16 deletions(-)
 create mode 100644 tools/testing/selftests/exec/.gitignore
 create mode 100644 tools/testing/selftests/exec/Makefile
 create mode 100644 tools/testing/selftests/exec/execveat.c

-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-22 11:44 [PATCHv5 0/3] syscalls,x86: Add execveat() system call David Drysdale
@ 2014-10-22 11:44 ` David Drysdale
  2014-10-22 18:07   ` Eric W. Biederman
  2014-10-22 18:44   ` Andy Lutomirski
  2014-10-22 11:44 ` [PATCHv5 2/3] syscalls,x86: add selftest for execveat(2) David Drysdale
  2014-10-22 11:44 ` [PATCHv5 man-pages 3/3] execveat.2: initial man page " David Drysdale
  2 siblings, 2 replies; 13+ messages in thread
From: David Drysdale @ 2014-10-22 11:44 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api,
	David Drysdale

Add a new system execveat(2) syscall. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in
other UNIXen, but in Linux glibc it depends on opening
"/proc/self/fd/<fd>" (and so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work.

Only x86-64, i386 and x32 ABIs are supported in this patch.

Based on patches by Meredydd Luff <meredydd@senatehouse.org>

Signed-off-by: David Drysdale <drysdale@google.com>
---
 arch/x86/ia32/audit.c             |   1 +
 arch/x86/ia32/ia32entry.S         |   1 +
 arch/x86/kernel/audit_64.c        |   1 +
 arch/x86/kernel/entry_64.S        |  28 ++++++++
 arch/x86/syscalls/syscall_32.tbl  |   1 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 arch/x86/um/sys_call_table_64.c   |   1 +
 fs/exec.c                         | 130 ++++++++++++++++++++++++++++++++++----
 fs/namei.c                        |   2 +-
 include/linux/compat.h            |   3 +
 include/linux/fs.h                |   1 +
 include/linux/sched.h             |   4 ++
 include/linux/syscalls.h          |   4 ++
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c                   |   3 +
 lib/audit.c                       |   3 +
 16 files changed, 173 insertions(+), 16 deletions(-)

diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
index 5d7b381da692..2eccc8932ae6 100644
--- a/arch/x86/ia32/audit.c
+++ b/arch/x86/ia32/audit.c
@@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
 	case __NR_socketcall:
 		return 4;
 	case __NR_execve:
+	case __NR_execveat:
 		return 5;
 	default:
 		return 1;
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..2516c09743e0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -464,6 +464,7 @@ GLOBAL(\label)
 	PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
 	PTREGSCALL stub32_sigreturn, sys32_sigreturn
 	PTREGSCALL stub32_execve, compat_sys_execve
+	PTREGSCALL stub32_execveat, compat_sys_execveat
 	PTREGSCALL stub32_fork, sys_fork
 	PTREGSCALL stub32_vfork, sys_vfork
 
diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..f3672508b249 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
 	case __NR_openat:
 		return 3;
 	case __NR_execve:
+	case __NR_execveat:
 		return 5;
 	default:
 		return 0;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2fac1343a90b..00c4526e6ffe 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -665,6 +665,20 @@ ENTRY(stub_execve)
 	CFI_ENDPROC
 END(stub_execve)
 
+ENTRY(stub_execveat)
+	CFI_STARTPROC
+	addq $8, %rsp
+	PARTIAL_FRAME 0
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	call sys_execveat
+	RESTORE_TOP_OF_STACK %r11
+	movq %rax,RAX(%rsp)
+	RESTORE_REST
+	jmp int_ret_from_sys_call
+	CFI_ENDPROC
+END(stub_execveat)
+
 /*
  * sigreturn is special because it needs to restore all registers on return.
  * This cannot be done with SYSRET, so use the IRET return path instead.
@@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
 	CFI_ENDPROC
 END(stub_x32_execve)
 
+ENTRY(stub_x32_execveat)
+	CFI_STARTPROC
+	addq $8, %rsp
+	PARTIAL_FRAME 0
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	call compat_sys_execveat
+	RESTORE_TOP_OF_STACK %r11
+	movq %rax,RAX(%rsp)
+	RESTORE_REST
+	jmp int_ret_from_sys_call
+	CFI_ENDPROC
+END(stub_x32_execveat)
+
 #endif
 
 /*
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b78168d85..2633e3195455 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,4 @@
 354	i386	seccomp			sys_seccomp
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
+357	i386	execveat		sys_execveat			stub32_execveat
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922727b9..1af5badd159c 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,7 @@
 318	common	getrandom		sys_getrandom
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
+321	64	execveat		stub_execveat
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
@@ -365,3 +366,4 @@
 542	x32	getsockopt		compat_sys_getsockopt
 543	x32	io_setup		compat_sys_io_setup
 544	x32	io_submit		compat_sys_io_submit
+545	x32	execveat		stub_x32_execveat
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723070ca..20c3649d0691 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -31,6 +31,7 @@
 #define stub_fork sys_fork
 #define stub_vfork sys_vfork
 #define stub_execve sys_execve
+#define stub_execveat sys_execveat
 #define stub_rt_sigreturn sys_rt_sigreturn
 
 #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
diff --git a/fs/exec.c b/fs/exec.c
index a2b42a98c743..92a6e14f096a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
 
 #endif /* CONFIG_MMU */
 
-static struct file *do_open_exec(struct filename *name)
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
 {
 	struct file *file;
 	int err;
@@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
 		.intent = LOOKUP_OPEN,
 		.lookup_flags = LOOKUP_FOLLOW,
 	};
+	static const struct open_flags open_exec_nofollow_flags = {
+		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+		.acc_mode = MAY_EXEC | MAY_OPEN,
+		.intent = LOOKUP_OPEN,
+		.lookup_flags = 0,
+	};
 
-	file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
-	if (IS_ERR(file))
-		goto out;
+	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+		return ERR_PTR(-EINVAL);
+
+	if (name->name[0] != '\0') {
+		const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
+						   ? &open_exec_nofollow_flags
+						   : &open_exec_flags);
+
+		file = do_filp_open(fd, name, oflags);
+		if (IS_ERR(file))
+			goto out;
+	} else {
+		file = fget(fd);
+		if (!file)
+			return ERR_PTR(-EBADF);
+
+		err = inode_permission(file->f_path.dentry->d_inode,
+				open_exec_flags.acc_mode);
+		if (err)
+			goto exit;
+	}
 
 	err = -EACCES;
 	if (!S_ISREG(file_inode(file)->i_mode))
@@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
 	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
 		goto exit;
 
-	fsnotify_open(file);
-
 	err = deny_write_access(file);
 	if (err)
 		goto exit;
 
+	if (name->name[0] != '\0')
+		fsnotify_open(file);
+
 out:
 	return file;
 
@@ -786,7 +811,7 @@ exit:
 struct file *open_exec(const char *name)
 {
 	struct filename tmp = { .name = name };
-	return do_open_exec(&tmp);
+	return do_open_execat(AT_FDCWD, &tmp, 0);
 }
 EXPORT_SYMBOL(open_exec);
 
@@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execve_common(struct filename *filename,
-				struct user_arg_ptr argv,
-				struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+			      struct user_arg_ptr argv,
+			      struct user_arg_ptr envp,
+			      int flags)
 {
+	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
 	struct file *file;
 	struct files_struct *displaced;
@@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_exec(filename);
+	file = do_open_execat(fd, filename, flags);
 	retval = PTR_ERR(file);
 	if (IS_ERR(file))
 		goto out_unmark;
@@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
 	sched_exec();
 
 	bprm->file = file;
-	bprm->filename = bprm->interp = filename->name;
+	if (fd == AT_FDCWD || filename->name[0] == '/') {
+		bprm->filename = filename->name;
+	} else {
+		/*
+		 * Build a pathname that reflects how we got to the file,
+		 * either "/dev/fd/<fd>" (for an empty filename) or
+		 * "/dev/fd/<fd>/<filename>".
+		 */
+		pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
+		if (!pathbuf) {
+			retval = -ENOMEM;
+			goto out_unmark;
+		}
+		bprm->filename = pathbuf;
+		if (filename->name[0] == '\0')
+			sprintf(pathbuf, "/dev/fd/%d", fd);
+		else
+			snprintf(pathbuf, PATH_MAX,
+				 "/dev/fd/%d/%s", fd, filename->name);
+	}
+	bprm->interp = bprm->filename;
 
 	retval = bprm_mm_init(bprm);
 	if (retval)
@@ -1532,6 +1579,7 @@ out_unmark:
 
 out_free:
 	free_bprm(bprm);
+	kfree(pathbuf);
 
 out_files:
 	if (displaced)
@@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
 {
 	struct user_arg_ptr argv = { .ptr.native = __argv };
 	struct user_arg_ptr envp = { .ptr.native = __envp };
-	return do_execve_common(filename, argv, envp);
+	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+		const char __user *const __user *__argv,
+		const char __user *const __user *__envp,
+		int flags)
+{
+	struct user_arg_ptr argv = { .ptr.native = __argv };
+	struct user_arg_ptr envp = { .ptr.native = __envp };
+
+	return do_execveat_common(fd, filename, argv, envp, flags);
 }
 
 #ifdef CONFIG_COMPAT
@@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
 		.is_compat = true,
 		.ptr.compat = __envp,
 	};
-	return do_execve_common(filename, argv, envp);
+	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+			      const compat_uptr_t __user *__argv,
+			      const compat_uptr_t __user *__envp,
+			      int flags)
+{
+	struct user_arg_ptr argv = {
+		.is_compat = true,
+		.ptr.compat = __argv,
+	};
+	struct user_arg_ptr envp = {
+		.is_compat = true,
+		.ptr.compat = __envp,
+	};
+	return do_execveat_common(fd, filename, argv, envp, flags);
 }
 #endif
 
@@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
 {
 	return do_execve(getname(filename), argv, envp);
 }
+
+SYSCALL_DEFINE5(execveat,
+		int, fd, const char __user *, filename,
+		const char __user *const __user *, argv,
+		const char __user *const __user *, envp,
+		int, flags)
+{
+	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+	return do_execveat(fd,
+			   getname_flags(filename, lookup_flags, NULL),
+			   argv, envp, flags);
+}
+
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
 	const compat_uptr_t __user *, argv,
@@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
 {
 	return compat_do_execve(getname(filename), argv, envp);
 }
+
+COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
+		       const char __user *, filename,
+		       const compat_uptr_t __user *, argv,
+		       const compat_uptr_t __user *, envp,
+		       int,  flags)
+{
+	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+	return compat_do_execveat(fd,
+				  getname_flags(filename, lookup_flags, NULL),
+				  argv, envp, flags);
+}
 #endif
diff --git a/fs/namei.c b/fs/namei.c
index a7b05bf82d31..553c84d3e0cc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -130,7 +130,7 @@ void final_putname(struct filename *name)
 
 #define EMBEDDED_NAME_MAX	(PATH_MAX - sizeof(struct filename))
 
-static struct filename *
+struct filename *
 getname_flags(const char __user *filename, int flags, int *empty)
 {
 	struct filename *result, *err;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e6494261eaff..7450ca2ac1fc 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
 
 asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
 		     const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+		     const compat_uptr_t __user *argv,
+		     const compat_uptr_t __user *envp, int flags);
 
 asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 		compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94187721ad41..e9818574d738 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
 extern struct file * dentry_open(const struct path *, int, const struct cred *);
 extern int filp_close(struct file *, fl_owner_t id);
 
+extern struct filename *getname_flags(const char __user *, int, int *);
 extern struct filename *getname(const char __user *);
 extern struct filename *getname_kernel(const char *);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b867a4dab38a..33e056da7d33 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
 extern int do_execve(struct filename *,
 		     const char __user * const __user *,
 		     const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+		       const char __user * const __user *,
+		       const char __user * const __user *,
+		       int);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85a9ce4..df5422294deb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
 			      unsigned int flags);
 
+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+			const char __user *const __user *argv,
+			const char __user *const __user *envp, int flags);
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc5c78f..feef07d29663 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
 __SYSCALL(__NR_getrandom, sys_getrandom)
 #define __NR_memfd_create 279
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
+#define __NR_execveat 280
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 
 #undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 281
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 391d4ddb6f4b..efb06058ad3e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
 
 /* operate on Secure Computing state */
 cond_syscall(sys_seccomp);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 1d726a22565b..b8fb5ee81e26 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
 	case __NR_socketcall:
 		return 4;
 #endif
+#ifdef __NR_execveat
+	case __NR_execveat:
+#endif
 	case __NR_execve:
 		return 5;
 	default:
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCHv5 2/3] syscalls,x86: add selftest for execveat(2)
  2014-10-22 11:44 [PATCHv5 0/3] syscalls,x86: Add execveat() system call David Drysdale
  2014-10-22 11:44 ` [PATCHv5 1/3] syscalls,x86: implement " David Drysdale
@ 2014-10-22 11:44 ` David Drysdale
  2014-10-22 11:44 ` [PATCHv5 man-pages 3/3] execveat.2: initial man page " David Drysdale
  2 siblings, 0 replies; 13+ messages in thread
From: David Drysdale @ 2014-10-22 11:44 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api,
	David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 tools/testing/selftests/Makefile        |   1 +
 tools/testing/selftests/exec/.gitignore |   6 +
 tools/testing/selftests/exec/Makefile   |  22 +++
 tools/testing/selftests/exec/execveat.c | 295 ++++++++++++++++++++++++++++++++
 4 files changed, 324 insertions(+)
 create mode 100644 tools/testing/selftests/exec/.gitignore
 create mode 100644 tools/testing/selftests/exec/Makefile
 create mode 100644 tools/testing/selftests/exec/execveat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 36ff2e4c7b6f..210cf68b3511 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@ TARGETS += powerpc
 TARGETS += user
 TARGETS += sysctl
 TARGETS += firmware
+TARGETS += exec
 
 TARGETS_HOTPLUG = cpu-hotplug
 TARGETS_HOTPLUG += memory-hotplug
diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
new file mode 100644
index 000000000000..349a786899a7
--- /dev/null
+++ b/tools/testing/selftests/exec/.gitignore
@@ -0,0 +1,6 @@
+subdir*
+script*
+execveat
+execveat.symlink
+execveat.moved
+execveat.ephemeral
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
new file mode 100644
index 000000000000..428cb1d2a06a
--- /dev/null
+++ b/tools/testing/selftests/exec/Makefile
@@ -0,0 +1,22 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = execveat
+DEPS = execveat.symlink script subdir
+all: $(BINARIES) $(DEPS)
+
+subdir:
+	mkdir -p $@
+script:
+	echo '#!/bin/sh' > $@
+	echo 'exit $$*' >> $@
+	chmod +x $@
+execveat.symlink: execveat
+	ln -s -f $< $@
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	./execveat
+
+clean:
+	rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved
diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
new file mode 100644
index 000000000000..1f2722ed3847
--- /dev/null
+++ b/tools/testing/selftests/exec/execveat.c
@@ -0,0 +1,295 @@
+/*
+ * Copyright (c) 2014 Google, Inc.
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftests for execveat(2).
+ */
+
+#define _GNU_SOURCE  /* to get O_PATH, AT_EMPTY_PATH */
+#include <sys/sendfile.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static char *envp[] = { "IN_TEST=yes", NULL };
+static char *argv[] = { "execveat", "99", NULL };
+
+static int execveat_(int fd, const char *path, char **argv, char **envp,
+		     int flags)
+{
+#ifdef __NR_execveat
+	return syscall(__NR_execveat, fd, path, argv, envp, flags);
+#else
+	errno = -ENOSYS;
+	return -1;
+#endif
+}
+
+#define check_execveat_fail(fd, path, flags, errno)	\
+	_check_execveat_fail(fd, path, flags, errno, #errno)
+static int _check_execveat_fail(int fd, const char *path, int flags,
+				int expected_errno, const char *errno_str)
+{
+	int rc;
+
+	errno = 0;
+	printf("Check failure of execveat(%d, '%s', %d) with %s... ",
+		fd, path?:"(null)", flags, errno_str);
+	rc = execveat_(fd, path, argv, envp, flags);
+
+	if (rc > 0) {
+		printf("[FAIL] (unexpected success from execveat(2))\n");
+		return 1;
+	}
+	if (errno != expected_errno) {
+		printf("[FAIL] (expected errno %d (%s) not %d (%s)\n",
+			expected_errno, strerror(expected_errno),
+			errno, strerror(errno));
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+static int check_execveat_invoked_rc(int fd, const char *path, int flags,
+				     int expected_rc)
+{
+	int status;
+	int rc;
+	pid_t child;
+
+	printf("Check success of execveat(%d, '%s', %d)... ",
+		fd, path?:"(null)", flags);
+	child = fork();
+	if (child < 0) {
+		printf("[FAIL] (fork() failed)\n");
+		return 1;
+	}
+	if (child == 0) {
+		/* Child: do execveat(). */
+		rc = execveat_(fd, path, argv, envp, flags);
+		printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
+			rc, errno, strerror(errno));
+		exit(1);  /* should not reach here */
+	}
+	/* Parent: wait for & check child's exit status. */
+	rc = waitpid(child, &status, 0);
+	if (rc != child) {
+		printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
+		return 1;
+	}
+	if (!WIFEXITED(status)) {
+		printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
+			child, status);
+		return 1;
+	}
+	if (WEXITSTATUS(status) != expected_rc) {
+		printf("[FAIL] (child %d exited with %d not %d)\n",
+			child, WEXITSTATUS(status), expected_rc);
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+static int check_execveat(int fd, const char *path, int flags)
+{
+	return check_execveat_invoked_rc(fd, path, flags, 99);
+}
+
+static char *concat(const char *left, const char *right)
+{
+	char *result = malloc(strlen(left) + strlen(right) + 1);
+
+	strcpy(result, left);
+	strcat(result, right);
+	return result;
+}
+
+static int open_or_die(const char *filename, int flags)
+{
+	int fd = open(filename, flags);
+
+	if (fd < 0) {
+		printf("Failed to open '%s'; "
+			"check prerequisites are available\n", filename);
+		exit(1);
+	}
+	return fd;
+}
+
+static int run_tests(void)
+{
+	int fail = 0;
+	char *fullname = realpath("execveat", NULL);
+	char *fullname_script = realpath("script", NULL);
+	char *fullname_symlink = concat(fullname, ".symlink");
+	int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
+	int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
+					       O_DIRECTORY|O_RDONLY);
+	int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+	int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
+	int fd = open_or_die("execveat", O_RDONLY);
+	int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
+	int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
+	int fd_script = open_or_die("script", O_RDONLY);
+	int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
+	int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
+
+	/* Normal executable file: */
+	/*   dfd + path */
+	fail |= check_execveat(subdir_dfd, "../execveat", 0);
+	fail |= check_execveat(dot_dfd, "execveat", 0);
+	fail |= check_execveat(dot_dfd_path, "execveat", 0);
+	/*   absolute path */
+	fail |= check_execveat(AT_FDCWD, fullname, 0);
+	/*   absolute path with nonsense dfd */
+	fail |= check_execveat(99, fullname, 0);
+	/*   fd + no path */
+	fail |= check_execveat(fd, "", AT_EMPTY_PATH);
+
+	/* Mess with executable file that's already open: */
+	/*   fd + no path to a file that's been renamed */
+	rename("execveat.ephemeral", "execveat.moved");
+	fail |= check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+	/*   fd + no path to a file that's been deleted */
+	unlink("execveat.moved"); /* remove the file now fd open */
+	fail |= check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+
+	/* Invalid argument failures */
+	fail |= check_execveat_fail(fd, "", 0, ENOENT);
+	fail |= check_execveat_fail(fd, NULL, AT_EMPTY_PATH, EFAULT);
+
+	/* Symlink to executable file: */
+	/*   dfd + path */
+	fail |= check_execveat(dot_dfd, "execveat.symlink", 0);
+	fail |= check_execveat(dot_dfd_path, "execveat.symlink", 0);
+	/*   absolute path */
+	fail |= check_execveat(AT_FDCWD, fullname_symlink, 0);
+	/*   fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
+	fail |= check_execveat(fd_symlink, "", AT_EMPTY_PATH);
+	fail |= check_execveat(fd_symlink, "",
+			       AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+
+	/* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
+	/*   dfd + path */
+	fail |= check_execveat_fail(dot_dfd, "execveat.symlink",
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+	fail |= check_execveat_fail(dot_dfd_path, "execveat.symlink",
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+	/*   absolute path */
+	fail |= check_execveat_fail(AT_FDCWD, fullname_symlink,
+				    AT_SYMLINK_NOFOLLOW, ELOOP);
+
+	/* Shell script wrapping executable file: */
+	/*   dfd + path */
+	fail |= check_execveat(subdir_dfd, "../script", 0);
+	fail |= check_execveat(dot_dfd, "script", 0);
+	fail |= check_execveat(dot_dfd_path, "script", 0);
+	/*   absolute path */
+	fail |= check_execveat(AT_FDCWD, fullname_script, 0);
+	/*   fd + no path */
+	fail |= check_execveat(fd_script, "", AT_EMPTY_PATH);
+	fail |= check_execveat(fd_script, "",
+			       AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+
+	/* Mess with script file that's already open: */
+	/*   fd + no path to a file that's been renamed */
+	rename("script.ephemeral", "script.moved");
+	fail |= check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+	/*   fd + no path to a file that's been deleted */
+	unlink("script.moved"); /* remove the file while fd open */
+	fail |= check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+
+	/* Rename a subdirectory in the path: */
+	rename("subdir.ephemeral", "subdir.moved");
+	fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+	fail |= check_execveat(subdir_dfd_ephemeral, "script", 0);
+	/* Remove the subdir and its contents */
+	unlink("subdir.moved/script");
+	unlink("subdir.moved");
+	/* Shell loads via deleted subdir OK because name starts with .. */
+	fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+	fail |= check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
+
+	/* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */
+	fail |= check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
+	/* Invalid path => ENOENT */
+	fail |= check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
+	fail |= check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
+	fail |= check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
+	/* Attempt to execute directory => EACCES */
+	fail |= check_execveat_fail(dot_dfd, "", AT_EMPTY_PATH, EACCES);
+	/* Attempt to execute non-executable => EACCES */
+	fail |= check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
+	/* Attempt to execute file opened with O_PATH => EBADF */
+	fail |= check_execveat_fail(dot_dfd_path, "", AT_EMPTY_PATH, EBADF);
+	fail |= check_execveat_fail(fd_path, "", AT_EMPTY_PATH, EBADF);
+	/* Attempt to execute nonsense FD => EBADF */
+	fail |= check_execveat_fail(99, "", AT_EMPTY_PATH, EBADF);
+	fail |= check_execveat_fail(99, "execveat", 0, EBADF);
+	/* Attempt to execute relative to non-directory => ENOTDIR */
+	fail |= check_execveat_fail(fd, "execveat", 0, ENOTDIR);
+
+	return fail ? -1 : 0;
+}
+
+void exe_cp(const char *src, const char *dest)
+{
+	int in_fd = open_or_die(src, O_RDONLY);
+	int out_fd = open(dest, O_RDWR|O_CREAT|O_TRUNC, 0755);
+	struct stat info;
+
+	fstat(in_fd, &info);
+	sendfile(out_fd, in_fd, NULL, info.st_size);
+	close(in_fd);
+	close(out_fd);
+}
+
+void prerequisites(void)
+{
+	int fd;
+	const char *script = "#!/bin/sh\nexit $*\n";
+
+	/* Create ephemeral copies of files */
+	exe_cp("execveat", "execveat.ephemeral");
+	exe_cp("script", "script.ephemeral");
+	mkdir("subdir.ephemeral", 0755);
+
+	fd = open("subdir.ephemeral/script", O_RDWR|O_CREAT|O_TRUNC, 0755);
+	write(fd, script, strlen(script));
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	int rc;
+
+	if (argc >= 2) {
+		/* If we are invoked with an argument, exit immediately. */
+		/* Check expected environment transferred. */
+		const char *in_test = getenv("IN_TEST");
+
+		if (!in_test || strcmp(in_test, "yes") != 0) {
+			printf("[FAIL] (no IN_TEST=yes in env)\n");
+			return 1;
+		}
+
+		/* Use the final argument as an exit code. */
+		rc = atoi(argv[argc - 1]);
+		fflush(stdout);
+	} else {
+		prerequisites();
+		rc = run_tests();
+	}
+	return rc;
+}
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)
  2014-10-22 11:44 [PATCHv5 0/3] syscalls,x86: Add execveat() system call David Drysdale
  2014-10-22 11:44 ` [PATCHv5 1/3] syscalls,x86: implement " David Drysdale
  2014-10-22 11:44 ` [PATCHv5 2/3] syscalls,x86: add selftest for execveat(2) David Drysdale
@ 2014-10-22 11:44 ` David Drysdale
  2014-10-22 17:55   ` Eric W. Biederman
  2 siblings, 1 reply; 13+ messages in thread
From: David Drysdale @ 2014-10-22 11:44 UTC (permalink / raw)
  To: Eric W. Biederman, Andy Lutomirski, Alexander Viro,
	Meredydd Luff, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api,
	David Drysdale

Signed-off-by: David Drysdale <drysdale@google.com>
---
 man2/execveat.2 | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 man2/execveat.2

diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..d19571a3eb9d
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,144 @@
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date.  The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein.  The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execve(int " fd ", const char *" pathname ","
+.br
+.BI "           char *const " argv "[],  char *const " envp "[],"
+.br
+.BI "           int " flags);
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is an empty string and the
+.BR AT_EMPTY_PATH
+flag is specified, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flags:
+.TP
+.BR AT_EMPTY_PATH
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR fd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag).
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR EINVAL .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
-- 
2.1.0.rc2.206.gedb03e5


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)
  2014-10-22 11:44 ` [PATCHv5 man-pages 3/3] execveat.2: initial man page " David Drysdale
@ 2014-10-22 17:55   ` Eric W. Biederman
  2014-10-22 20:13     ` David Drysdale
  0 siblings, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2014-10-22 17:55 UTC (permalink / raw)
  To: David Drysdale
  Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api

David Drysdale <drysdale@google.com> writes:

> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  man2/execveat.2 | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 144 insertions(+)
>  create mode 100644 man2/execveat.2
>
> diff --git a/man2/execveat.2 b/man2/execveat.2
> new file mode 100644
> index 000000000000..d19571a3eb9d
> --- /dev/null
> +++ b/man2/execveat.2
> @@ -0,0 +1,144 @@
> +.\" Copyright (c) 2014 Google, Inc.
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +execveat \- execute program relative to a directory file descriptor
> +.SH SYNOPSIS
> +.B #include <unistd.h>
> +.sp
> +.BI "int execve(int " fd ", const char *" pathname ","
            ^^^^^^ That needs to be execveat
> +.br
> +.BI "           char *const " argv "[],  char *const " envp "[],"
> +.br
> +.BI "           int " flags);
> +.SH DESCRIPTION
> +The
> +.BR execveat ()
> +system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
> +The
> +.BR execveat ()
> +system call operates in exactly the same way as
> +.BR execve (2),
> +except for the differences described in this manual page.
> +
> +If the pathname given in
> +.I pathname
> +is relative, then it is interpreted relative to the directory
> +referred to by the file descriptor
> +.I fd
> +(rather than relative to the current working directory of
> +the calling process, as is done by
> +.BR execve (2)
> +for a relative pathname).
> +
> +If
> +.I pathname
> +is relative and
> +.I fd
> +is the special value
> +.BR AT_FDCWD ,
> +then
> +.I pathname
> +is interpreted relative to the current working
> +directory of the calling process (like
> +.BR execve (2)).
> +
> +If
> +.I pathname
> +is absolute, then
> +.I fd
> +is ignored.
> +
> +If
> +.I pathname
> +is an empty string and the
> +.BR AT_EMPTY_PATH
> +flag is specified, then the file descriptor
> +.I fd
> +specifies the file to be executed.
> +
> +.I flags
> +can either be 0, or include the following flags:
> +.TP
> +.BR AT_EMPTY_PATH
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR fd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag).
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If the file identified by
> +.I fd
> +and a non-NULL
> +.I pathname
> +is a symbolic link, then the call fails with the error
> +.BR EINVAL .
> +.SH "RETURN VALUE"
> +On success,
> +.BR execveat ()
> +does not return. On error \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +The same errors that occur for
> +.BR execve (2)
> +can also occur for
> +.BR execveat ().
> +The following additional errors can occur for
> +.BR execveat ():
> +.TP
> +.B EBADF
> +.I fd
> +is not a valid file descriptor.
> +.TP
> +.B EINVAL
> +Invalid flag specified in
> +.IR flags .
> +.TP
> +.B ENOTDIR
> +.I pathname
> +is relative and
> +.I fd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR execveat ()
> +was added to Linux in kernel 3.???.
> +.SH NOTES
> +In addition to the reasons explained in
> +.BR openat (2),
> +the
> +.BR execveat ()
> +system call is also needed to allow
> +.BR fexecve (3)
> +to be implemented on systems that do not have the
> +.I /proc
> +filesystem mounted.
> +.SH SEE ALSO
> +.BR execve (2),
> +.BR fexecve (3)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-22 11:44 ` [PATCHv5 1/3] syscalls,x86: implement " David Drysdale
@ 2014-10-22 18:07   ` Eric W. Biederman
  2014-10-23  6:40     ` David Drysdale
  2014-10-22 18:44   ` Andy Lutomirski
  1 sibling, 1 reply; 13+ messages in thread
From: Eric W. Biederman @ 2014-10-22 18:07 UTC (permalink / raw)
  To: David Drysdale
  Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, x86, linux-arch, linux-api

David Drysdale <drysdale@google.com> writes:

> Add a new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>
> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
> execveat() executes the file to which the file descriptor refers. This
> replicates the functionality of fexecve(), which is a system call in
> other UNIXen, but in Linux glibc it depends on opening
> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>
> The filename fed to the executed program as argv[0] (or the name of the
> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
> reflecting how the executable was found.  This does however mean that
> execution of a script in a /proc-less environment won't work.
>
> Only x86-64, i386 and x32 ABIs are supported in this patch.
>
> Based on patches by Meredydd Luff <meredydd@senatehouse.org>
>
> Signed-off-by: David Drysdale <drysdale@google.com>
> ---
>  arch/x86/ia32/audit.c             |   1 +
>  arch/x86/ia32/ia32entry.S         |   1 +
>  arch/x86/kernel/audit_64.c        |   1 +
>  arch/x86/kernel/entry_64.S        |  28 ++++++++
>  arch/x86/syscalls/syscall_32.tbl  |   1 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  arch/x86/um/sys_call_table_64.c   |   1 +
>  fs/exec.c                         | 130 ++++++++++++++++++++++++++++++++++----
>  fs/namei.c                        |   2 +-
>  include/linux/compat.h            |   3 +
>  include/linux/fs.h                |   1 +
>  include/linux/sched.h             |   4 ++
>  include/linux/syscalls.h          |   4 ++
>  include/uapi/asm-generic/unistd.h |   4 +-
>  kernel/sys_ni.c                   |   3 +
>  lib/audit.c                       |   3 +
>  16 files changed, 173 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
> index 5d7b381da692..2eccc8932ae6 100644
> --- a/arch/x86/ia32/audit.c
> +++ b/arch/x86/ia32/audit.c
> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
>  	case __NR_socketcall:
>  		return 4;
>  	case __NR_execve:
> +	case __NR_execveat:
>  		return 5;
>  	default:
>  		return 1;
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 4299eb05023c..2516c09743e0 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -464,6 +464,7 @@ GLOBAL(\label)
>  	PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
>  	PTREGSCALL stub32_sigreturn, sys32_sigreturn
>  	PTREGSCALL stub32_execve, compat_sys_execve
> +	PTREGSCALL stub32_execveat, compat_sys_execveat
>  	PTREGSCALL stub32_fork, sys_fork
>  	PTREGSCALL stub32_vfork, sys_vfork
>  
> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
> index 06d3e5a14d9d..f3672508b249 100644
> --- a/arch/x86/kernel/audit_64.c
> +++ b/arch/x86/kernel/audit_64.c
> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
>  	case __NR_openat:
>  		return 3;
>  	case __NR_execve:
> +	case __NR_execveat:
>  		return 5;
>  	default:
>  		return 0;
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 2fac1343a90b..00c4526e6ffe 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
>  	CFI_ENDPROC
>  END(stub_execve)
>  
> +ENTRY(stub_execveat)
> +	CFI_STARTPROC
> +	addq $8, %rsp
> +	PARTIAL_FRAME 0
> +	SAVE_REST
> +	FIXUP_TOP_OF_STACK %r11
> +	call sys_execveat
> +	RESTORE_TOP_OF_STACK %r11
> +	movq %rax,RAX(%rsp)
> +	RESTORE_REST
> +	jmp int_ret_from_sys_call
> +	CFI_ENDPROC
> +END(stub_execveat)
> +
>  /*
>   * sigreturn is special because it needs to restore all registers on return.
>   * This cannot be done with SYSRET, so use the IRET return path instead.
> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
>  	CFI_ENDPROC
>  END(stub_x32_execve)
>  
> +ENTRY(stub_x32_execveat)
> +	CFI_STARTPROC
> +	addq $8, %rsp
> +	PARTIAL_FRAME 0
> +	SAVE_REST
> +	FIXUP_TOP_OF_STACK %r11
> +	call compat_sys_execveat
> +	RESTORE_TOP_OF_STACK %r11
> +	movq %rax,RAX(%rsp)
> +	RESTORE_REST
> +	jmp int_ret_from_sys_call
> +	CFI_ENDPROC
> +END(stub_x32_execveat)
> +
>  #endif
>  
>  /*
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 028b78168d85..2633e3195455 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -363,3 +363,4 @@
>  354	i386	seccomp			sys_seccomp
>  355	i386	getrandom		sys_getrandom
>  356	i386	memfd_create		sys_memfd_create
> +357	i386	execveat		sys_execveat			stub32_execveat
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 35dd922727b9..1af5badd159c 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -327,6 +327,7 @@
>  318	common	getrandom		sys_getrandom
>  319	common	memfd_create		sys_memfd_create
>  320	common	kexec_file_load		sys_kexec_file_load
> +321	64	execveat		stub_execveat
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -365,3 +366,4 @@
>  542	x32	getsockopt		compat_sys_getsockopt
>  543	x32	io_setup		compat_sys_io_setup
>  544	x32	io_submit		compat_sys_io_submit
> +545	x32	execveat		stub_x32_execveat
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index f2f0723070ca..20c3649d0691 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -31,6 +31,7 @@
>  #define stub_fork sys_fork
>  #define stub_vfork sys_vfork
>  #define stub_execve sys_execve
> +#define stub_execveat sys_execveat
>  #define stub_rt_sigreturn sys_rt_sigreturn
>  
>  #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
> diff --git a/fs/exec.c b/fs/exec.c
> index a2b42a98c743..92a6e14f096a 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
>  
>  #endif /* CONFIG_MMU */
>  
> -static struct file *do_open_exec(struct filename *name)
> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
>  {
>  	struct file *file;
>  	int err;
> @@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
>  		.intent = LOOKUP_OPEN,
>  		.lookup_flags = LOOKUP_FOLLOW,
>  	};
> +	static const struct open_flags open_exec_nofollow_flags = {
> +		.open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> +		.acc_mode = MAY_EXEC | MAY_OPEN,
> +		.intent = LOOKUP_OPEN,
> +		.lookup_flags = 0,
> +	};
>  
> -	file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
> -	if (IS_ERR(file))
> -		goto out;
> +	if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (name->name[0] != '\0') {

Is it really necessary to special case AT_EMPTY_PATH here.  I would
have thought the existing logic in namei.c would have been fine
assuning we passed LOOKUP_EMPTY.

> +		const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
> +						   ? &open_exec_nofollow_flags
> +						   : &open_exec_flags);
> +
> +		file = do_filp_open(fd, name, oflags);
> +		if (IS_ERR(file))
> +			goto out;
> +	} else {
> +		file = fget(fd);
> +		if (!file)
> +			return ERR_PTR(-EBADF);
> +
> +		err = inode_permission(file->f_path.dentry->d_inode,
> +				open_exec_flags.acc_mode);
> +		if (err)
> +			goto exit;
> +	}
>
>  	err = -EACCES;
>  	if (!S_ISREG(file_inode(file)->i_mode))
> @@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
>  	if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
>  		goto exit;
>  
> -	fsnotify_open(file);
> -
>  	err = deny_write_access(file);
>  	if (err)
>  		goto exit;
>  
> +	if (name->name[0] != '\0')
> +		fsnotify_open(file);
> +
>  out:
>  	return file;
>  
> @@ -786,7 +811,7 @@ exit:
>  struct file *open_exec(const char *name)
>  {
>  	struct filename tmp = { .name = name };
> -	return do_open_exec(&tmp);
> +	return do_open_execat(AT_FDCWD, &tmp, 0);
>  }
>  EXPORT_SYMBOL(open_exec);
>  
> @@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
>  /*
>   * sys_execve() executes a new program.
>   */
> -static int do_execve_common(struct filename *filename,
> -				struct user_arg_ptr argv,
> -				struct user_arg_ptr envp)
> +static int do_execveat_common(int fd, struct filename *filename,
> +			      struct user_arg_ptr argv,
> +			      struct user_arg_ptr envp,
> +			      int flags)
>  {
> +	char *pathbuf = NULL;
>  	struct linux_binprm *bprm;
>  	struct file *file;
>  	struct files_struct *displaced;
> @@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
>  	check_unsafe_exec(bprm);
>  	current->in_execve = 1;
>  
> -	file = do_open_exec(filename);
> +	file = do_open_execat(fd, filename, flags);
>  	retval = PTR_ERR(file);
>  	if (IS_ERR(file))
>  		goto out_unmark;
> @@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
>  	sched_exec();
>  
>  	bprm->file = file;
> -	bprm->filename = bprm->interp = filename->name;
> +	if (fd == AT_FDCWD || filename->name[0] == '/') {
> +		bprm->filename = filename->name;
> +	} else {
> +		/*
> +		 * Build a pathname that reflects how we got to the file,
> +		 * either "/dev/fd/<fd>" (for an empty filename) or
> +		 * "/dev/fd/<fd>/<filename>".
> +		 */
> +		pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> +		if (!pathbuf) {
> +			retval = -ENOMEM;
> +			goto out_unmark;
> +		}
> +		bprm->filename = pathbuf;
> +		if (filename->name[0] == '\0')
> +			sprintf(pathbuf, "/dev/fd/%d", fd);
> +		else
> +			snprintf(pathbuf, PATH_MAX,
> +				 "/dev/fd/%d/%s", fd, filename->name);
> +	}
> +	bprm->interp = bprm->filename;
>  
>  	retval = bprm_mm_init(bprm);
>  	if (retval)
> @@ -1532,6 +1579,7 @@ out_unmark:
>  
>  out_free:
>  	free_bprm(bprm);
> +	kfree(pathbuf);
>  
>  out_files:
>  	if (displaced)
> @@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
>  {
>  	struct user_arg_ptr argv = { .ptr.native = __argv };
>  	struct user_arg_ptr envp = { .ptr.native = __envp };
> -	return do_execve_common(filename, argv, envp);
> +	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +int do_execveat(int fd, struct filename *filename,
> +		const char __user *const __user *__argv,
> +		const char __user *const __user *__envp,
> +		int flags)
> +{
> +	struct user_arg_ptr argv = { .ptr.native = __argv };
> +	struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> +	return do_execveat_common(fd, filename, argv, envp, flags);
>  }
>  
>  #ifdef CONFIG_COMPAT
> @@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
>  		.is_compat = true,
>  		.ptr.compat = __envp,
>  	};
> -	return do_execve_common(filename, argv, envp);
> +	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +static int compat_do_execveat(int fd, struct filename *filename,
> +			      const compat_uptr_t __user *__argv,
> +			      const compat_uptr_t __user *__envp,
> +			      int flags)
> +{
> +	struct user_arg_ptr argv = {
> +		.is_compat = true,
> +		.ptr.compat = __argv,
> +	};
> +	struct user_arg_ptr envp = {
> +		.is_compat = true,
> +		.ptr.compat = __envp,
> +	};
> +	return do_execveat_common(fd, filename, argv, envp, flags);
>  }
>  #endif
>  
> @@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
>  {
>  	return do_execve(getname(filename), argv, envp);
>  }
> +
> +SYSCALL_DEFINE5(execveat,
> +		int, fd, const char __user *, filename,
> +		const char __user *const __user *, argv,
> +		const char __user *const __user *, envp,
> +		int, flags)
> +{
> +	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> +	return do_execveat(fd,
> +			   getname_flags(filename, lookup_flags, NULL),
> +			   argv, envp, flags);
> +}
> +
>  #ifdef CONFIG_COMPAT
>  COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>  	const compat_uptr_t __user *, argv,
> @@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>  {
>  	return compat_do_execve(getname(filename), argv, envp);
>  }
> +
> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
> +		       const char __user *, filename,
> +		       const compat_uptr_t __user *, argv,
> +		       const compat_uptr_t __user *, envp,
> +		       int,  flags)
> +{
> +	int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> +	return compat_do_execveat(fd,
> +				  getname_flags(filename, lookup_flags, NULL),
> +				  argv, envp, flags);
> +}
>  #endif
> diff --git a/fs/namei.c b/fs/namei.c
> index a7b05bf82d31..553c84d3e0cc 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>  
>  #define EMBEDDED_NAME_MAX	(PATH_MAX - sizeof(struct filename))
>  
> -static struct filename *
> +struct filename *
>  getname_flags(const char __user *filename, int flags, int *empty)
>  {
>  	struct filename *result, *err;
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index e6494261eaff..7450ca2ac1fc 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>  
>  asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
>  		     const compat_uptr_t __user *envp);
> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
> +		     const compat_uptr_t __user *argv,
> +		     const compat_uptr_t __user *envp, int flags);
>  
>  asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
>  		compat_ulong_t __user *outp, compat_ulong_t __user *exp,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 94187721ad41..e9818574d738 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>  extern struct file * dentry_open(const struct path *, int, const struct cred *);
>  extern int filp_close(struct file *, fl_owner_t id);
>  
> +extern struct filename *getname_flags(const char __user *, int, int *);
>  extern struct filename *getname(const char __user *);
>  extern struct filename *getname_kernel(const char *);
>  
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b867a4dab38a..33e056da7d33 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
>  extern int do_execve(struct filename *,
>  		     const char __user * const __user *,
>  		     const char __user * const __user *);
> +extern int do_execveat(int, struct filename *,
> +		       const char __user * const __user *,
> +		       const char __user * const __user *,
> +		       int);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 0f86d85a9ce4..df5422294deb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
>  			      unsigned int flags);
>  
> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
> +			const char __user *const __user *argv,
> +			const char __user *const __user *envp, int flags);
> +
>  #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 11d11bc5c78f..feef07d29663 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
>  __SYSCALL(__NR_getrandom, sys_getrandom)
>  #define __NR_memfd_create 279
>  __SYSCALL(__NR_memfd_create, sys_memfd_create)
> +#define __NR_execveat 280
> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 280
> +#define __NR_syscalls 281
>  
>  /*
>   * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 391d4ddb6f4b..efb06058ad3e 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>  
>  /* operate on Secure Computing state */
>  cond_syscall(sys_seccomp);
> +
> +/* execveat */
> +cond_syscall(sys_execveat);
> diff --git a/lib/audit.c b/lib/audit.c
> index 1d726a22565b..b8fb5ee81e26 100644
> --- a/lib/audit.c
> +++ b/lib/audit.c
> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
>  	case __NR_socketcall:
>  		return 4;
>  #endif
> +#ifdef __NR_execveat
> +	case __NR_execveat:
> +#endif
>  	case __NR_execve:
>  		return 5;
>  	default:

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-22 11:44 ` [PATCHv5 1/3] syscalls,x86: implement " David Drysdale
  2014-10-22 18:07   ` Eric W. Biederman
@ 2014-10-22 18:44   ` Andy Lutomirski
  2014-10-27 18:03     ` David Drysdale
  1 sibling, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2014-10-22 18:44 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <drysdale@google.com> wrote:
> Add a new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>

>         bprm->file = file;
> -       bprm->filename = bprm->interp = filename->name;
> +       if (fd == AT_FDCWD || filename->name[0] == '/') {
> +               bprm->filename = filename->name;
> +       } else {
> +               /*
> +                * Build a pathname that reflects how we got to the file,
> +                * either "/dev/fd/<fd>" (for an empty filename) or
> +                * "/dev/fd/<fd>/<filename>".
> +                */
> +               pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> +               if (!pathbuf) {
> +                       retval = -ENOMEM;
> +                       goto out_unmark;
> +               }
> +               bprm->filename = pathbuf;
> +               if (filename->name[0] == '\0')
> +                       sprintf(pathbuf, "/dev/fd/%d", fd);

If the fd is O_CLOEXEC, then this will result in a confused child
process.  Should we fail exec attempts like that for non-static
programs?  (E.g. set filename to "" or something and fix up the binfmt
drivers to handle that?)

> +               else
> +                       snprintf(pathbuf, PATH_MAX,
> +                                "/dev/fd/%d/%s", fd, filename->name);

Does this need to handle the case where the result exceeds PATH_MAX?

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)
  2014-10-22 17:55   ` Eric W. Biederman
@ 2014-10-22 20:13     ` David Drysdale
  0 siblings, 0 replies; 13+ messages in thread
From: David Drysdale @ 2014-10-22 20:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

On Wed, Oct 22, 2014 at 6:55 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Drysdale <drysdale@google.com> writes:
>> +.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +execveat \- execute program relative to a directory file descriptor
>> +.SH SYNOPSIS
>> +.B #include <unistd.h>
>> +.sp
>> +.BI "int execve(int " fd ", const char *" pathname ","
>             ^^^^^^ That needs to be execveat

Ah yes, so it does -- thanks for spotting.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-22 18:07   ` Eric W. Biederman
@ 2014-10-23  6:40     ` David Drysdale
  0 siblings, 0 replies; 13+ messages in thread
From: David Drysdale @ 2014-10-23  6:40 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

On Wed, Oct 22, 2014 at 7:07 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> David Drysdale <drysdale@google.com> writes:
>
>> Add a new system execveat(2) syscall. execveat() is to execve() as
>> openat() is to open(): it takes a file descriptor that refers to a
>> directory, and resolves the filename relative to that.
>>
>> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
>> execveat() executes the file to which the file descriptor refers. This
>> replicates the functionality of fexecve(), which is a system call in
>> other UNIXen, but in Linux glibc it depends on opening
>> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>>
>> The filename fed to the executed program as argv[0] (or the name of the
>> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>> reflecting how the executable was found.  This does however mean that
>> execution of a script in a /proc-less environment won't work.
>>
>> Only x86-64, i386 and x32 ABIs are supported in this patch.
>>
>> Based on patches by Meredydd Luff <meredydd@senatehouse.org>
>>
>> Signed-off-by: David Drysdale <drysdale@google.com>
>> ---
>>  arch/x86/ia32/audit.c             |   1 +
>>  arch/x86/ia32/ia32entry.S         |   1 +
>>  arch/x86/kernel/audit_64.c        |   1 +
>>  arch/x86/kernel/entry_64.S        |  28 ++++++++
>>  arch/x86/syscalls/syscall_32.tbl  |   1 +
>>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>>  arch/x86/um/sys_call_table_64.c   |   1 +
>>  fs/exec.c                         | 130 ++++++++++++++++++++++++++++++++++----
>>  fs/namei.c                        |   2 +-
>>  include/linux/compat.h            |   3 +
>>  include/linux/fs.h                |   1 +
>>  include/linux/sched.h             |   4 ++
>>  include/linux/syscalls.h          |   4 ++
>>  include/uapi/asm-generic/unistd.h |   4 +-
>>  kernel/sys_ni.c                   |   3 +
>>  lib/audit.c                       |   3 +
>>  16 files changed, 173 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
>> index 5d7b381da692..2eccc8932ae6 100644
>> --- a/arch/x86/ia32/audit.c
>> +++ b/arch/x86/ia32/audit.c
>> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
>>       case __NR_socketcall:
>>               return 4;
>>       case __NR_execve:
>> +     case __NR_execveat:
>>               return 5;
>>       default:
>>               return 1;
>> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> index 4299eb05023c..2516c09743e0 100644
>> --- a/arch/x86/ia32/ia32entry.S
>> +++ b/arch/x86/ia32/ia32entry.S
>> @@ -464,6 +464,7 @@ GLOBAL(\label)
>>       PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
>>       PTREGSCALL stub32_sigreturn, sys32_sigreturn
>>       PTREGSCALL stub32_execve, compat_sys_execve
>> +     PTREGSCALL stub32_execveat, compat_sys_execveat
>>       PTREGSCALL stub32_fork, sys_fork
>>       PTREGSCALL stub32_vfork, sys_vfork
>>
>> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
>> index 06d3e5a14d9d..f3672508b249 100644
>> --- a/arch/x86/kernel/audit_64.c
>> +++ b/arch/x86/kernel/audit_64.c
>> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
>>       case __NR_openat:
>>               return 3;
>>       case __NR_execve:
>> +     case __NR_execveat:
>>               return 5;
>>       default:
>>               return 0;
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 2fac1343a90b..00c4526e6ffe 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
>>       CFI_ENDPROC
>>  END(stub_execve)
>>
>> +ENTRY(stub_execveat)
>> +     CFI_STARTPROC
>> +     addq $8, %rsp
>> +     PARTIAL_FRAME 0
>> +     SAVE_REST
>> +     FIXUP_TOP_OF_STACK %r11
>> +     call sys_execveat
>> +     RESTORE_TOP_OF_STACK %r11
>> +     movq %rax,RAX(%rsp)
>> +     RESTORE_REST
>> +     jmp int_ret_from_sys_call
>> +     CFI_ENDPROC
>> +END(stub_execveat)
>> +
>>  /*
>>   * sigreturn is special because it needs to restore all registers on return.
>>   * This cannot be done with SYSRET, so use the IRET return path instead.
>> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
>>       CFI_ENDPROC
>>  END(stub_x32_execve)
>>
>> +ENTRY(stub_x32_execveat)
>> +     CFI_STARTPROC
>> +     addq $8, %rsp
>> +     PARTIAL_FRAME 0
>> +     SAVE_REST
>> +     FIXUP_TOP_OF_STACK %r11
>> +     call compat_sys_execveat
>> +     RESTORE_TOP_OF_STACK %r11
>> +     movq %rax,RAX(%rsp)
>> +     RESTORE_REST
>> +     jmp int_ret_from_sys_call
>> +     CFI_ENDPROC
>> +END(stub_x32_execveat)
>> +
>>  #endif
>>
>>  /*
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index 028b78168d85..2633e3195455 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -363,3 +363,4 @@
>>  354  i386    seccomp                 sys_seccomp
>>  355  i386    getrandom               sys_getrandom
>>  356  i386    memfd_create            sys_memfd_create
>> +357  i386    execveat                sys_execveat                    stub32_execveat
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index 35dd922727b9..1af5badd159c 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -327,6 +327,7 @@
>>  318  common  getrandom               sys_getrandom
>>  319  common  memfd_create            sys_memfd_create
>>  320  common  kexec_file_load         sys_kexec_file_load
>> +321  64      execveat                stub_execveat
>>
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> @@ -365,3 +366,4 @@
>>  542  x32     getsockopt              compat_sys_getsockopt
>>  543  x32     io_setup                compat_sys_io_setup
>>  544  x32     io_submit               compat_sys_io_submit
>> +545  x32     execveat                stub_x32_execveat
>> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
>> index f2f0723070ca..20c3649d0691 100644
>> --- a/arch/x86/um/sys_call_table_64.c
>> +++ b/arch/x86/um/sys_call_table_64.c
>> @@ -31,6 +31,7 @@
>>  #define stub_fork sys_fork
>>  #define stub_vfork sys_vfork
>>  #define stub_execve sys_execve
>> +#define stub_execveat sys_execveat
>>  #define stub_rt_sigreturn sys_rt_sigreturn
>>
>>  #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
>> diff --git a/fs/exec.c b/fs/exec.c
>> index a2b42a98c743..92a6e14f096a 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
>>
>>  #endif /* CONFIG_MMU */
>>
>> -static struct file *do_open_exec(struct filename *name)
>> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
>>  {
>>       struct file *file;
>>       int err;
>> @@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
>>               .intent = LOOKUP_OPEN,
>>               .lookup_flags = LOOKUP_FOLLOW,
>>       };
>> +     static const struct open_flags open_exec_nofollow_flags = {
>> +             .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
>> +             .acc_mode = MAY_EXEC | MAY_OPEN,
>> +             .intent = LOOKUP_OPEN,
>> +             .lookup_flags = 0,
>> +     };
>>
>> -     file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
>> -     if (IS_ERR(file))
>> -             goto out;
>> +     if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     if (name->name[0] != '\0') {
>
> Is it really necessary to special case AT_EMPTY_PATH here.  I would
> have thought the existing logic in namei.c would have been fine
> assuning we passed LOOKUP_EMPTY.

Just using do_filp_open() throughout looks mostly plausible on a quick
experiment, but my initial version appears to make O_PATH fds unexpectedly
fexecve()-able (I'm glad I had a test case for that).

I'll look for a way around that, hopefully without an explicit special case.

>> +             const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
>> +                                                ? &open_exec_nofollow_flags
>> +                                                : &open_exec_flags);
>> +
>> +             file = do_filp_open(fd, name, oflags);
>> +             if (IS_ERR(file))
>> +                     goto out;
>> +     } else {
>> +             file = fget(fd);
>> +             if (!file)
>> +                     return ERR_PTR(-EBADF);
>> +
>> +             err = inode_permission(file->f_path.dentry->d_inode,
>> +                             open_exec_flags.acc_mode);
>> +             if (err)
>> +                     goto exit;
>> +     }
>>
>>       err = -EACCES;
>>       if (!S_ISREG(file_inode(file)->i_mode))
>> @@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
>>       if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
>>               goto exit;
>>
>> -     fsnotify_open(file);
>> -
>>       err = deny_write_access(file);
>>       if (err)
>>               goto exit;
>>
>> +     if (name->name[0] != '\0')
>> +             fsnotify_open(file);
>> +
>>  out:
>>       return file;
>>
>> @@ -786,7 +811,7 @@ exit:
>>  struct file *open_exec(const char *name)
>>  {
>>       struct filename tmp = { .name = name };
>> -     return do_open_exec(&tmp);
>> +     return do_open_execat(AT_FDCWD, &tmp, 0);
>>  }
>>  EXPORT_SYMBOL(open_exec);
>>
>> @@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
>>  /*
>>   * sys_execve() executes a new program.
>>   */
>> -static int do_execve_common(struct filename *filename,
>> -                             struct user_arg_ptr argv,
>> -                             struct user_arg_ptr envp)
>> +static int do_execveat_common(int fd, struct filename *filename,
>> +                           struct user_arg_ptr argv,
>> +                           struct user_arg_ptr envp,
>> +                           int flags)
>>  {
>> +     char *pathbuf = NULL;
>>       struct linux_binprm *bprm;
>>       struct file *file;
>>       struct files_struct *displaced;
>> @@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
>>       check_unsafe_exec(bprm);
>>       current->in_execve = 1;
>>
>> -     file = do_open_exec(filename);
>> +     file = do_open_execat(fd, filename, flags);
>>       retval = PTR_ERR(file);
>>       if (IS_ERR(file))
>>               goto out_unmark;
>> @@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
>>       sched_exec();
>>
>>       bprm->file = file;
>> -     bprm->filename = bprm->interp = filename->name;
>> +     if (fd == AT_FDCWD || filename->name[0] == '/') {
>> +             bprm->filename = filename->name;
>> +     } else {
>> +             /*
>> +              * Build a pathname that reflects how we got to the file,
>> +              * either "/dev/fd/<fd>" (for an empty filename) or
>> +              * "/dev/fd/<fd>/<filename>".
>> +              */
>> +             pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> +             if (!pathbuf) {
>> +                     retval = -ENOMEM;
>> +                     goto out_unmark;
>> +             }
>> +             bprm->filename = pathbuf;
>> +             if (filename->name[0] == '\0')
>> +                     sprintf(pathbuf, "/dev/fd/%d", fd);
>> +             else
>> +                     snprintf(pathbuf, PATH_MAX,
>> +                              "/dev/fd/%d/%s", fd, filename->name);
>> +     }
>> +     bprm->interp = bprm->filename;
>>
>>       retval = bprm_mm_init(bprm);
>>       if (retval)
>> @@ -1532,6 +1579,7 @@ out_unmark:
>>
>>  out_free:
>>       free_bprm(bprm);
>> +     kfree(pathbuf);
>>
>>  out_files:
>>       if (displaced)
>> @@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
>>  {
>>       struct user_arg_ptr argv = { .ptr.native = __argv };
>>       struct user_arg_ptr envp = { .ptr.native = __envp };
>> -     return do_execve_common(filename, argv, envp);
>> +     return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
>> +}
>> +
>> +int do_execveat(int fd, struct filename *filename,
>> +             const char __user *const __user *__argv,
>> +             const char __user *const __user *__envp,
>> +             int flags)
>> +{
>> +     struct user_arg_ptr argv = { .ptr.native = __argv };
>> +     struct user_arg_ptr envp = { .ptr.native = __envp };
>> +
>> +     return do_execveat_common(fd, filename, argv, envp, flags);
>>  }
>>
>>  #ifdef CONFIG_COMPAT
>> @@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
>>               .is_compat = true,
>>               .ptr.compat = __envp,
>>       };
>> -     return do_execve_common(filename, argv, envp);
>> +     return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
>> +}
>> +
>> +static int compat_do_execveat(int fd, struct filename *filename,
>> +                           const compat_uptr_t __user *__argv,
>> +                           const compat_uptr_t __user *__envp,
>> +                           int flags)
>> +{
>> +     struct user_arg_ptr argv = {
>> +             .is_compat = true,
>> +             .ptr.compat = __argv,
>> +     };
>> +     struct user_arg_ptr envp = {
>> +             .is_compat = true,
>> +             .ptr.compat = __envp,
>> +     };
>> +     return do_execveat_common(fd, filename, argv, envp, flags);
>>  }
>>  #endif
>>
>> @@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
>>  {
>>       return do_execve(getname(filename), argv, envp);
>>  }
>> +
>> +SYSCALL_DEFINE5(execveat,
>> +             int, fd, const char __user *, filename,
>> +             const char __user *const __user *, argv,
>> +             const char __user *const __user *, envp,
>> +             int, flags)
>> +{
>> +     int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
>> +
>> +     return do_execveat(fd,
>> +                        getname_flags(filename, lookup_flags, NULL),
>> +                        argv, envp, flags);
>> +}
>> +
>>  #ifdef CONFIG_COMPAT
>>  COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>>       const compat_uptr_t __user *, argv,
>> @@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>>  {
>>       return compat_do_execve(getname(filename), argv, envp);
>>  }
>> +
>> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
>> +                    const char __user *, filename,
>> +                    const compat_uptr_t __user *, argv,
>> +                    const compat_uptr_t __user *, envp,
>> +                    int,  flags)
>> +{
>> +     int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
>> +
>> +     return compat_do_execveat(fd,
>> +                               getname_flags(filename, lookup_flags, NULL),
>> +                               argv, envp, flags);
>> +}
>>  #endif
>> diff --git a/fs/namei.c b/fs/namei.c
>> index a7b05bf82d31..553c84d3e0cc 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>>
>>  #define EMBEDDED_NAME_MAX    (PATH_MAX - sizeof(struct filename))
>>
>> -static struct filename *
>> +struct filename *
>>  getname_flags(const char __user *filename, int flags, int *empty)
>>  {
>>       struct filename *result, *err;
>> diff --git a/include/linux/compat.h b/include/linux/compat.h
>> index e6494261eaff..7450ca2ac1fc 100644
>> --- a/include/linux/compat.h
>> +++ b/include/linux/compat.h
>> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>>
>>  asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
>>                    const compat_uptr_t __user *envp);
>> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
>> +                  const compat_uptr_t __user *argv,
>> +                  const compat_uptr_t __user *envp, int flags);
>>
>>  asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
>>               compat_ulong_t __user *outp, compat_ulong_t __user *exp,
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 94187721ad41..e9818574d738 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>>  extern struct file * dentry_open(const struct path *, int, const struct cred *);
>>  extern int filp_close(struct file *, fl_owner_t id);
>>
>> +extern struct filename *getname_flags(const char __user *, int, int *);
>>  extern struct filename *getname(const char __user *);
>>  extern struct filename *getname_kernel(const char *);
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b867a4dab38a..33e056da7d33 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
>>  extern int do_execve(struct filename *,
>>                    const char __user * const __user *,
>>                    const char __user * const __user *);
>> +extern int do_execveat(int, struct filename *,
>> +                    const char __user * const __user *,
>> +                    const char __user * const __user *,
>> +                    int);
>>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>>  struct task_struct *fork_idle(int);
>>  extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index 0f86d85a9ce4..df5422294deb 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
>>                             unsigned int flags);
>>
>> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
>> +                     const char __user *const __user *argv,
>> +                     const char __user *const __user *envp, int flags);
>> +
>>  #endif
>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>> index 11d11bc5c78f..feef07d29663 100644
>> --- a/include/uapi/asm-generic/unistd.h
>> +++ b/include/uapi/asm-generic/unistd.h
>> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
>>  __SYSCALL(__NR_getrandom, sys_getrandom)
>>  #define __NR_memfd_create 279
>>  __SYSCALL(__NR_memfd_create, sys_memfd_create)
>> +#define __NR_execveat 280
>> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>
>>  #undef __NR_syscalls
>> -#define __NR_syscalls 280
>> +#define __NR_syscalls 281
>>
>>  /*
>>   * All syscalls below here should go away really,
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 391d4ddb6f4b..efb06058ad3e 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>>
>>  /* operate on Secure Computing state */
>>  cond_syscall(sys_seccomp);
>> +
>> +/* execveat */
>> +cond_syscall(sys_execveat);
>> diff --git a/lib/audit.c b/lib/audit.c
>> index 1d726a22565b..b8fb5ee81e26 100644
>> --- a/lib/audit.c
>> +++ b/lib/audit.c
>> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
>>       case __NR_socketcall:
>>               return 4;
>>  #endif
>> +#ifdef __NR_execveat
>> +     case __NR_execveat:
>> +#endif
>>       case __NR_execve:
>>               return 5;
>>       default:

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-22 18:44   ` Andy Lutomirski
@ 2014-10-27 18:03     ` David Drysdale
  2014-10-27 18:47       ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: David Drysdale @ 2014-10-27 18:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <drysdale@google.com> wrote:
>> Add a new system execveat(2) syscall. execveat() is to execve() as
>> openat() is to open(): it takes a file descriptor that refers to a
>> directory, and resolves the filename relative to that.
>>
>
>>         bprm->file = file;
>> -       bprm->filename = bprm->interp = filename->name;
>> +       if (fd == AT_FDCWD || filename->name[0] == '/') {
>> +               bprm->filename = filename->name;
>> +       } else {
>> +               /*
>> +                * Build a pathname that reflects how we got to the file,
>> +                * either "/dev/fd/<fd>" (for an empty filename) or
>> +                * "/dev/fd/<fd>/<filename>".
>> +                */
>> +               pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> +               if (!pathbuf) {
>> +                       retval = -ENOMEM;
>> +                       goto out_unmark;
>> +               }
>> +               bprm->filename = pathbuf;
>> +               if (filename->name[0] == '\0')
>> +                       sprintf(pathbuf, "/dev/fd/%d", fd);
>
> If the fd is O_CLOEXEC, then this will result in a confused child
> process.  Should we fail exec attempts like that for non-static
> programs?  (E.g. set filename to "" or something and fix up the binfmt
> drivers to handle that?)

Isn't it just scripts that get confused here (as normal executables don't
get to see brpm->filename)?

Given that we don't know which we have at this point, I'd suggest
carrying on regardless.  Or we could fall back to use the previous
best-effort d_path() code for O_CLOEXEC fds.  Thoughts?

>> +               else
>> +                       snprintf(pathbuf, PATH_MAX,
>> +                                "/dev/fd/%d/%s", fd, filename->name);
>
> Does this need to handle the case where the result exceeds PATH_MAX?

I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
possibility of failure, but that just defers the inevitable -- the interpreter
won't be able to open the script file anyway.  But it would at least then
generate the appropriate error (ENAMETOOLONG rather than ENOENT).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-27 18:03     ` David Drysdale
@ 2014-10-27 18:47       ` Andy Lutomirski
  2014-10-28 17:30         ` David Drysdale
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2014-10-27 18:47 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

On Mon, Oct 27, 2014 at 11:03 AM, David Drysdale <drysdale@google.com> wrote:
> On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <drysdale@google.com> wrote:
>>> Add a new system execveat(2) syscall. execveat() is to execve() as
>>> openat() is to open(): it takes a file descriptor that refers to a
>>> directory, and resolves the filename relative to that.
>>>
>>
>>>         bprm->file = file;
>>> -       bprm->filename = bprm->interp = filename->name;
>>> +       if (fd == AT_FDCWD || filename->name[0] == '/') {
>>> +               bprm->filename = filename->name;
>>> +       } else {
>>> +               /*
>>> +                * Build a pathname that reflects how we got to the file,
>>> +                * either "/dev/fd/<fd>" (for an empty filename) or
>>> +                * "/dev/fd/<fd>/<filename>".
>>> +                */
>>> +               pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>> +               if (!pathbuf) {
>>> +                       retval = -ENOMEM;
>>> +                       goto out_unmark;
>>> +               }
>>> +               bprm->filename = pathbuf;
>>> +               if (filename->name[0] == '\0')
>>> +                       sprintf(pathbuf, "/dev/fd/%d", fd);
>>
>> If the fd is O_CLOEXEC, then this will result in a confused child
>> process.  Should we fail exec attempts like that for non-static
>> programs?  (E.g. set filename to "" or something and fix up the binfmt
>> drivers to handle that?)
>
> Isn't it just scripts that get confused here (as normal executables don't
> get to see brpm->filename)?
>
> Given that we don't know which we have at this point, I'd suggest
> carrying on regardless.  Or we could fall back to use the previous
> best-effort d_path() code for O_CLOEXEC fds.  Thoughts?

How hard would it be to mark the bprm as not having a path for the
binary?  Then we could fail later on if and when we actually need the
path.

I don't really have a strong opinion here, though.  I do prefer
actually failing the execveat call over succeeding but invoking a
script interpreter than can't possibly work.

>
>>> +               else
>>> +                       snprintf(pathbuf, PATH_MAX,
>>> +                                "/dev/fd/%d/%s", fd, filename->name);
>>
>> Does this need to handle the case where the result exceeds PATH_MAX?
>
> I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
> possibility of failure, but that just defers the inevitable -- the interpreter
> won't be able to open the script file anyway.  But it would at least then
> generate the appropriate error (ENAMETOOLONG rather than ENOENT).

Depends whether anyone cares about bprm->filename.  But I think the
code should either return an error or allocate enough space.


-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-27 18:47       ` Andy Lutomirski
@ 2014-10-28 17:30         ` David Drysdale
  2014-10-28 17:48           ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: David Drysdale @ 2014-10-28 17:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

[Oops, re-send remembering to turn on plaintext mode -- sorry]

On Mon, Oct 27, 2014 at 6:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Oct 27, 2014 at 11:03 AM, David Drysdale <drysdale@google.com> wrote:
>> On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <drysdale@google.com> wrote:
>>>> Add a new system execveat(2) syscall. execveat() is to execve() as
>>>> openat() is to open(): it takes a file descriptor that refers to a
>>>> directory, and resolves the filename relative to that.
>>>>
>>>
>>>>         bprm->file = file;
>>>> -       bprm->filename = bprm->interp = filename->name;
>>>> +       if (fd == AT_FDCWD || filename->name[0] == '/') {
>>>> +               bprm->filename = filename->name;
>>>> +       } else {
>>>> +               /*
>>>> +                * Build a pathname that reflects how we got to the file,
>>>> +                * either "/dev/fd/<fd>" (for an empty filename) or
>>>> +                * "/dev/fd/<fd>/<filename>".
>>>> +                */
>>>> +               pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>>> +               if (!pathbuf) {
>>>> +                       retval = -ENOMEM;
>>>> +                       goto out_unmark;
>>>> +               }
>>>> +               bprm->filename = pathbuf;
>>>> +               if (filename->name[0] == '\0')
>>>> +                       sprintf(pathbuf, "/dev/fd/%d", fd);
>>>
>>> If the fd is O_CLOEXEC, then this will result in a confused child
>>> process.  Should we fail exec attempts like that for non-static
>>> programs?  (E.g. set filename to "" or something and fix up the binfmt
>>> drivers to handle that?)
>>
>> Isn't it just scripts that get confused here (as normal executables don't
>> get to see brpm->filename)?
>>
>> Given that we don't know which we have at this point, I'd suggest
>> carrying on regardless.  Or we could fall back to use the previous
>> best-effort d_path() code for O_CLOEXEC fds.  Thoughts?
>
> How hard would it be to mark the bprm as not having a path for the
> binary?  Then we could fail later on if and when we actually need the
> path.

Adding a flag to bprm->interp_flags to indicate that the bprm->filename
will be inaccessible after exec is straightforward.  But I'm not sure who
should/could make use of the flag...

> I don't really have a strong opinion here, though.  I do prefer
> actually failing the execveat call over succeeding but invoking a
> script interpreter than can't possibly work.

Yeah, but that involves the kernel code (e.g. fs/binfmt_script.c) making
an assumption about what the interpreter is going to do -- specifically
that it's going to try to open its argv[1].  Admittedly, that's a very likely
assumption, but I'm not sure it's one the kernel should make -- a script
like "#!/bin/echo" wouldn't be very useful, but fexecve()ing it would still
work OK on a name like "/dev/fd/7" after fd 7 is closed.

(Also, we need some kind of non-empty name in bprm->filename,
even if it's going to be inaccessible later, so that any LSM processing
off of the bprm_set_creds()/bprm_check_security() hooks has something
to work with; those hooks are pre-exec so the "/dev/fd/<fd>" part should
still be OK at that point.)

So I guess I lean towards keeping "/dev/fd/<fd>/<path>" regardless.

>>
>>>> +               else
>>>> +                       snprintf(pathbuf, PATH_MAX,
>>>> +                                "/dev/fd/%d/%s", fd, filename->name);
>>>
>>> Does this need to handle the case where the result exceeds PATH_MAX?
>>
>> I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
>> possibility of failure, but that just defers the inevitable -- the interpreter
>> won't be able to open the script file anyway.  But it would at least then
>> generate the appropriate error (ENAMETOOLONG rather than ENOENT).
>
> Depends whether anyone cares about bprm->filename.  But I think the
> code should either return an error or allocate enough space.

I'll allocate enough space.

>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call
  2014-10-28 17:30         ` David Drysdale
@ 2014-10-28 17:48           ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2014-10-28 17:48 UTC (permalink / raw)
  To: David Drysdale
  Cc: Eric W. Biederman, Alexander Viro, Meredydd Luff, linux-kernel,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
	Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API

On Tue, Oct 28, 2014 at 10:30 AM, David Drysdale <drysdale@google.com> wrote:
> [Oops, re-send remembering to turn on plaintext mode -- sorry]
>
> On Mon, Oct 27, 2014 at 6:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Mon, Oct 27, 2014 at 11:03 AM, David Drysdale <drysdale@google.com> wrote:
>>> On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <drysdale@google.com> wrote:
>>>>> Add a new system execveat(2) syscall. execveat() is to execve() as
>>>>> openat() is to open(): it takes a file descriptor that refers to a
>>>>> directory, and resolves the filename relative to that.
>>>>>
>>>>
>>>>>         bprm->file = file;
>>>>> -       bprm->filename = bprm->interp = filename->name;
>>>>> +       if (fd == AT_FDCWD || filename->name[0] == '/') {
>>>>> +               bprm->filename = filename->name;
>>>>> +       } else {
>>>>> +               /*
>>>>> +                * Build a pathname that reflects how we got to the file,
>>>>> +                * either "/dev/fd/<fd>" (for an empty filename) or
>>>>> +                * "/dev/fd/<fd>/<filename>".
>>>>> +                */
>>>>> +               pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>>>> +               if (!pathbuf) {
>>>>> +                       retval = -ENOMEM;
>>>>> +                       goto out_unmark;
>>>>> +               }
>>>>> +               bprm->filename = pathbuf;
>>>>> +               if (filename->name[0] == '\0')
>>>>> +                       sprintf(pathbuf, "/dev/fd/%d", fd);
>>>>
>>>> If the fd is O_CLOEXEC, then this will result in a confused child
>>>> process.  Should we fail exec attempts like that for non-static
>>>> programs?  (E.g. set filename to "" or something and fix up the binfmt
>>>> drivers to handle that?)
>>>
>>> Isn't it just scripts that get confused here (as normal executables don't
>>> get to see brpm->filename)?
>>>
>>> Given that we don't know which we have at this point, I'd suggest
>>> carrying on regardless.  Or we could fall back to use the previous
>>> best-effort d_path() code for O_CLOEXEC fds.  Thoughts?
>>
>> How hard would it be to mark the bprm as not having a path for the
>> binary?  Then we could fail later on if and when we actually need the
>> path.
>
> Adding a flag to bprm->interp_flags to indicate that the bprm->filename
> will be inaccessible after exec is straightforward.  But I'm not sure who
> should/could make use of the flag...
>
>> I don't really have a strong opinion here, though.  I do prefer
>> actually failing the execveat call over succeeding but invoking a
>> script interpreter than can't possibly work.
>
> Yeah, but that involves the kernel code (e.g. fs/binfmt_script.c) making
> an assumption about what the interpreter is going to do -- specifically
> that it's going to try to open its argv[1].  Admittedly, that's a very likely
> assumption, but I'm not sure it's one the kernel should make -- a script
> like "#!/bin/echo" wouldn't be very useful, but fexecve()ing it would still
> work OK on a name like "/dev/fd/7" after fd 7 is closed.

Hmm.  I'm unconvinced.  If an important part of executing the script
is passing it an argv[0] that can be opened, then I think we shouldn't
allow a known-bad argv[0].

>
> (Also, we need some kind of non-empty name in bprm->filename,
> even if it's going to be inaccessible later, so that any LSM processing
> off of the bprm_set_creds()/bprm_check_security() hooks has something
> to work with; those hooks are pre-exec so the "/dev/fd/<fd>" part should
> still be OK at that point.)
>
> So I guess I lean towards keeping "/dev/fd/<fd>/<path>" regardless.
>
>>>
>>>>> +               else
>>>>> +                       snprintf(pathbuf, PATH_MAX,
>>>>> +                                "/dev/fd/%d/%s", fd, filename->name);
>>>>
>>>> Does this need to handle the case where the result exceeds PATH_MAX?
>>>
>>> I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
>>> possibility of failure, but that just defers the inevitable -- the interpreter
>>> won't be able to open the script file anyway.  But it would at least then
>>> generate the appropriate error (ENAMETOOLONG rather than ENOENT).
>>
>> Depends whether anyone cares about bprm->filename.  But I think the
>> code should either return an error or allocate enough space.
>
> I'll allocate enough space.
>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-10-28 17:48 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-22 11:44 [PATCHv5 0/3] syscalls,x86: Add execveat() system call David Drysdale
2014-10-22 11:44 ` [PATCHv5 1/3] syscalls,x86: implement " David Drysdale
2014-10-22 18:07   ` Eric W. Biederman
2014-10-23  6:40     ` David Drysdale
2014-10-22 18:44   ` Andy Lutomirski
2014-10-27 18:03     ` David Drysdale
2014-10-27 18:47       ` Andy Lutomirski
2014-10-28 17:30         ` David Drysdale
2014-10-28 17:48           ` Andy Lutomirski
2014-10-22 11:44 ` [PATCHv5 2/3] syscalls,x86: add selftest for execveat(2) David Drysdale
2014-10-22 11:44 ` [PATCHv5 man-pages 3/3] execveat.2: initial man page " David Drysdale
2014-10-22 17:55   ` Eric W. Biederman
2014-10-22 20:13     ` David Drysdale

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).