linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v8 0/3] Add pidfd_getfd syscall
@ 2020-01-03 16:29 Sargun Dhillon
  2020-01-03 16:29 ` [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper Sargun Dhillon
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Sargun Dhillon @ 2020-01-03 16:29 UTC (permalink / raw)
  To: linux-kernel, containers, linux-api, linux-fsdevel
  Cc: Sargun Dhillon, tycho, jannh, cyphar, christian.brauner, oleg,
	luto, viro, gpascutto, ealvarez, fweimer, jld, arnd

This patchset introduces a mechanism (pidfd_getfd syscall) to get file
descriptors from other processes via pidfd. Although this can be achieved
using SCM_RIGHTS, and parasitic code injection, this offers a more
straightforward mechanism, with less overhead and complexity. The process
under manipulation's fd still remains valid, and unmodified by the
copy operation.

It introduces a flags field. The flags field is reserved a the moment,
but the intent is to extend it with the following capabilities:
 * Close the remote FD when copying it
 * Drop the cgroup data if it's a fd pointing a socket when copying it

The syscall numbers were chosen to be one greater than openat2.

Summary of history:
This initially started as a ptrace command. It did not require the process
to be stopped, and felt like kind of an awkward fit for ptrace. After that,
it moved to an ioctl on the pidfd. Given functionality, it made sense to
make it a syscall which did not require the process to be stopped.

Previous versions:
 V7: https://lore.kernel.org/lkml/20191226180227.GA29389@ircssh-2.c.rugged-nimbus-611.internal/
 V6: https://lore.kernel.org/lkml/20191223210823.GA25083@ircssh-2.c.rugged-nimbus-611.internal/
 V5: https://lore.kernel.org/lkml/20191220232746.GA20215@ircssh-2.c.rugged-nimbus-611.internal/
 V4: https://lore.kernel.org/lkml/20191218235310.GA17259@ircssh-2.c.rugged-nimbus-611.internal/
 V3: https://lore.kernel.org/lkml/20191217005842.GA14379@ircssh-2.c.rugged-nimbus-611.internal/
 V2: https://lore.kernel.org/lkml/20191209070446.GA32336@ircssh-2.c.rugged-nimbus-611.internal/
 RFC V1: https://lore.kernel.org/lkml/20191205234450.GA26369@ircssh-2.c.rugged-nimbus-611.internal/

Changes since v7:
 * No longer put security_file_recv at the end, and align with other
   usages of putting it at the end of the file_recv.
 * Rewrite self-tests in kselftest harness.
 * Minor refactoring

Changes since v6:
 * Proper attribution of get_task_file helper
 * Move all types for syscall to int to represent fd

Changes since v5:
 * Drop pidfd_getfd_options struct and replace with a flags field

Changes since v4:
 * Turn into a syscall
 * Move to PTRACE_MODE_ATTACH_REALCREDS from PTRACE_MODE_READ_REALCREDS
 * Remove the sample code. This will come in another patchset, as the
   new self-tests cover all the functionality.

Changes since v3:
 * Add self-test
 * Move to ioctl passing fd directly, versus args struct
 * Shuffle around include files

Changes since v2:
 * Move to ioctl on pidfd instead of ptrace function
 * Add security check before moving file descriptor

Changes since the RFC v1:
 * Introduce a new helper to fs/file.c to fetch a file descriptor from
   any process. It largely uses the code suggested by Oleg, with a few
   changes to fix locking
 * It uses an extensible options struct to supply the FD, and option.
 * I added a sample, using the code from the user-ptrace sample

Sargun Dhillon (3):
  vfs, fdtable: Add get_task_file helper
  pid: Introduce pidfd_getfd syscall
  test: Add test for pidfd getfd

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/file.c                                     |  22 +-
 include/linux/file.h                          |   2 +
 include/linux/syscalls.h                      |   1 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 kernel/pid.c                                  |  90 +++++++
 tools/testing/selftests/pidfd/.gitignore      |   1 +
 tools/testing/selftests/pidfd/Makefile        |   4 +-
 .../selftests/pidfd/pidfd_getfd_test.c        | 227 ++++++++++++++++++
 26 files changed, 365 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidfd_getfd_test.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper
  2020-01-03 16:29 [PATCH v8 0/3] Add pidfd_getfd syscall Sargun Dhillon
@ 2020-01-03 16:29 ` Sargun Dhillon
  2020-01-05 12:47   ` Christian Brauner
  2020-01-03 16:29 ` [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall Sargun Dhillon
  2020-01-03 16:29 ` [PATCH v8 3/3] test: Add test for pidfd getfd Sargun Dhillon
  2 siblings, 1 reply; 13+ messages in thread
From: Sargun Dhillon @ 2020-01-03 16:29 UTC (permalink / raw)
  To: linux-kernel, containers, linux-api, linux-fsdevel
  Cc: Sargun Dhillon, tycho, jannh, cyphar, christian.brauner, oleg,
	luto, viro, gpascutto, ealvarez, fweimer, jld, arnd

This introduces a function which can be used to fetch a file, given an
arbitrary task. As long as the user holds a reference (refcnt) to the
task_struct it is safe to call, and will either return NULL on failure,
or a pointer to the file, with a refcnt.

This patch is based on Oleg Nesterov's (cf. [1]) patch from September
2018.

[1]: Link: https://lore.kernel.org/r/20180915160423.GA31461@redhat.com

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 fs/file.c            | 22 ++++++++++++++++++++--
 include/linux/file.h |  2 ++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 2f4fcf985079..2fc5eeef54a4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -706,9 +706,9 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
+static struct file *__fget_files(struct files_struct *files, unsigned int fd,
+				 fmode_t mask, unsigned int refs)
 {
-	struct files_struct *files = current->files;
 	struct file *file;
 
 	rcu_read_lock();
@@ -729,6 +729,12 @@ static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 	return file;
 }
 
+static inline struct file *__fget(unsigned int fd, fmode_t mask,
+				  unsigned int refs)
+{
+	return __fget_files(current->files, fd, mask, refs);
+}
+
 struct file *fget_many(unsigned int fd, unsigned int refs)
 {
 	return __fget(fd, FMODE_PATH, refs);
@@ -746,6 +752,18 @@ struct file *fget_raw(unsigned int fd)
 }
 EXPORT_SYMBOL(fget_raw);
 
+struct file *fget_task(struct task_struct *task, unsigned int fd)
+{
+	struct file *file = NULL;
+
+	task_lock(task);
+	if (task->files)
+		file = __fget_files(task->files, fd, 0, 1);
+	task_unlock(task);
+
+	return file;
+}
+
 /*
  * Lightweight file lookup - no refcnt increment if fd table isn't shared.
  *
diff --git a/include/linux/file.h b/include/linux/file.h
index 3fcddff56bc4..c6c7b24ea9f7 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -16,6 +16,7 @@ extern void fput(struct file *);
 extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
+struct task_struct;
 struct vfsmount;
 struct dentry;
 struct inode;
@@ -47,6 +48,7 @@ static inline void fdput(struct fd fd)
 extern struct file *fget(unsigned int fd);
 extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
+extern struct file *fget_task(struct task_struct *task, unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
 extern unsigned long __fdget_pos(unsigned int fd);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall
  2020-01-03 16:29 [PATCH v8 0/3] Add pidfd_getfd syscall Sargun Dhillon
  2020-01-03 16:29 ` [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper Sargun Dhillon
@ 2020-01-03 16:29 ` Sargun Dhillon
  2020-01-05 13:30   ` Christian Brauner
  2020-01-17 23:06   ` Matthew Wilcox
  2020-01-03 16:29 ` [PATCH v8 3/3] test: Add test for pidfd getfd Sargun Dhillon
  2 siblings, 2 replies; 13+ messages in thread
From: Sargun Dhillon @ 2020-01-03 16:29 UTC (permalink / raw)
  To: linux-kernel, containers, linux-api, linux-fsdevel
  Cc: Sargun Dhillon, tycho, jannh, cyphar, christian.brauner, oleg,
	luto, viro, gpascutto, ealvarez, fweimer, jld, arnd

This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.

One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.

This also adds the syscall to all architectures at the same time.

/* prototype */
  /* flags is currently reserved and should be set to 0 */
  int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);

/* testing */
Ran self-test suite on x86_64

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  1 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 kernel/pid.c                                | 90 +++++++++++++++++++++
 21 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8e13b0b2928d..82301080f5e7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543	common	fspick				sys_fspick
 544	common	pidfd_open			sys_pidfd_open
 # 545 reserved for clone3
+548	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..ba045e2f3a60 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		436
+#define __NR_compat_syscalls		439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..a8da97a2de41 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..2b11adfc860c 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..44e879e98459 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..7afa00125cc4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index e7c5ab38e403..856d5ba34461 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
 433	n32	fspick				sys_fspick
 434	n32	pidfd_open			sys_pidfd_open
 435	n32	clone3				__sys_clone3
+438	n32	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 13cd66581f3b..2db6075352f3 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
 433	n64	fspick				sys_fspick
 434	n64	pidfd_open			sys_pidfd_open
 435	n64	clone3				__sys_clone3
+438	n64	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 353539ea4140..e9f9d4a9b105 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
 433	o32	fspick				sys_fspick
 434	o32	pidfd_open			sys_pidfd_open
 435	o32	clone3				__sys_clone3
+438	o32	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 285ff516150c..c58c7eb144ca 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3_wrapper
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..707609bfe3ea 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	nospu	clone3				ppc_clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..185cd624face 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433  common	fspick			sys_fspick			sys_fspick
 434  common	pidfd_open		sys_pidfd_open			sys_pidfd_open
 435  common	clone3			sys_clone3			sys_clone3
+438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..88f90895aad8 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..218df6a2326e 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 # 435 reserved for clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 15908eb9b17e..9c3101b65e0f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
 434	i386	pidfd_open		sys_pidfd_open			__ia32_sys_pidfd_open
 435	i386	clone3			sys_clone3			__ia32_sys_clone3
+438	i386	pidfd_getfd		sys_pidfd_getfd			__ia32_sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..cef85db75a62 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
 433	common	fspick			__x64_sys_fspick
 434	common	pidfd_open		__x64_sys_pidfd_open
 435	common	clone3			__x64_sys_clone3/ptregs
+438	common	pidfd_getfd		__x64_sys_pidfd_getfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..ae15183def12 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
 433	common	fspick				sys_fspick
 434	common	pidfd_open			sys_pidfd_open
 435	common	clone3				sys_clone3
+438	common	pidfd_getfd			sys_pidfd_getfd
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2960dedcfde8..5edbc31af51f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1000,6 +1000,7 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
 asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
+asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..d36ec3d645bd 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,9 +850,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
 #endif
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 
 #undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 439
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/pid.c b/kernel/pid.c
index 2278e249141d..0f4ecb57214c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -578,3 +578,93 @@ void __init pid_idr_init(void)
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
 }
+
+static struct file *__pidfd_fget(struct task_struct *task, int fd)
+{
+	struct file *file;
+	int ret;
+
+	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
+		file = fget_task(task, fd);
+	else
+		file = ERR_PTR(-EPERM);
+
+	mutex_unlock(&task->signal->cred_guard_mutex);
+
+	return file ?: ERR_PTR(-EBADF);
+}
+
+static int pidfd_getfd(struct pid *pid, int fd)
+{
+	struct task_struct *task;
+	struct file *file;
+	int ret;
+
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		return -ESRCH;
+
+	file = __pidfd_fget(task, fd);
+	put_task_struct(task);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ret = security_file_receive(file);
+	if (ret) {
+		fput(file);
+		return ret;
+	}
+
+	ret = get_unused_fd_flags(O_CLOEXEC);
+	if (ret < 0)
+		fput(file);
+	else
+		fd_install(ret, file);
+
+	return ret;
+}
+
+/**
+ * sys_pidfd_getfd() - Get a file descriptor from another process
+ *
+ * @pidfd:	the pidfd file descriptor of the process
+ * @fd:		the file descriptor number to get
+ * @flags:	flags on how to get the fd (reserved)
+ *
+ * This syscall gets a copy of a file descriptor from another process
+ * based on the pidfd, and file descriptor number. It requires that
+ * the calling process has the ability to ptrace the process represented
+ * by the pidfd. The process which is having its file descriptor copied
+ * is otherwise unaffected.
+ *
+ * Return: On success, a cloexec file descriptor is returned.
+ *         On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
+		unsigned int, flags)
+{
+	struct pid *pid;
+	struct fd f;
+	int ret;
+
+	/* flags is currently unused - make sure it's unset */
+	if (flags)
+		return -EINVAL;
+
+	f = fdget(pidfd);
+	if (!f.file)
+		return -EBADF;
+
+	pid = pidfd_pid(f.file);
+	if (IS_ERR(pid))
+		ret = PTR_ERR(pid);
+	else
+		ret = pidfd_getfd(pid, fd);
+
+	fdput(f);
+	return ret;
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v8 3/3] test: Add test for pidfd getfd
  2020-01-03 16:29 [PATCH v8 0/3] Add pidfd_getfd syscall Sargun Dhillon
  2020-01-03 16:29 ` [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper Sargun Dhillon
  2020-01-03 16:29 ` [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall Sargun Dhillon
@ 2020-01-03 16:29 ` Sargun Dhillon
  2020-01-05 14:20   ` Christian Brauner
  2 siblings, 1 reply; 13+ messages in thread
From: Sargun Dhillon @ 2020-01-03 16:29 UTC (permalink / raw)
  To: linux-kernel, containers, linux-api, linux-fsdevel
  Cc: Sargun Dhillon, tycho, jannh, cyphar, christian.brauner, oleg,
	luto, viro, gpascutto, ealvarez, fweimer, jld, arnd

The following tests:
  * Fetch FD, and then compare via kcmp
  * Make sure getfd can be blocked by blocking ptrace_may_access
  * Making sure fetching bad FDs fails
  * Make sure trying to set flags to non-zero results in an EINVAL

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
---
 tools/testing/selftests/pidfd/.gitignore      |   1 +
 tools/testing/selftests/pidfd/Makefile        |   4 +-
 .../selftests/pidfd/pidfd_getfd_test.c        | 227 ++++++++++++++++++
 3 files changed, 230 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/pidfd/pidfd_getfd_test.c

diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 8d069490e17b..3a779c084d96 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -2,3 +2,4 @@ pidfd_open_test
 pidfd_poll_test
 pidfd_test
 pidfd_wait
+pidfd_getfd_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 43db1b98e845..2071f7ab5dc9 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -g -I../../../../usr/include/ -pthread
+CFLAGS += -g -I../../../../usr/include/ -pthread -Wall
 
-TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait
+TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait pidfd_getfd_test
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/pidfd/pidfd_getfd_test.c b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
new file mode 100644
index 000000000000..26ca75597812
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
@@ -0,0 +1,227 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <linux/kcmp.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+#include "../kselftest_harness.h"
+
+/*
+ * UNKNOWN_FD is an fd number that should never exist in the child, as it is
+ * used to check the negative case.
+ */
+#define UNKNOWN_FD 111
+
+static int sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
+		    unsigned long idx2)
+{
+	return syscall(__NR_kcmp, pid1, pid2, type, idx1, idx2);
+}
+
+static int sys_pidfd_getfd(int pidfd, int fd, int flags)
+{
+	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
+}
+
+static int sys_memfd_create(const char *name, unsigned int flags)
+{
+	return syscall(__NR_memfd_create, name, flags);
+}
+
+static int __child(int sk, int memfd)
+{
+	int ret;
+	char buf;
+
+	/*
+	 * Ensure we don't leave around a bunch of orphaned children if our
+	 * tests fail.
+	 */
+	ret = prctl(PR_SET_PDEATHSIG, SIGKILL);
+	if (ret) {
+		fprintf(stderr, "%s: Child could not set DEATHSIG\n",
+			strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	ret = send(sk, &memfd, sizeof(memfd), 0);
+	if (ret != sizeof(memfd)) {
+		fprintf(stderr, "%s: Child failed to send fd number\n",
+			strerror(errno));
+		return EXIT_FAILURE;
+	}
+
+	while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
+		if (buf == 'P') {
+			ret = prctl(PR_SET_DUMPABLE, 0);
+			if (ret < 0) {
+				fprintf(stderr,
+					"%s: Child failed to disable ptrace\n",
+					strerror(errno));
+				return EXIT_FAILURE;
+			}
+		} else {
+			fprintf(stderr, "Child received unknown command %c\n",
+				buf);
+			return EXIT_FAILURE;
+		}
+		ret = send(sk, &buf, sizeof(buf), 0);
+		if (ret != 1) {
+			fprintf(stderr, "%s: Child failed to ack\n",
+				strerror(errno));
+			return EXIT_FAILURE;
+		}
+	}
+
+	if (ret < 0) {
+		fprintf(stderr, "%s: Child failed to read from socket\n",
+			strerror(errno));
+	}
+
+	return EXIT_SUCCESS;
+}
+
+static int child(int sk)
+{
+	int memfd, ret;
+
+	memfd = sys_memfd_create("test", 0);
+	if (memfd < 0) {
+		fprintf(stderr, "%s: Child could not create memfd\n",
+			strerror(errno));
+		ret = EXIT_FAILURE;
+	} else {
+		ret = __child(sk, memfd);
+		close(memfd);
+	}
+
+	close(sk);
+	return ret;
+}
+
+FIXTURE(child)
+{
+	pid_t pid;
+	int pidfd, sk, remote_fd;
+};
+
+FIXTURE_SETUP(child)
+{
+	int ret, sk_pair[2];
+
+	ASSERT_EQ(0, socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair))
+	{
+		TH_LOG("%s: failed to create socketpair", strerror(errno));
+	}
+	self->sk = sk_pair[0];
+
+	self->pid = fork();
+	ASSERT_GE(self->pid, 0);
+
+	if (self->pid == 0) {
+		close(sk_pair[0]);
+		exit(child(sk_pair[1]));
+	}
+
+	close(sk_pair[1]);
+
+	self->pidfd = sys_pidfd_open(self->pid, 0);
+	ASSERT_GE(self->pidfd, 0);
+
+	/*
+	 * Wait for the child to complete setup. It'll send the remote memfd's
+	 * number when ready.
+	 */
+	ret = recv(sk_pair[0], &self->remote_fd, sizeof(self->remote_fd), 0);
+	ASSERT_EQ(sizeof(self->remote_fd), ret);
+}
+
+FIXTURE_TEARDOWN(child)
+{
+	int status;
+
+	EXPECT_EQ(0, close(self->pidfd));
+	EXPECT_EQ(0, close(self->sk));
+
+	EXPECT_EQ(waitpid(self->pid, &status, 0), self->pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+TEST_F(child, disable_ptrace)
+{
+	int uid, fd;
+	char c;
+
+	/*
+	 * Turn into nobody if we're root, to avoid CAP_SYS_PTRACE
+	 *
+	 * The tests should run in their own process, so even this test fails,
+	 * it shouldn't result in subsequent tests failing.
+	 */
+	uid = getuid();
+	if (uid == 0)
+		ASSERT_EQ(0, seteuid(USHRT_MAX));
+
+	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
+	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
+
+	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
+	EXPECT_EQ(-1, fd);
+	EXPECT_EQ(EPERM, errno);
+
+	if (uid == 0)
+		ASSERT_EQ(0, seteuid(0));
+}
+
+TEST_F(child, fetch_fd)
+{
+	int fd, ret;
+
+	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
+	ASSERT_GE(fd, 0);
+
+	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
+
+	ret = fcntl(fd, F_GETFD);
+	ASSERT_GE(ret, 0);
+	EXPECT_GE(ret & FD_CLOEXEC, 0);
+
+	close(fd);
+}
+
+TEST_F(child, test_unknown_fd)
+{
+	int fd;
+
+	fd = sys_pidfd_getfd(self->pidfd, UNKNOWN_FD, 0);
+	EXPECT_EQ(-1, fd) {
+		TH_LOG("getfd succeeded while fetching unknown fd");
+	};
+	EXPECT_EQ(EBADF, errno) {
+		TH_LOG("%s: getfd did not get EBADF", strerror(errno));
+	}
+}
+
+TEST(flags_set)
+{
+	ASSERT_EQ(-1, sys_pidfd_getfd(0, 0, 1));
+	EXPECT_EQ(errno, EINVAL);
+}
+
+TEST_HARNESS_MAIN
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper
  2020-01-03 16:29 ` [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper Sargun Dhillon
@ 2020-01-05 12:47   ` Christian Brauner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-01-05 12:47 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, oleg, luto, viro, gpascutto, ealvarez, fweimer, jld,
	arnd

On Fri, Jan 03, 2020 at 08:29:26AM -0800, Sargun Dhillon wrote:
> This introduces a function which can be used to fetch a file, given an
> arbitrary task. As long as the user holds a reference (refcnt) to the
> task_struct it is safe to call, and will either return NULL on failure,
> or a pointer to the file, with a refcnt.
> 
> This patch is based on Oleg Nesterov's (cf. [1]) patch from September
> 2018.
> 
> [1]: Link: https://lore.kernel.org/r/20180915160423.GA31461@redhat.com
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

Nit: the patch is titled "vfs, fdtable: Add get_task_file helper"
but the actual helper is called fget_task()

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall
  2020-01-03 16:29 ` [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall Sargun Dhillon
@ 2020-01-05 13:30   ` Christian Brauner
  2020-01-17 23:06   ` Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-01-05 13:30 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, oleg, luto, viro, gpascutto, ealvarez, fweimer, jld,
	arnd

On Fri, Jan 03, 2020 at 08:29:27AM -0800, Sargun Dhillon wrote:
> This syscall allows for the retrieval of file descriptors from other
> processes, based on their pidfd. This is possible using ptrace, and
> injection of parasitic code to inject code which leverages SCM_RIGHTS
> to move file descriptors between a tracee and a tracer. Unfortunately,
> ptrace comes with a high cost of requiring the process to be stopped,
> and breaks debuggers. This does not require stopping the process under
> manipulation.
> 
> One reason to use this is to allow sandboxers to take actions on file
> descriptors on the behalf of another process. For example, this can be
> combined with seccomp-bpf's user notification to do on-demand fd
> extraction and take privileged actions. One such privileged action
> is binding a socket to a privileged port.
> 
> This also adds the syscall to all architectures at the same time.
> 
> /* prototype */
>   /* flags is currently reserved and should be set to 0 */
>   int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
> 
> /* testing */
> Ran self-test suite on x86_64
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>

The prefered way of adding a syscall is to keep the implementation
separate from the wiring up into the syscall tables. So please split the
patch into two:
- [2/4] pidfd_getfd() implementation
- [3/4] pidfd_getfd() wiring up 
otherwise

Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 3/3] test: Add test for pidfd getfd
  2020-01-03 16:29 ` [PATCH v8 3/3] test: Add test for pidfd getfd Sargun Dhillon
@ 2020-01-05 14:20   ` Christian Brauner
  2020-01-05 19:08     ` Sargun Dhillon
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2020-01-05 14:20 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, oleg, luto, viro, gpascutto, ealvarez, fweimer, jld,
	arnd

On Fri, Jan 03, 2020 at 08:29:28AM -0800, Sargun Dhillon wrote:
> The following tests:
>   * Fetch FD, and then compare via kcmp
>   * Make sure getfd can be blocked by blocking ptrace_may_access
>   * Making sure fetching bad FDs fails
>   * Make sure trying to set flags to non-zero results in an EINVAL
> 
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> ---
>  tools/testing/selftests/pidfd/.gitignore      |   1 +
>  tools/testing/selftests/pidfd/Makefile        |   4 +-
>  .../selftests/pidfd/pidfd_getfd_test.c        | 227 ++++++++++++++++++
>  3 files changed, 230 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/pidfd/pidfd_getfd_test.c
> 
> diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
> index 8d069490e17b..3a779c084d96 100644
> --- a/tools/testing/selftests/pidfd/.gitignore
> +++ b/tools/testing/selftests/pidfd/.gitignore
> @@ -2,3 +2,4 @@ pidfd_open_test
>  pidfd_poll_test
>  pidfd_test
>  pidfd_wait
> +pidfd_getfd_test
> diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
> index 43db1b98e845..2071f7ab5dc9 100644
> --- a/tools/testing/selftests/pidfd/Makefile
> +++ b/tools/testing/selftests/pidfd/Makefile
> @@ -1,7 +1,7 @@
>  # SPDX-License-Identifier: GPL-2.0-only
> -CFLAGS += -g -I../../../../usr/include/ -pthread
> +CFLAGS += -g -I../../../../usr/include/ -pthread -Wall
>  
> -TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait
> +TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait pidfd_getfd_test
>  
>  include ../lib.mk
>  
> diff --git a/tools/testing/selftests/pidfd/pidfd_getfd_test.c b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
> new file mode 100644
> index 000000000000..26ca75597812
> --- /dev/null
> +++ b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
> @@ -0,0 +1,227 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <limits.h>
> +#include <linux/types.h>
> +#include <sched.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <syscall.h>
> +#include <sys/prctl.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <sys/socket.h>
> +#include <linux/kcmp.h>
> +
> +#include "pidfd.h"
> +#include "../kselftest.h"
> +#include "../kselftest_harness.h"
> +
> +/*
> + * UNKNOWN_FD is an fd number that should never exist in the child, as it is
> + * used to check the negative case.
> + */
> +#define UNKNOWN_FD 111
> +
> +static int sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
> +		    unsigned long idx2)
> +{
> +	return syscall(__NR_kcmp, pid1, pid2, type, idx1, idx2);
> +}
> +
> +static int sys_pidfd_getfd(int pidfd, int fd, int flags)
> +{
> +	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> +}

I think you can move this to the pidfd.h header as:

static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
{
	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
}

Note, this also needs an

#ifndef __NR_pidfd_getfd
__NR_pidfd_getfd -1
#endif
so that compilation doesn't fail.

> +
> +static int sys_memfd_create(const char *name, unsigned int flags)
> +{
> +	return syscall(__NR_memfd_create, name, flags);
> +}
> +
> +static int __child(int sk, int memfd)
> +{
> +	int ret;
> +	char buf;
> +
> +	/*
> +	 * Ensure we don't leave around a bunch of orphaned children if our
> +	 * tests fail.
> +	 */
> +	ret = prctl(PR_SET_PDEATHSIG, SIGKILL);
> +	if (ret) {
> +		fprintf(stderr, "%s: Child could not set DEATHSIG\n",
> +			strerror(errno));
> +		return EXIT_FAILURE;

return -1

> +	}
> +
> +	ret = send(sk, &memfd, sizeof(memfd), 0);
> +	if (ret != sizeof(memfd)) {
> +		fprintf(stderr, "%s: Child failed to send fd number\n",
> +			strerror(errno));
> +		return EXIT_FAILURE;

return -1

> +	}
> +
> +	while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
> +		if (buf == 'P') {
> +			ret = prctl(PR_SET_DUMPABLE, 0);
> +			if (ret < 0) {
> +				fprintf(stderr,
> +					"%s: Child failed to disable ptrace\n",
> +					strerror(errno));
> +				return EXIT_FAILURE;

return -1

> +			}
> +		} else {
> +			fprintf(stderr, "Child received unknown command %c\n",
> +				buf);
> +			return EXIT_FAILURE;

return -1

> +		}
> +		ret = send(sk, &buf, sizeof(buf), 0);
> +		if (ret != 1) {
> +			fprintf(stderr, "%s: Child failed to ack\n",
> +				strerror(errno));
> +			return EXIT_FAILURE;

return -1

> +		}
> +	}
> +
> +	if (ret < 0) {
> +		fprintf(stderr, "%s: Child failed to read from socket\n",
> +			strerror(errno));

Is this intentional that this is no failure?

> +	}
> +
> +	return EXIT_SUCCESS;

return 0

> +}
> +
> +static int child(int sk)
> +{
> +	int memfd, ret;
> +
> +	memfd = sys_memfd_create("test", 0);
> +	if (memfd < 0) {
> +		fprintf(stderr, "%s: Child could not create memfd\n",
> +			strerror(errno));
> +		ret = EXIT_FAILURE;

ret = -1;

> +	} else {
> +		ret = __child(sk, memfd);
> +		close(memfd);
> +	}
> +
> +	close(sk);
> +	return ret;
> +}
> +
> +FIXTURE(child)
> +{
> +	pid_t pid;
> +	int pidfd, sk, remote_fd;
> +};
> +
> +FIXTURE_SETUP(child)
> +{
> +	int ret, sk_pair[2];
> +
> +	ASSERT_EQ(0, socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair))
> +	{
> +		TH_LOG("%s: failed to create socketpair", strerror(errno));
> +	}
> +	self->sk = sk_pair[0];
> +
> +	self->pid = fork();
> +	ASSERT_GE(self->pid, 0);
> +
> +	if (self->pid == 0) {
> +		close(sk_pair[0]);
> +		exit(child(sk_pair[1]));

if (child(sk_pair[1]))
	_exit(EXIT_FAILURE);
_exit(EXIT_SUCCESS);

I would like to only use exit macros where one actually calls
{_}exit()s. It makes the logic easier to follow and ensures that one
doesn't accidently do an exit(-21345) or something (e.g. when adding new
code).

> +	}
> +
> +	close(sk_pair[1]);
> +
> +	self->pidfd = sys_pidfd_open(self->pid, 0);
> +	ASSERT_GE(self->pidfd, 0);
> +
> +	/*
> +	 * Wait for the child to complete setup. It'll send the remote memfd's
> +	 * number when ready.
> +	 */
> +	ret = recv(sk_pair[0], &self->remote_fd, sizeof(self->remote_fd), 0);
> +	ASSERT_EQ(sizeof(self->remote_fd), ret);
> +}
> +
> +FIXTURE_TEARDOWN(child)
> +{
> +	int status;
> +
> +	EXPECT_EQ(0, close(self->pidfd));
> +	EXPECT_EQ(0, close(self->sk));
> +
> +	EXPECT_EQ(waitpid(self->pid, &status, 0), self->pid);
> +	EXPECT_EQ(true, WIFEXITED(status));
> +	EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +TEST_F(child, disable_ptrace)
> +{
> +	int uid, fd;
> +	char c;
> +
> +	/*
> +	 * Turn into nobody if we're root, to avoid CAP_SYS_PTRACE
> +	 *
> +	 * The tests should run in their own process, so even this test fails,
> +	 * it shouldn't result in subsequent tests failing.
> +	 */
> +	uid = getuid();
> +	if (uid == 0)
> +		ASSERT_EQ(0, seteuid(USHRT_MAX));

Hm, isn't it safer to do 65535 explicitly? Since USHRT_MAX can
technically be greater than 65535.

> +
> +	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
> +	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
> +
> +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> +	EXPECT_EQ(-1, fd);
> +	EXPECT_EQ(EPERM, errno);
> +
> +	if (uid == 0)
> +		ASSERT_EQ(0, seteuid(0));
> +}
> +
> +TEST_F(child, fetch_fd)
> +{
> +	int fd, ret;
> +
> +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> +	ASSERT_GE(fd, 0);
> +
> +	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));

So most of these tests seem to take place when the child has already
called exit() - or at least it's very likely that the child has already
called exit() - and remains a zombie. That's not ideal because
that's not the common scenario/use-case. Usually the task of which we
want to get an fd will be alive. Also, if the child has already called
exit(), by the time it returns to userspace it should have already
called exit_files() and so I wonder whether this test would fail if it's
run after the child has exited. Maybe I'm missing something here... Is
there some ordering enforced by TEST_F()?

Also, what does self->pid point to? The fd of the already exited child?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 3/3] test: Add test for pidfd getfd
  2020-01-05 14:20   ` Christian Brauner
@ 2020-01-05 19:08     ` Sargun Dhillon
  2020-01-06 17:19       ` Christian Brauner
  0 siblings, 1 reply; 13+ messages in thread
From: Sargun Dhillon @ 2020-01-05 19:08 UTC (permalink / raw)
  To: linux-kernel, containers, linux-api, linux-fsdevel
  Cc: Sargun Dhillon, tycho, jannh, cyphar, christian.brauner, oleg,
	luto, viro, gpascutto, ealvarez, fweimer, jld, arnd

On Sun, Jan 05, 2020 at 03:20:23PM +0100, Christian Brauner wrote:
> On Fri, Jan 03, 2020 at 08:29:28AM -0800, Sargun Dhillon wrote:
> > +static int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > +{
> > +	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > +}
> 
> I think you can move this to the pidfd.h header as:
> 
> static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
> {
> 	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> }
> 
> Note, this also needs an
> 
> #ifndef __NR_pidfd_getfd
> __NR_pidfd_getfd -1
> #endif
> so that compilation doesn't fail.
> 
I'll go ahead and move this into pidfd.h, and follow the pattern there. I
don't think it's worth checking if each time the return code is ENOSYS.

Does it make sense to add something like:
#ifdef __NR_pidfd_getfd
TEST_HARNESS_MAIN
#else
int main(void)
{
	fprintf(stderr, "pidfd_getfd syscall not supported\n");
	return KSFT_SKIP;
}
#endif

to short-circuit the entire test suite?


> > +
> > +static int sys_memfd_create(const char *name, unsigned int flags)
> > +{
> > +	return syscall(__NR_memfd_create, name, flags);
> > +}
> > +
> > +static int __child(int sk, int memfd)
> > +{
> > +	int ret;
> > +	char buf;
> > +
> > +	/*
> > +	 * Ensure we don't leave around a bunch of orphaned children if our
> > +	 * tests fail.
> > +	 */
> > +	ret = prctl(PR_SET_PDEATHSIG, SIGKILL);
> > +	if (ret) {
> > +		fprintf(stderr, "%s: Child could not set DEATHSIG\n",
> > +			strerror(errno));
> > +		return EXIT_FAILURE;
> 
> return -1
> 
> > +	}
> > +
> > +	ret = send(sk, &memfd, sizeof(memfd), 0);
> > +	if (ret != sizeof(memfd)) {
> > +		fprintf(stderr, "%s: Child failed to send fd number\n",
> > +			strerror(errno));
> > +		return EXIT_FAILURE;
> 
> return -1
> 
> > +	}
> > +
> > +	while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
> > +		if (buf == 'P') {
> > +			ret = prctl(PR_SET_DUMPABLE, 0);
> > +			if (ret < 0) {
> > +				fprintf(stderr,
> > +					"%s: Child failed to disable ptrace\n",
> > +					strerror(errno));
> > +				return EXIT_FAILURE;
> 
> return -1
> 
> > +			}
> > +		} else {
> > +			fprintf(stderr, "Child received unknown command %c\n",
> > +				buf);
> > +			return EXIT_FAILURE;
> 
> return -1
> 
> > +		}
> > +		ret = send(sk, &buf, sizeof(buf), 0);
> > +		if (ret != 1) {
> > +			fprintf(stderr, "%s: Child failed to ack\n",
> > +				strerror(errno));
> > +			return EXIT_FAILURE;
> 
> return -1
> 
> > +		}
> > +	}
> > +
> > +	if (ret < 0) {
> > +		fprintf(stderr, "%s: Child failed to read from socket\n",
> > +			strerror(errno));
> 
> Is this intentional that this is no failure?
> 
My thought here, is the only case where this should happen is if the "ptrace 
command" was not properly "transmitted", and the ptrace test itself would fail.

I can add an explicit exit failure here.

> > +	}
> > +
> > +	return EXIT_SUCCESS;
> 
> return 0
> 
> > +}
> > +
> > +static int child(int sk)
> > +{
> > +	int memfd, ret;
> > +
> > +	memfd = sys_memfd_create("test", 0);
> > +	if (memfd < 0) {
> > +		fprintf(stderr, "%s: Child could not create memfd\n",
> > +			strerror(errno));
> > +		ret = EXIT_FAILURE;
> 
> ret = -1;
> 
> > +	} else {
> > +		ret = __child(sk, memfd);
> > +		close(memfd);
> > +	}
> > +
> > +	close(sk);
> > +	return ret;
> > +}
> > +
> > +FIXTURE(child)
> > +{
> > +	pid_t pid;
> > +	int pidfd, sk, remote_fd;
> > +};
> > +
> > +FIXTURE_SETUP(child)
> > +{
> > +	int ret, sk_pair[2];
> > +
> > +	ASSERT_EQ(0, socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair))
> > +	{
> > +		TH_LOG("%s: failed to create socketpair", strerror(errno));
> > +	}
> > +	self->sk = sk_pair[0];
> > +
> > +	self->pid = fork();
> > +	ASSERT_GE(self->pid, 0);
> > +
> > +	if (self->pid == 0) {
> > +		close(sk_pair[0]);
> > +		exit(child(sk_pair[1]));
> 
> if (child(sk_pair[1]))
> 	_exit(EXIT_FAILURE);
> _exit(EXIT_SUCCESS);
> 
> I would like to only use exit macros where one actually calls
> {_}exit()s. It makes the logic easier to follow and ensures that one
> doesn't accidently do an exit(-21345) or something (e.g. when adding new
> code).
> 
> > +	}
> > +
> > +	close(sk_pair[1]);
> > +
> > +	self->pidfd = sys_pidfd_open(self->pid, 0);
> > +	ASSERT_GE(self->pidfd, 0);
> > +
> > +	/*
> > +	 * Wait for the child to complete setup. It'll send the remote memfd's
> > +	 * number when ready.
> > +	 */
> > +	ret = recv(sk_pair[0], &self->remote_fd, sizeof(self->remote_fd), 0);
> > +	ASSERT_EQ(sizeof(self->remote_fd), ret);
> > +}
> > +
> > +FIXTURE_TEARDOWN(child)
> > +{
> > +	int status;
> > +
> > +	EXPECT_EQ(0, close(self->pidfd));
> > +	EXPECT_EQ(0, close(self->sk));
> > +
> > +	EXPECT_EQ(waitpid(self->pid, &status, 0), self->pid);
> > +	EXPECT_EQ(true, WIFEXITED(status));
> > +	EXPECT_EQ(0, WEXITSTATUS(status));
> > +}
> > +
> > +TEST_F(child, disable_ptrace)
> > +{
> > +	int uid, fd;
> > +	char c;
> > +
> > +	/*
> > +	 * Turn into nobody if we're root, to avoid CAP_SYS_PTRACE
> > +	 *
> > +	 * The tests should run in their own process, so even this test fails,
> > +	 * it shouldn't result in subsequent tests failing.
> > +	 */
> > +	uid = getuid();
> > +	if (uid == 0)
> > +		ASSERT_EQ(0, seteuid(USHRT_MAX));
> 
> Hm, isn't it safer to do 65535 explicitly? Since USHRT_MAX can
> technically be greater than 65535.
> 
I borrowed this from the BPF tests. I can hardcode something like:
#define NOBODY_UID 65535
and setuid to that, if you think it's safer?

> > +
> > +	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
> > +	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
> > +
> > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > +	EXPECT_EQ(-1, fd);
> > +	EXPECT_EQ(EPERM, errno);
> > +
> > +	if (uid == 0)
> > +		ASSERT_EQ(0, seteuid(0));
> > +}
> > +
> > +TEST_F(child, fetch_fd)
> > +{
> > +	int fd, ret;
> > +
> > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > +	ASSERT_GE(fd, 0);
> > +
> > +	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
> 
> So most of these tests seem to take place when the child has already
> called exit() - or at least it's very likely that the child has already
> called exit() - and remains a zombie. That's not ideal because
> that's not the common scenario/use-case. Usually the task of which we
> want to get an fd will be alive. Also, if the child has already called
> exit(), by the time it returns to userspace it should have already
> called exit_files() and so I wonder whether this test would fail if it's
> run after the child has exited. Maybe I'm missing something here... Is
> there some ordering enforced by TEST_F()?
Yeah, I think perhaps I was being too clever.
The timeline roughly goes something like this:

# Fixture bringup
[parent] creates socket_pair
[parent] forks, and passes pair down to child
[parent] waits to read sizeof(int) from the sk_pair
[child] creates memfd 
[__child] sends local memfd number to parent via sk_pair
[__child] waits to read from sk_pair
[parent] reads remote memfd number from socket
# Test
[parent] performs tests
# Fixture teardown
[parent] closes sk_pair
[__child] reads 0 from recv on sk_pair, implies the other end is closed
[__child] Returns / exits 0
[parent] Reaps child / reads exit code

---
The one case where this is not true, is if the parent sends 'P' to the sk pair,
it triggers setting PR_SET_DUMPABLE to 0, and then resumes waiting for the fd to 
close.

Maybe I'm being too clever? Instead, the alternative was to send explicit stop / 
start messages across the sk_pair, but that got kind of ugly. Do you have a 
better suggestion?

> 
> Also, what does self->pid point to? The fd of the already exited child?
It's just the pid of the child. pidfd is the fd of the (unexited) child.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 3/3] test: Add test for pidfd getfd
  2020-01-05 19:08     ` Sargun Dhillon
@ 2020-01-06 17:19       ` Christian Brauner
  2020-01-06 21:06         ` Sargun Dhillon
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2020-01-06 17:19 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, oleg, luto, viro, gpascutto, ealvarez, fweimer, jld,
	arnd

On Sun, Jan 05, 2020 at 07:08:13PM +0000, Sargun Dhillon wrote:
> On Sun, Jan 05, 2020 at 03:20:23PM +0100, Christian Brauner wrote:
> > On Fri, Jan 03, 2020 at 08:29:28AM -0800, Sargun Dhillon wrote:
> > > +static int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > > +{
> > > +	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > > +}
> > 
> > I think you can move this to the pidfd.h header as:
> > 
> > static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > {
> > 	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > }
> > 
> > Note, this also needs an
> > 
> > #ifndef __NR_pidfd_getfd
> > __NR_pidfd_getfd -1
> > #endif
> > so that compilation doesn't fail.
> > 
> I'll go ahead and move this into pidfd.h, and follow the pattern there. I
> don't think it's worth checking if each time the return code is ENOSYS.
> 
> Does it make sense to add something like:
> #ifdef __NR_pidfd_getfd
> TEST_HARNESS_MAIN
> #else
> int main(void)
> {
> 	fprintf(stderr, "pidfd_getfd syscall not supported\n");
> 	return KSFT_SKIP;
> }
> #endif
> 
> to short-circuit the entire test suite?

You mean the getfd testsuite? If so and that works, then sounds like a
good idea to me.

> 
> 
> > > +
> > > +static int sys_memfd_create(const char *name, unsigned int flags)
> > > +{
> > > +	return syscall(__NR_memfd_create, name, flags);
> > > +}
> > > +
> > > +static int __child(int sk, int memfd)
> > > +{
> > > +	int ret;
> > > +	char buf;
> > > +
> > > +	/*
> > > +	 * Ensure we don't leave around a bunch of orphaned children if our
> > > +	 * tests fail.
> > > +	 */
> > > +	ret = prctl(PR_SET_PDEATHSIG, SIGKILL);
> > > +	if (ret) {
> > > +		fprintf(stderr, "%s: Child could not set DEATHSIG\n",
> > > +			strerror(errno));
> > > +		return EXIT_FAILURE;
> > 
> > return -1
> > 
> > > +	}
> > > +
> > > +	ret = send(sk, &memfd, sizeof(memfd), 0);
> > > +	if (ret != sizeof(memfd)) {
> > > +		fprintf(stderr, "%s: Child failed to send fd number\n",
> > > +			strerror(errno));
> > > +		return EXIT_FAILURE;
> > 
> > return -1
> > 
> > > +	}
> > > +
> > > +	while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
> > > +		if (buf == 'P') {
> > > +			ret = prctl(PR_SET_DUMPABLE, 0);
> > > +			if (ret < 0) {
> > > +				fprintf(stderr,
> > > +					"%s: Child failed to disable ptrace\n",
> > > +					strerror(errno));
> > > +				return EXIT_FAILURE;
> > 
> > return -1
> > 
> > > +			}
> > > +		} else {
> > > +			fprintf(stderr, "Child received unknown command %c\n",
> > > +				buf);
> > > +			return EXIT_FAILURE;
> > 
> > return -1
> > 
> > > +		}
> > > +		ret = send(sk, &buf, sizeof(buf), 0);
> > > +		if (ret != 1) {
> > > +			fprintf(stderr, "%s: Child failed to ack\n",
> > > +				strerror(errno));
> > > +			return EXIT_FAILURE;
> > 
> > return -1
> > 
> > > +		}
> > > +	}
> > > +
> > > +	if (ret < 0) {
> > > +		fprintf(stderr, "%s: Child failed to read from socket\n",
> > > +			strerror(errno));
> > 
> > Is this intentional that this is no failure?
> > 
> My thought here, is the only case where this should happen is if the "ptrace 
> command" was not properly "transmitted", and the ptrace test itself would fail.
> 
> I can add an explicit exit failure here.

Ok.

> 
> > > +	}
> > > +
> > > +	return EXIT_SUCCESS;
> > 
> > return 0
> > 
> > > +}
> > > +
> > > +static int child(int sk)
> > > +{
> > > +	int memfd, ret;
> > > +
> > > +	memfd = sys_memfd_create("test", 0);
> > > +	if (memfd < 0) {
> > > +		fprintf(stderr, "%s: Child could not create memfd\n",
> > > +			strerror(errno));
> > > +		ret = EXIT_FAILURE;
> > 
> > ret = -1;
> > 
> > > +	} else {
> > > +		ret = __child(sk, memfd);
> > > +		close(memfd);
> > > +	}
> > > +
> > > +	close(sk);
> > > +	return ret;
> > > +}
> > > +
> > > +FIXTURE(child)
> > > +{
> > > +	pid_t pid;
> > > +	int pidfd, sk, remote_fd;
> > > +};
> > > +
> > > +FIXTURE_SETUP(child)
> > > +{
> > > +	int ret, sk_pair[2];
> > > +
> > > +	ASSERT_EQ(0, socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair))
> > > +	{
> > > +		TH_LOG("%s: failed to create socketpair", strerror(errno));
> > > +	}
> > > +	self->sk = sk_pair[0];
> > > +
> > > +	self->pid = fork();
> > > +	ASSERT_GE(self->pid, 0);
> > > +
> > > +	if (self->pid == 0) {
> > > +		close(sk_pair[0]);
> > > +		exit(child(sk_pair[1]));
> > 
> > if (child(sk_pair[1]))
> > 	_exit(EXIT_FAILURE);
> > _exit(EXIT_SUCCESS);
> > 
> > I would like to only use exit macros where one actually calls
> > {_}exit()s. It makes the logic easier to follow and ensures that one
> > doesn't accidently do an exit(-21345) or something (e.g. when adding new
> > code).
> > 
> > > +	}
> > > +
> > > +	close(sk_pair[1]);
> > > +
> > > +	self->pidfd = sys_pidfd_open(self->pid, 0);
> > > +	ASSERT_GE(self->pidfd, 0);
> > > +
> > > +	/*
> > > +	 * Wait for the child to complete setup. It'll send the remote memfd's
> > > +	 * number when ready.
> > > +	 */
> > > +	ret = recv(sk_pair[0], &self->remote_fd, sizeof(self->remote_fd), 0);
> > > +	ASSERT_EQ(sizeof(self->remote_fd), ret);
> > > +}
> > > +
> > > +FIXTURE_TEARDOWN(child)
> > > +{
> > > +	int status;
> > > +
> > > +	EXPECT_EQ(0, close(self->pidfd));
> > > +	EXPECT_EQ(0, close(self->sk));
> > > +
> > > +	EXPECT_EQ(waitpid(self->pid, &status, 0), self->pid);
> > > +	EXPECT_EQ(true, WIFEXITED(status));
> > > +	EXPECT_EQ(0, WEXITSTATUS(status));
> > > +}
> > > +
> > > +TEST_F(child, disable_ptrace)
> > > +{
> > > +	int uid, fd;
> > > +	char c;
> > > +
> > > +	/*
> > > +	 * Turn into nobody if we're root, to avoid CAP_SYS_PTRACE
> > > +	 *
> > > +	 * The tests should run in their own process, so even this test fails,
> > > +	 * it shouldn't result in subsequent tests failing.
> > > +	 */
> > > +	uid = getuid();
> > > +	if (uid == 0)
> > > +		ASSERT_EQ(0, seteuid(USHRT_MAX));
> > 
> > Hm, isn't it safer to do 65535 explicitly? Since USHRT_MAX can
> > technically be greater than 65535.
> > 
> I borrowed this from the BPF tests. I can hardcode something like:
> #define NOBODY_UID 65535
> and setuid to that, if you think it's safer?

If you want to specifically seteuid() to 65535 then yes, using the
hard-coded number or using a dedicated macro seems better.

> 
> > > +
> > > +	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
> > > +	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
> > > +
> > > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > > +	EXPECT_EQ(-1, fd);
> > > +	EXPECT_EQ(EPERM, errno);
> > > +
> > > +	if (uid == 0)
> > > +		ASSERT_EQ(0, seteuid(0));
> > > +}
> > > +
> > > +TEST_F(child, fetch_fd)
> > > +{
> > > +	int fd, ret;
> > > +
> > > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > > +	ASSERT_GE(fd, 0);
> > > +
> > > +	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
> > 
> > So most of these tests seem to take place when the child has already
> > called exit() - or at least it's very likely that the child has already
> > called exit() - and remains a zombie. That's not ideal because
> > that's not the common scenario/use-case. Usually the task of which we
> > want to get an fd will be alive. Also, if the child has already called
> > exit(), by the time it returns to userspace it should have already
> > called exit_files() and so I wonder whether this test would fail if it's
> > run after the child has exited. Maybe I'm missing something here... Is
> > there some ordering enforced by TEST_F()?
> Yeah, I think perhaps I was being too clever.
> The timeline roughly goes something like this:
> 
> # Fixture bringup
> [parent] creates socket_pair
> [parent] forks, and passes pair down to child
> [parent] waits to read sizeof(int) from the sk_pair
> [child] creates memfd 
> [__child] sends local memfd number to parent via sk_pair
> [__child] waits to read from sk_pair
> [parent] reads remote memfd number from socket
> # Test
> [parent] performs tests
> # Fixture teardown
> [parent] closes sk_pair
> [__child] reads 0 from recv on sk_pair, implies the other end is closed
> [__child] Returns / exits 0
> [parent] Reaps child / reads exit code
> 
> ---
> The one case where this is not true, is if the parent sends 'P' to the sk pair,
> it triggers setting PR_SET_DUMPABLE to 0, and then resumes waiting for the fd to 
> close.
> 
> Maybe I'm being too clever? Instead, the alternative was to send explicit stop / 
> start messages across the sk_pair, but that got kind of ugly. Do you have a 
> better suggestion?

If I understand correctly you just need to block the child to stop it
from exiting. Couldn't you do this by simply calling recv() on the
socket in the child thereby blocking it? At the end you just send a
final message to proceed and if that doesn't work SIGKILL it?

> 
> > 
> > Also, what does self->pid point to? The fd of the already exited child?
> It's just the pid of the child. pidfd is the fd of the (unexited) child.

Ah, thanks!
Christian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 3/3] test: Add test for pidfd getfd
  2020-01-06 17:19       ` Christian Brauner
@ 2020-01-06 21:06         ` Sargun Dhillon
  2020-01-07  8:55           ` Christian Brauner
  0 siblings, 1 reply; 13+ messages in thread
From: Sargun Dhillon @ 2020-01-06 21:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, oleg, luto, viro, gpascutto, ealvarez, fweimer, jld,
	arnd

On Mon, Jan 06, 2020 at 06:19:41PM +0100, Christian Brauner wrote:
> On Sun, Jan 05, 2020 at 07:08:13PM +0000, Sargun Dhillon wrote:
> > On Sun, Jan 05, 2020 at 03:20:23PM +0100, Christian Brauner wrote:
> > > On Fri, Jan 03, 2020 at 08:29:28AM -0800, Sargun Dhillon wrote:
> > > > +static int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > > > +{
> > > > +	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > > > +}
> > > 
> > > I think you can move this to the pidfd.h header as:
> > > 
> > > static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > > {
> > > 	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > > }
> > > 
> > > Note, this also needs an
> > > 
> > > #ifndef __NR_pidfd_getfd
> > > __NR_pidfd_getfd -1
> > > #endif
> > > so that compilation doesn't fail.
> > > 
> > I'll go ahead and move this into pidfd.h, and follow the pattern there. I
> > don't think it's worth checking if each time the return code is ENOSYS.
> > 
> > Does it make sense to add something like:
> > #ifdef __NR_pidfd_getfd
> > TEST_HARNESS_MAIN
> > #else
> > int main(void)
> > {
> > 	fprintf(stderr, "pidfd_getfd syscall not supported\n");
> > 	return KSFT_SKIP;
> > }
> > #endif
> > 
> > to short-circuit the entire test suite?
> 
> You mean the getfd testsuite? If so and that works, then sounds like a
> good idea to me.
> 
> > 
> > 
> > 
> > > 
> > > Hm, isn't it safer to do 65535 explicitly? Since USHRT_MAX can
> > > technically be greater than 65535.
> > > 
> > I borrowed this from the BPF tests. I can hardcode something like:
> > #define NOBODY_UID 65535
> > and setuid to that, if you think it's safer?
> 
> If you want to specifically seteuid() to 65535 then yes, using the
> hard-coded number or using a dedicated macro seems better.
> 
> > 
> > > > +
> > > > +	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
> > > > +	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
> > > > +
> > > > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > > > +	EXPECT_EQ(-1, fd);
> > > > +	EXPECT_EQ(EPERM, errno);
> > > > +
> > > > +	if (uid == 0)
> > > > +		ASSERT_EQ(0, seteuid(0));
> > > > +}
> > > > +
> > > > +TEST_F(child, fetch_fd)
> > > > +{
> > > > +	int fd, ret;
> > > > +
> > > > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > > > +	ASSERT_GE(fd, 0);
> > > > +
> > > > +	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
> > > 
> > > So most of these tests seem to take place when the child has already
> > > called exit() - or at least it's very likely that the child has already
> > > called exit() - and remains a zombie. That's not ideal because
> > > that's not the common scenario/use-case. Usually the task of which we
> > > want to get an fd will be alive. Also, if the child has already called
> > > exit(), by the time it returns to userspace it should have already
> > > called exit_files() and so I wonder whether this test would fail if it's
> > > run after the child has exited. Maybe I'm missing something here... Is
> > > there some ordering enforced by TEST_F()?
> > Yeah, I think perhaps I was being too clever.
> > The timeline roughly goes something like this:
> > 
> > # Fixture bringup
> > [parent] creates socket_pair
> > [parent] forks, and passes pair down to child
> > [parent] waits to read sizeof(int) from the sk_pair
> > [child] creates memfd 
> > [__child] sends local memfd number to parent via sk_pair
> > [__child] waits to read from sk_pair
> > [parent] reads remote memfd number from socket
> > # Test
> > [parent] performs tests
> > # Fixture teardown
> > [parent] closes sk_pair
> > [__child] reads 0 from recv on sk_pair, implies the other end is closed
> > [__child] Returns / exits 0
> > [parent] Reaps child / reads exit code
> > 
> > ---
> > The one case where this is not true, is if the parent sends 'P' to the sk pair,
> > it triggers setting PR_SET_DUMPABLE to 0, and then resumes waiting for the fd to 
> > close.
> > 
> > Maybe I'm being too clever? Instead, the alternative was to send explicit stop / 
> > start messages across the sk_pair, but that got kind of ugly. Do you have a 
> > better suggestion?
> 
> If I understand correctly you just need to block the child to stop it
> from exiting. Couldn't you do this by simply calling recv() on the
> socket in the child thereby blocking it? At the end you just send a
> final message to proceed and if that doesn't work SIGKILL it?
> 
This already exists in:
while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
	if (buf == 'P') {
		ret = prctl(PR_SET_DUMPABLE, 0);
		if (ret < 0) {
			fprintf(stderr,
				"%s: Child failed to disable ptrace\n",
				strerror(errno));
			return -1;
		}
	} else {
		fprintf(stderr, "Child received unknown command %c\n",
			buf);
		return -1;
	}
	ret = send(sk, &buf, sizeof(buf), 0);
	if (ret != 1) {
		fprintf(stderr, "%s: Child failed to ack\n",
			strerror(errno));
		return -1;
	}
}
----
This will block until the close(self->sk) in the fixture teardown. Then ret
returns 0, and the child should exit. Maybe a comment like:
/*
 * The fixture setup is completed at this point. The tests will run.
 *
 * Either we will read 'P' off of the sk, indicating that we need
 * to disable ptrace, or if the other side of the socket is closed
 * recv will return 0-bytes. This indicates that the fixture teardown
 * has occured, and the child should exit.
 */
would be useful?

> > 
> > > 
> > > Also, what does self->pid point to? The fd of the already exited child?
> > It's just the pid of the child. pidfd is the fd of the (unexited) child.
I have no idea if it's pro / against the commenting style to blow up that
structure:
FIXTURE(child)
{
	/* pid points to the child which we are fetching FDs from */
	pid_t pid;
	/* pidfd is the pidfd of the child */
	int pidfd;
	/*
	 * sk is our side of the socketpair used to communicate with the child.
	 * When it is closed, the child will exit.
	 */
	int sk;
	/*
	 * remote_fd is the number of the FD which we are trying to retrieve
	 * from the child.
	 */
	int remote_fd;
};

> 
> Ah, thanks!
> Christian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 3/3] test: Add test for pidfd getfd
  2020-01-06 21:06         ` Sargun Dhillon
@ 2020-01-07  8:55           ` Christian Brauner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2020-01-07  8:55 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, oleg, luto, viro, gpascutto, ealvarez, fweimer, jld,
	arnd

On Mon, Jan 06, 2020 at 09:06:47PM +0000, Sargun Dhillon wrote:
> On Mon, Jan 06, 2020 at 06:19:41PM +0100, Christian Brauner wrote:
> > On Sun, Jan 05, 2020 at 07:08:13PM +0000, Sargun Dhillon wrote:
> > > On Sun, Jan 05, 2020 at 03:20:23PM +0100, Christian Brauner wrote:
> > > > On Fri, Jan 03, 2020 at 08:29:28AM -0800, Sargun Dhillon wrote:
> > > > > +static int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > > > > +{
> > > > > +	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > > > > +}
> > > > 
> > > > I think you can move this to the pidfd.h header as:
> > > > 
> > > > static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
> > > > {
> > > > 	return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
> > > > }
> > > > 
> > > > Note, this also needs an
> > > > 
> > > > #ifndef __NR_pidfd_getfd
> > > > __NR_pidfd_getfd -1
> > > > #endif
> > > > so that compilation doesn't fail.
> > > > 
> > > I'll go ahead and move this into pidfd.h, and follow the pattern there. I
> > > don't think it's worth checking if each time the return code is ENOSYS.
> > > 
> > > Does it make sense to add something like:
> > > #ifdef __NR_pidfd_getfd
> > > TEST_HARNESS_MAIN
> > > #else
> > > int main(void)
> > > {
> > > 	fprintf(stderr, "pidfd_getfd syscall not supported\n");
> > > 	return KSFT_SKIP;
> > > }
> > > #endif
> > > 
> > > to short-circuit the entire test suite?
> > 
> > You mean the getfd testsuite? If so and that works, then sounds like a
> > good idea to me.
> > 
> > > 
> > > 
> > > 
> > > > 
> > > > Hm, isn't it safer to do 65535 explicitly? Since USHRT_MAX can
> > > > technically be greater than 65535.
> > > > 
> > > I borrowed this from the BPF tests. I can hardcode something like:
> > > #define NOBODY_UID 65535
> > > and setuid to that, if you think it's safer?
> > 
> > If you want to specifically seteuid() to 65535 then yes, using the
> > hard-coded number or using a dedicated macro seems better.
> > 
> > > 
> > > > > +
> > > > > +	ASSERT_EQ(1, send(self->sk, "P", 1, 0));
> > > > > +	ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
> > > > > +
> > > > > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > > > > +	EXPECT_EQ(-1, fd);
> > > > > +	EXPECT_EQ(EPERM, errno);
> > > > > +
> > > > > +	if (uid == 0)
> > > > > +		ASSERT_EQ(0, seteuid(0));
> > > > > +}
> > > > > +
> > > > > +TEST_F(child, fetch_fd)
> > > > > +{
> > > > > +	int fd, ret;
> > > > > +
> > > > > +	fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
> > > > > +	ASSERT_GE(fd, 0);
> > > > > +
> > > > > +	EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
> > > > 
> > > > So most of these tests seem to take place when the child has already
> > > > called exit() - or at least it's very likely that the child has already
> > > > called exit() - and remains a zombie. That's not ideal because
> > > > that's not the common scenario/use-case. Usually the task of which we
> > > > want to get an fd will be alive. Also, if the child has already called
> > > > exit(), by the time it returns to userspace it should have already
> > > > called exit_files() and so I wonder whether this test would fail if it's
> > > > run after the child has exited. Maybe I'm missing something here... Is
> > > > there some ordering enforced by TEST_F()?
> > > Yeah, I think perhaps I was being too clever.
> > > The timeline roughly goes something like this:
> > > 
> > > # Fixture bringup
> > > [parent] creates socket_pair
> > > [parent] forks, and passes pair down to child
> > > [parent] waits to read sizeof(int) from the sk_pair
> > > [child] creates memfd 
> > > [__child] sends local memfd number to parent via sk_pair
> > > [__child] waits to read from sk_pair
> > > [parent] reads remote memfd number from socket
> > > # Test
> > > [parent] performs tests
> > > # Fixture teardown
> > > [parent] closes sk_pair
> > > [__child] reads 0 from recv on sk_pair, implies the other end is closed
> > > [__child] Returns / exits 0
> > > [parent] Reaps child / reads exit code
> > > 
> > > ---
> > > The one case where this is not true, is if the parent sends 'P' to the sk pair,
> > > it triggers setting PR_SET_DUMPABLE to 0, and then resumes waiting for the fd to 
> > > close.
> > > 
> > > Maybe I'm being too clever? Instead, the alternative was to send explicit stop / 
> > > start messages across the sk_pair, but that got kind of ugly. Do you have a 
> > > better suggestion?
> > 
> > If I understand correctly you just need to block the child to stop it
> > from exiting. Couldn't you do this by simply calling recv() on the
> > socket in the child thereby blocking it? At the end you just send a
> > final message to proceed and if that doesn't work SIGKILL it?
> > 
> This already exists in:
> while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
> 	if (buf == 'P') {
> 		ret = prctl(PR_SET_DUMPABLE, 0);
> 		if (ret < 0) {
> 			fprintf(stderr,
> 				"%s: Child failed to disable ptrace\n",
> 				strerror(errno));
> 			return -1;
> 		}
> 	} else {
> 		fprintf(stderr, "Child received unknown command %c\n",
> 			buf);
> 		return -1;
> 	}
> 	ret = send(sk, &buf, sizeof(buf), 0);
> 	if (ret != 1) {
> 		fprintf(stderr, "%s: Child failed to ack\n",
> 			strerror(errno));
> 		return -1;
> 	}
> }
> ----
> This will block until the close(self->sk) in the fixture teardown. Then ret
> returns 0, and the child should exit. Maybe a comment like:
> /*
>  * The fixture setup is completed at this point. The tests will run.
>  *
>  * Either we will read 'P' off of the sk, indicating that we need
>  * to disable ptrace, or if the other side of the socket is closed
>  * recv will return 0-bytes. This indicates that the fixture teardown
>  * has occured, and the child should exit.
>  */
> would be useful?

Ah yeah, that would be helpful. I missed that while reading the code. So
the child is definitely still alive when te tests are run, it seems.
That explains why this didn't fail. :)

> 
> > > 
> > > > 
> > > > Also, what does self->pid point to? The fd of the already exited child?
> > > It's just the pid of the child. pidfd is the fd of the (unexited) child.
> I have no idea if it's pro / against the commenting style to blow up that
> structure:

I think it's fine to comment it like that.

Thanks!
Christian

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall
  2020-01-03 16:29 ` [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall Sargun Dhillon
  2020-01-05 13:30   ` Christian Brauner
@ 2020-01-17 23:06   ` Matthew Wilcox
  2020-01-17 23:14     ` Aleksa Sarai
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2020-01-17 23:06 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: linux-kernel, containers, linux-api, linux-fsdevel, tycho, jannh,
	cyphar, christian.brauner, oleg, luto, viro, gpascutto, ealvarez,
	fweimer, jld, arnd

On Fri, Jan 03, 2020 at 08:29:27AM -0800, Sargun Dhillon wrote:
> +++ b/kernel/pid.c
> @@ -578,3 +578,93 @@ void __init pid_idr_init(void)
>  	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
>  			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
>  }
> +
> +static struct file *__pidfd_fget(struct task_struct *task, int fd)
> +{
> +	struct file *file;
> +	int ret;
> +
> +	ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
> +		file = fget_task(task, fd);
> +	else
> +		file = ERR_PTR(-EPERM);
> +
> +	mutex_unlock(&task->signal->cred_guard_mutex);
> +
> +	return file ?: ERR_PTR(-EBADF);
> +}
> +
> +static int pidfd_getfd(struct pid *pid, int fd)
> +{
> +	struct task_struct *task;
> +	struct file *file;
> +	int ret;
> +
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task)
> +		return -ESRCH;
> +
> +	file = __pidfd_fget(task, fd);
> +	put_task_struct(task);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	ret = security_file_receive(file);
> +	if (ret) {
> +		fput(file);
> +		return ret;
> +	}
> +
> +	ret = get_unused_fd_flags(O_CLOEXEC);
> +	if (ret < 0)
> +		fput(file);
> +	else
> +		fd_install(ret, file);
> +
> +	return ret;
> +}
> +
> +/**
> + * sys_pidfd_getfd() - Get a file descriptor from another process
> + *
> + * @pidfd:	the pidfd file descriptor of the process
> + * @fd:		the file descriptor number to get
> + * @flags:	flags on how to get the fd (reserved)
> + *
> + * This syscall gets a copy of a file descriptor from another process
> + * based on the pidfd, and file descriptor number. It requires that
> + * the calling process has the ability to ptrace the process represented
> + * by the pidfd. The process which is having its file descriptor copied
> + * is otherwise unaffected.
> + *
> + * Return: On success, a cloexec file descriptor is returned.
> + *         On error, a negative errno number will be returned.
> + */

We don't usually kernel-doc syscalls.  They should have manpages instead.

> +SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
> +		unsigned int, flags)
> +{
> +	struct pid *pid;
> +	struct fd f;
> +	int ret;
> +
> +	/* flags is currently unused - make sure it's unset */
> +	if (flags)
> +		return -EINVAL;

Is EINVAL the right errno here?  Often we use ENOSYS for bad flags to
syscalls.

> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_pid(f.file);
> +	if (IS_ERR(pid))
> +		ret = PTR_ERR(pid);
> +	else
> +		ret = pidfd_getfd(pid, fd);

You can simplify this by having pidfd_pid() return ERR_PTR(-EBADF) if
!f.file, and having pidfd_getfd() return PTR_ERR() if IS_ERR(pid).  Then
this function looks like:

	if (flags)
		return -EINVAL;

	f = fdget(pidfd);
	pid = pidfd_pid(f.file);
	ret = pidfd_getfd(pid, fd);
	fdput(f);
	return ret;

You could even eliminate the 'pid' variable and just do:

	ret = pidfd_getfd(pidfd_pid(f.file), fd);

but that's a step too far for me.

It's unfortunate that -EBADF might mean that either the first or second
argument is a bad fd number.  I'm not sure I have a good alternative though.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall
  2020-01-17 23:06   ` Matthew Wilcox
@ 2020-01-17 23:14     ` Aleksa Sarai
  0 siblings, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2020-01-17 23:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Sargun Dhillon, linux-kernel, containers, linux-api,
	linux-fsdevel, tycho, jannh, christian.brauner, oleg, luto, viro,
	gpascutto, ealvarez, fweimer, jld, arnd

[-- Attachment #1: Type: text/plain, Size: 807 bytes --]

On 2020-01-17, Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Jan 03, 2020 at 08:29:27AM -0800, Sargun Dhillon wrote:
> > +SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
> > +		unsigned int, flags)
> > +{
> > +	struct pid *pid;
> > +	struct fd f;
> > +	int ret;
> > +
> > +	/* flags is currently unused - make sure it's unset */
> > +	if (flags)
> > +		return -EINVAL;
> 
> Is EINVAL the right errno here?  Often we use ENOSYS for bad flags to
> syscalls.

I don't think that's right -- every syscall I've seen gives you -EINVAL
for invalid flags (not to mention -ENOSYS would mean userspace would be
confused as to whether the syscall is actually supported by the kernel).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-01-17 23:15 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-03 16:29 [PATCH v8 0/3] Add pidfd_getfd syscall Sargun Dhillon
2020-01-03 16:29 ` [PATCH v8 1/3] vfs, fdtable: Add get_task_file helper Sargun Dhillon
2020-01-05 12:47   ` Christian Brauner
2020-01-03 16:29 ` [PATCH v8 2/3] pid: Introduce pidfd_getfd syscall Sargun Dhillon
2020-01-05 13:30   ` Christian Brauner
2020-01-17 23:06   ` Matthew Wilcox
2020-01-17 23:14     ` Aleksa Sarai
2020-01-03 16:29 ` [PATCH v8 3/3] test: Add test for pidfd getfd Sargun Dhillon
2020-01-05 14:20   ` Christian Brauner
2020-01-05 19:08     ` Sargun Dhillon
2020-01-06 17:19       ` Christian Brauner
2020-01-06 21:06         ` Sargun Dhillon
2020-01-07  8:55           ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).