linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/3] Support userspace-selected fds
@ 2020-04-14  2:14 Josh Triplett
  2020-04-14  2:15 ` [PATCH v4 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation Josh Triplett
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Josh Triplett @ 2020-04-14  2:14 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, io-uring, linux-arch
  Cc: Alexander Viro, Arnd Bergmann, Jens Axboe, Aleksa Sarai

5.8 material, not intended for 5.7.

Inspired by the X protocol's handling of XIDs, allow userspace to select
the file descriptor opened by a call like openat2, so that it can use
the resulting file descriptor in subsequent system calls without waiting
for the response to the initial openat2 syscall.

The first patch is independent of the other two; it allows reserving
file descriptors below a certain minimum for userspace-selected fd
allocation only.

The second patch implements userspace-selected fd allocation for
openat2, introducing a new O_SPECIFIC_FD flag and an fd field in struct
open_how. In io_uring, this allows sequences like openat2/read/close
without waiting for the openat2 to complete. Multiple such sequences can
overlap, as long as each uses a distinct file descriptor.

The third patch adds userspace-selected fd allocation to pipe2 as well.
I did this partly as a demonstration of how simple it is to wire up
O_SPECIFIC_FD support for any fd-allocating system call, and partly in
the hopes that this may make it more useful to wire up io_uring support
for pipe2 in the future.

If this gets accepted, I'm happy to also write corresponding manpage
patches.

v4:

Changed fd field to __u32.
Expanded and consolidated checks that return -EINVAL for invalid arguments.
Simplified and commented build_open_how.
Add documentation comment for fd field.
Add kselftests.

Thanks to Aleksa Sarai for feedback.

v3:

This new version has an API to atomically increase the minimum fd and
return the previous minimum, rather than just getting and setting the
minimum; this makes it easier to allocate a range. (A library that might
initialize after the program has already opened other file descriptors
may need to check for existing open fds in the range after reserving it,
and reserve more fds if needed; this can be done entirely in userspace,
and we can't really do anything simpler in the kernel due to limitations
on file-descriptor semantics, so this patch series avoids introducing
any extra complexity in the kernel.)

This new version also supports a __get_specific_unused_fd_flags call
which accepts the limit for RLIMIT_NOFILE as an argument, analogous to
__get_unused_fd_flags, since io_uring needs that to correctly handle
RLIMIT_NOFILE.

Thanks to Jens Axboe for review and feedback.

v2:

Version 2 was a version incorporated into a larger patch series from Jens Axboe
on io_uring.

Josh Triplett (3):
  fs: Support setting a minimum fd for "lowest available fd" allocation
  fs: openat2: Extend open_how to allow userspace-selected fds
  fs: pipe2: Support O_SPECIFIC_FD

 fs/fcntl.c                                    |  2 +-
 fs/file.c                                     | 62 +++++++++++++++++--
 fs/io_uring.c                                 |  3 +-
 fs/open.c                                     |  8 ++-
 fs/pipe.c                                     | 16 +++--
 include/linux/fcntl.h                         |  5 +-
 include/linux/fdtable.h                       |  1 +
 include/linux/file.h                          |  4 ++
 include/uapi/asm-generic/fcntl.h              |  4 ++
 include/uapi/linux/openat2.h                  |  3 +
 include/uapi/linux/prctl.h                    |  3 +
 kernel/sys.c                                  |  5 ++
 tools/testing/selftests/openat2/helpers.c     |  2 +-
 tools/testing/selftests/openat2/helpers.h     | 21 +++++--
 .../testing/selftests/openat2/openat2_test.c  | 29 ++++++++-
 15 files changed, 144 insertions(+), 24 deletions(-)

-- 
2.26.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v4 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation
  2020-04-14  2:14 [PATCH v4 0/3] Support userspace-selected fds Josh Triplett
@ 2020-04-14  2:15 ` Josh Triplett
  2020-04-14  2:15 ` [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds Josh Triplett
  2020-04-14  2:16 ` [PATCH v4 3/3] fs: pipe2: Support O_SPECIFIC_FD Josh Triplett
  2 siblings, 0 replies; 13+ messages in thread
From: Josh Triplett @ 2020-04-14  2:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, io-uring, linux-arch
  Cc: Alexander Viro, Arnd Bergmann, Jens Axboe, Aleksa Sarai

Some applications want to prevent the usual "lowest available fd"
allocation from allocating certain file descriptors. For instance, they
may want to prevent allocation of a closed fd 0, 1, or 2 other than via
dup2/dup3, or reserve some low file descriptors for other purposes.

Add a prctl to increase the minimum fd and return the previous minimum.

System calls that allocate a specific file descriptor, such as
dup2/dup3, ignore this minimum.

exec resets the minimum fd, to prevent one program from interfering with
another program's expectations about fd allocation.

Test program:

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/prctl.h>

    int main(int argc, char *argv[])
    {
        if (prctl(PR_INCREASE_MIN_FD, 100, 0, 0, 0) < 0)
            err(1, "prctl");
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
            err(1, "open");
        printf("%d\n", fd); // prints 100
        return 0;
    }

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
 fs/file.c                  | 23 +++++++++++++++++------
 include/linux/fdtable.h    |  1 +
 include/linux/file.h       |  1 +
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               |  5 +++++
 5 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..ba06140d89af 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -286,7 +286,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	spin_lock_init(&newf->file_lock);
 	newf->resize_in_progress = false;
 	init_waitqueue_head(&newf->resize_wait);
-	newf->next_fd = 0;
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
@@ -295,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
+	newf->next_fd = newf->min_fd = oldf->min_fd;
 	old_fdt = files_fdtable(oldf);
 	open_files = count_open_files(old_fdt);
 
@@ -487,9 +487,7 @@ int __alloc_fd(struct files_struct *files,
 	spin_lock(&files->file_lock);
 repeat:
 	fdt = files_fdtable(files);
-	fd = start;
-	if (fd < files->next_fd)
-		fd = files->next_fd;
+	fd = max3(start, files->min_fd, files->next_fd);
 
 	if (fd < fdt->max_fds)
 		fd = find_next_fd(fdt, fd);
@@ -514,7 +512,7 @@ int __alloc_fd(struct files_struct *files,
 		goto repeat;
 
 	if (start <= files->next_fd)
-		files->next_fd = fd + 1;
+		files->next_fd = max(fd + 1, files->min_fd);
 
 	__set_open_fd(fd, fdt);
 	if (flags & O_CLOEXEC)
@@ -555,7 +553,7 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = files_fdtable(files);
 	__clear_open_fd(fd, fdt);
-	if (fd < files->next_fd)
+	if (fd < files->next_fd && fd >= files->min_fd)
 		files->next_fd = fd;
 }
 
@@ -684,6 +682,7 @@ void do_close_on_exec(struct files_struct *files)
 
 	/* exec unshares first */
 	spin_lock(&files->file_lock);
+	files->min_fd = 0;
 	for (i = 0; ; i++) {
 		unsigned long set;
 		unsigned fd = i * BITS_PER_LONG;
@@ -865,6 +864,18 @@ bool get_close_on_exec(unsigned int fd)
 	return res;
 }
 
+unsigned int increase_min_fd(unsigned int num)
+{
+	struct files_struct *files = current->files;
+	unsigned int old_min_fd;
+
+	spin_lock(&files->file_lock);
+	old_min_fd = files->min_fd;
+	files->min_fd += num;
+	spin_unlock(&files->file_lock);
+	return old_min_fd;
+}
+
 static int do_dup2(struct files_struct *files,
 	struct file *file, unsigned fd, unsigned flags)
 __releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..d1980443d8b3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -60,6 +60,7 @@ struct files_struct {
    */
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
+	unsigned int min_fd; /* min for "lowest available fd" allocation */
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..b67986f818d2 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -88,6 +88,7 @@ extern bool get_close_on_exec(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
+extern unsigned int increase_min_fd(unsigned int num);
 
 extern void fd_install(unsigned int fd, struct file *file);
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..916327272d21 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Increase minimum file descriptor for "lowest available fd" allocation */
+#define PR_INCREASE_MIN_FD		59
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..daa0ce43cecc 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2514,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_INCREASE_MIN_FD:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = increase_min_fd((unsigned int)arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.26.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds
  2020-04-14  2:14 [PATCH v4 0/3] Support userspace-selected fds Josh Triplett
  2020-04-14  2:15 ` [PATCH v4 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation Josh Triplett
@ 2020-04-14  2:15 ` Josh Triplett
  2020-04-19 10:44   ` Aleksa Sarai
       [not found]   ` <20200427135210.GB5770@shao2-debian>
  2020-04-14  2:16 ` [PATCH v4 3/3] fs: pipe2: Support O_SPECIFIC_FD Josh Triplett
  2 siblings, 2 replies; 13+ messages in thread
From: Josh Triplett @ 2020-04-14  2:15 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, io-uring, linux-arch
  Cc: Alexander Viro, Arnd Bergmann, Jens Axboe, Aleksa Sarai

Inspired by the X protocol's handling of XIDs, allow userspace to select
the file descriptor opened by openat2, so that it can use the resulting
file descriptor in subsequent system calls without waiting for the
response to openat2.

In io_uring, this allows sequences like openat2/read/close without
waiting for the openat2 to complete. Multiple such sequences can
overlap, as long as each uses a distinct file descriptor.

Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
by openat2 for now (ignored by open/openat like all unknown flags). Add
an fd field to struct open_how (along with appropriate padding, and
verify that the padding is 0 to allow replacing the padding with a field
in the future).

The file table has a corresponding new function
get_specific_unused_fd_flags, which gets the specified file descriptor
if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
to get_unused_fd_flags, to simplify callers.

The specified file descriptor must not already be open; if it is,
get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
userspace errors.

When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
specified file descriptor rather than finding the lowest available one.

Test program:

    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct open_how how = {
	    .flags = O_RDONLY | O_SPECIFIC_FD,
	    .fd = 42
	};
        int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
        if (fd < 0)
            err(1, "openat2");
        printf("fd=%d\n", fd); // prints fd=42
        return 0;
    }

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
 fs/fcntl.c                                    |  2 +-
 fs/file.c                                     | 39 +++++++++++++++++++
 fs/io_uring.c                                 |  3 +-
 fs/open.c                                     |  8 +++-
 include/linux/fcntl.h                         |  5 ++-
 include/linux/file.h                          |  3 ++
 include/uapi/asm-generic/fcntl.h              |  4 ++
 include/uapi/linux/openat2.h                  |  3 ++
 tools/testing/selftests/openat2/helpers.c     |  2 +-
 tools/testing/selftests/openat2/helpers.h     | 21 +++++++---
 .../testing/selftests/openat2/openat2_test.c  | 29 +++++++++++++-
 11 files changed, 105 insertions(+), 14 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..0357ad667563 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index ba06140d89af..0674c3a2d3a5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)
 
 EXPORT_SYMBOL(put_unused_fd);
 
+int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
+				   unsigned long nofile)
+{
+	int ret;
+	struct fdtable *fdt;
+	struct files_struct *files = current->files;
+
+	if (!(flags & O_SPECIFIC_FD) || fd == UINT_MAX)
+		return __get_unused_fd_flags(flags, nofile);
+
+	if (fd >= nofile)
+		return -EBADF;
+
+	spin_lock(&files->file_lock);
+	ret = expand_files(files, fd);
+	if (unlikely(ret < 0))
+		goto out_unlock;
+	fdt = files_fdtable(files);
+	if (fdt->fd[fd]) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+	__set_open_fd(fd, fdt);
+	if (flags & O_CLOEXEC)
+		__set_close_on_exec(fd, fdt);
+	else
+		__clear_close_on_exec(fd, fdt);
+	ret = fd;
+
+out_unlock:
+	spin_unlock(&files->file_lock);
+	return ret;
+}
+
+int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
+{
+	return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
+}
+
 /*
  * Install a file pointer in the fd array.
  *
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 358f97be9c7b..4a69e1daf3fe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
 	if (ret)
 		goto err;
 
-	ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
+	ret = __get_specific_unused_fd_flags(req->open.how.fd,
+			req->open.how.flags, req->open.nofile);
 	if (ret < 0)
 		goto err;
 
diff --git a/fs/open.c b/fs/open.c
index 719b320ede52..c1c2dd2d408d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -962,6 +962,8 @@ inline struct open_how build_open_how(int flags, umode_t mode)
 		.mode = mode & S_IALLUGO,
 	};
 
+	/* O_SPECIFIC_FD doesn't work in open calls that use build_open_how. */
+	how.flags &= ~O_SPECIFIC_FD;
 	/* O_PATH beats everything else. */
 	if (how.flags & O_PATH)
 		how.flags &= O_PATH_FLAGS;
@@ -989,6 +991,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		return -EINVAL;
 	if (how->resolve & ~VALID_RESOLVE_FLAGS)
 		return -EINVAL;
+	if (how->pad != 0)
+		return -EINVAL;
+	if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
+		return -EINVAL;
 
 	/* Deal with the mode. */
 	if (WILL_CREATE(flags)) {
@@ -1143,7 +1149,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
 	if (IS_ERR(tmp))
 		return PTR_ERR(tmp);
 
-	fd = get_unused_fd_flags(how->flags);
+	fd = get_specific_unused_fd_flags(how->fd, how->flags);
 	if (fd >= 0) {
 		struct file *f = do_filp_open(dfd, tmp, &op);
 		if (IS_ERR(f)) {
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..728849bcd8fa 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
@@ -23,7 +23,8 @@
 
 /* List of all open_how "versions". */
 #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
-#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
+#define OPEN_HOW_SIZE_VER1	32 /* added fd and pad */
+#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER1
 
 #ifndef force_o_largefile
 #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/linux/file.h b/include/linux/file.h
index b67986f818d2..a63301864a36 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
 extern int get_unused_fd_flags(unsigned flags);
+extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
+	unsigned long nofile);
+extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
 extern void put_unused_fd(unsigned int fd);
 extern unsigned int increase_min_fd(unsigned int num);
 
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..d3de5b8b3955 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_SPECIFIC_FD
+#define O_SPECIFIC_FD	01000000000	/* open as specified fd */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index 58b1eb711360..63974a276fb2 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -15,11 +15,14 @@
  * @flags: O_* flags.
  * @mode: O_CREAT/O_TMPFILE file mode.
  * @resolve: RESOLVE_* flags.
+ * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
  */
 struct open_how {
 	__u64 flags;
 	__u64 mode;
 	__u64 resolve;
+	__u32 fd;
+	__u32 pad; /* Must be 0 in the current version */
 };
 
 /* how->resolve flags for openat2(2). */
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
index 5074681ffdc9..b6533f0b1124 100644
--- a/tools/testing/selftests/openat2/helpers.c
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -98,7 +98,7 @@ void __attribute__((constructor)) init(void)
 	struct open_how how = {};
 	int fd;
 
-	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
+	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER1);
 
 	/* Check openat2(2) support. */
 	fd = sys_openat2(AT_FDCWD, ".", &how);
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
index a6ea27344db2..b7dea87c17b9 100644
--- a/tools/testing/selftests/openat2/helpers.h
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -24,28 +24,37 @@
 #endif /* SYS_openat2 */
 
 /*
- * Arguments for how openat2(2) should open the target path. If @resolve is
- * zero, then openat2(2) operates very similarly to openat(2).
+ * Arguments for how openat2(2) should open the target path. If only @flags and
+ * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
  *
- * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
- * than being silently ignored. @mode must be zero unless one of {O_CREAT,
- * O_TMPFILE} are set.
+ * However, unlike openat(2), unknown or invalid bits in @flags result in
+ * -EINVAL rather than being silently ignored. @mode must be zero unless one of
+ * {O_CREAT, O_TMPFILE} are set.
  *
  * @flags: O_* flags.
  * @mode: O_CREAT/O_TMPFILE file mode.
  * @resolve: RESOLVE_* flags.
+ * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
  */
 struct open_how {
 	__u64 flags;
 	__u64 mode;
 	__u64 resolve;
+	__u32 fd;
+	__u32 pad; /* Must be 0 in the current version */
 };
 
+/* List of all open_how "versions". */
 #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
-#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
+#define OPEN_HOW_SIZE_VER1	32 /* added fd and pad */
+#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER1
 
 bool needs_openat2(const struct open_how *how);
 
+#ifndef O_SPECIFIC_FD
+#define O_SPECIFIC_FD  01000000000
+#endif
+
 #ifndef RESOLVE_IN_ROOT
 /* how->resolve flags for openat2(2). */
 #define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
index b386367c606b..cf21a83c70c9 100644
--- a/tools/testing/selftests/openat2/openat2_test.c
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -40,7 +40,7 @@ struct struct_test {
 	int err;
 };
 
-#define NUM_OPENAT2_STRUCT_TESTS 7
+#define NUM_OPENAT2_STRUCT_TESTS 8
 #define NUM_OPENAT2_STRUCT_VARIATIONS 13
 
 void test_openat2_struct(void)
@@ -52,6 +52,9 @@ void test_openat2_struct(void)
 		{ .name = "normal struct",
 		  .arg.inner.flags = O_RDONLY,
 		  .size = sizeof(struct open_how) },
+		{ .name = "v0 struct",
+		  .arg.inner.flags = O_RDONLY,
+		  .size = OPEN_HOW_SIZE_VER0 },
 		/* Bigger struct, with zeroed out end. */
 		{ .name = "bigger struct (zeroed out)",
 		  .arg.inner.flags = O_RDONLY,
@@ -155,7 +158,7 @@ struct flag_test {
 	int err;
 };
 
-#define NUM_OPENAT2_FLAG_TESTS 23
+#define NUM_OPENAT2_FLAG_TESTS 29
 
 void test_openat2_flags(void)
 {
@@ -223,6 +226,24 @@ void test_openat2_flags(void)
 		{ .name = "invalid how.resolve and O_PATH",
 		  .how.flags = O_PATH,
 		  .how.resolve = 0x1337, .err = -EINVAL },
+
+		/* O_SPECIFIC_FD tests */
+		{ .name = "O_SPECIFIC_FD",
+		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42 },
+		{ .name = "O_SPECIFIC_FD if fd exists",
+		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 2,
+		  .err = -EBUSY },
+		{ .name = "O_SPECIFIC_FD with fd -1",
+		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = -1 },
+		{ .name = "fd without O_SPECIFIC_FD",
+		  .how.flags = O_RDONLY, .how.fd = 42,
+		  .err = -EINVAL },
+		{ .name = "fd -1 without O_SPECIFIC_FD",
+		  .how.flags = O_RDONLY, .how.fd = -1,
+		  .err = -EINVAL },
+		{ .name = "existing fd without O_SPECIFIC_FD",
+		  .how.flags = O_RDONLY, .how.fd = 2,
+		  .err = -EINVAL },
 	};
 
 	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
@@ -268,6 +289,10 @@ void test_openat2_flags(void)
 			if (!(test->how.flags & O_LARGEFILE))
 				fdflags &= ~O_LARGEFILE;
 			failed |= (fdflags != test->how.flags);
+
+			if (test->how.flags & O_SPECIFIC_FD
+			    && test->how.fd != -1)
+				failed |= (fd != test->how.fd);
 		}
 
 		if (failed) {
-- 
2.26.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v4 3/3] fs: pipe2: Support O_SPECIFIC_FD
  2020-04-14  2:14 [PATCH v4 0/3] Support userspace-selected fds Josh Triplett
  2020-04-14  2:15 ` [PATCH v4 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation Josh Triplett
  2020-04-14  2:15 ` [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds Josh Triplett
@ 2020-04-14  2:16 ` Josh Triplett
  2 siblings, 0 replies; 13+ messages in thread
From: Josh Triplett @ 2020-04-14  2:16 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, io-uring, linux-arch
  Cc: Alexander Viro, Arnd Bergmann, Jens Axboe, Aleksa Sarai

This allows the caller of pipe2 to specify one or both file descriptors
rather than having them automatically use the lowest available file
descriptor. The caller can specify either file descriptor as -1 to
allow that file descriptor to use the lowest available.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
---
 fs/pipe.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 16fb72e9abf7..4681a0d1d587 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -936,19 +936,19 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
 	int error;
 	int fdw, fdr;
 
-	if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT))
+	if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_SPECIFIC_FD))
 		return -EINVAL;
 
 	error = create_pipe_files(files, flags);
 	if (error)
 		return error;
 
-	error = get_unused_fd_flags(flags);
+	error = get_specific_unused_fd_flags(fd[0], flags);
 	if (error < 0)
 		goto err_read_pipe;
 	fdr = error;
 
-	error = get_unused_fd_flags(flags);
+	error = get_specific_unused_fd_flags(fd[1], flags);
 	if (error < 0)
 		goto err_fdr;
 	fdw = error;
@@ -969,7 +969,11 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
 int do_pipe_flags(int *fd, int flags)
 {
 	struct file *files[2];
-	int error = __do_pipe_flags(fd, files, flags);
+	int error;
+
+	if (flags & O_SPECIFIC_FD)
+		return -EINVAL;
+	error = __do_pipe_flags(fd, files, flags);
 	if (!error) {
 		fd_install(fd[0], files[0]);
 		fd_install(fd[1], files[1]);
@@ -987,6 +991,10 @@ static int do_pipe2(int __user *fildes, int flags)
 	int fd[2];
 	int error;
 
+	if (flags & O_SPECIFIC_FD)
+		if (copy_from_user(fd, fildes, sizeof(fd)))
+			return -EFAULT;
+
 	error = __do_pipe_flags(fd, files, flags);
 	if (!error) {
 		if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
-- 
2.26.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds
  2020-04-14  2:15 ` [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds Josh Triplett
@ 2020-04-19 10:44   ` Aleksa Sarai
  2020-04-19 21:18     ` David Laight
                       ` (2 more replies)
       [not found]   ` <20200427135210.GB5770@shao2-debian>
  1 sibling, 3 replies; 13+ messages in thread
From: Aleksa Sarai @ 2020-04-19 10:44 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-fsdevel, linux-kernel, io-uring, linux-arch,
	Alexander Viro, Arnd Bergmann, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 15366 bytes --]

On 2020-04-13, Josh Triplett <josh@joshtriplett.org> wrote:
> Inspired by the X protocol's handling of XIDs, allow userspace to select
> the file descriptor opened by openat2, so that it can use the resulting
> file descriptor in subsequent system calls without waiting for the
> response to openat2.
> 
> In io_uring, this allows sequences like openat2/read/close without
> waiting for the openat2 to complete. Multiple such sequences can
> overlap, as long as each uses a distinct file descriptor.

I'm not sure I understand this explanation -- how can you trigger a
syscall with an fd that hasn't yet been registered (unless you're just
hoping the race goes in your favour)?

> Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
> by openat2 for now (ignored by open/openat like all unknown flags). Add
> an fd field to struct open_how (along with appropriate padding, and
> verify that the padding is 0 to allow replacing the padding with a field
> in the future).
> 
> The file table has a corresponding new function
> get_specific_unused_fd_flags, which gets the specified file descriptor
> if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
> to get_unused_fd_flags, to simplify callers.
> 
> The specified file descriptor must not already be open; if it is,
> get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
> userspace errors.
> 
> When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
> specified file descriptor rather than finding the lowest available one.

I still don't like that you can enable this feature with O_SPECIFIC_FD
but then disable it by specifying fd as -1. I understand why this is
needed for pipe2() and socketpair() and that's totally fine, but I don't
think it makes sense for openat2() or other interfaces where there's
only one fd being returned -- what does it mean to say "give me a
specific fd, but actually I don't care what it is"?

I know this is a trade-off between consistency of O_SPECIFIC_FD
interfaces and having wart-less interfaces for each syscall, but I don't
think it breaks consistency to say "syscalls that only give you one fd
don't have a second way of disabling the feature -- just don't pass
O_SPECIFIC_FD".

> Test program:
> 
>     #include <err.h>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     int main(void)
>     {
>         struct open_how how = {
> 	    .flags = O_RDONLY | O_SPECIFIC_FD,
> 	    .fd = 42
> 	};
>         int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
>         if (fd < 0)
>             err(1, "openat2");
>         printf("fd=%d\n", fd); // prints fd=42
>         return 0;
>     }
> 
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>
> ---
>  fs/fcntl.c                                    |  2 +-
>  fs/file.c                                     | 39 +++++++++++++++++++
>  fs/io_uring.c                                 |  3 +-
>  fs/open.c                                     |  8 +++-
>  include/linux/fcntl.h                         |  5 ++-
>  include/linux/file.h                          |  3 ++
>  include/uapi/asm-generic/fcntl.h              |  4 ++
>  include/uapi/linux/openat2.h                  |  3 ++
>  tools/testing/selftests/openat2/helpers.c     |  2 +-
>  tools/testing/selftests/openat2/helpers.h     | 21 +++++++---
>  .../testing/selftests/openat2/openat2_test.c  | 29 +++++++++++++-
>  11 files changed, 105 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 2e4c0fa2074b..0357ad667563 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
>  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>  	 * is defined as O_NONBLOCK on some platforms and not on others.
>  	 */
> -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
>  		HWEIGHT32(
>  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>  			__FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/file.c b/fs/file.c
> index ba06140d89af..0674c3a2d3a5 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)
>  
>  EXPORT_SYMBOL(put_unused_fd);
>  
> +int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> +				   unsigned long nofile)
> +{
> +	int ret;
> +	struct fdtable *fdt;
> +	struct files_struct *files = current->files;
> +
> +	if (!(flags & O_SPECIFIC_FD) || fd == UINT_MAX)
> +		return __get_unused_fd_flags(flags, nofile);
> +
> +	if (fd >= nofile)
> +		return -EBADF;
> +
> +	spin_lock(&files->file_lock);
> +	ret = expand_files(files, fd);
> +	if (unlikely(ret < 0))
> +		goto out_unlock;
> +	fdt = files_fdtable(files);
> +	if (fdt->fd[fd]) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +	__set_open_fd(fd, fdt);
> +	if (flags & O_CLOEXEC)
> +		__set_close_on_exec(fd, fdt);
> +	else
> +		__clear_close_on_exec(fd, fdt);
> +	ret = fd;
> +
> +out_unlock:
> +	spin_unlock(&files->file_lock);
> +	return ret;
> +}
> +
> +int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
> +{
> +	return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
> +}
> +
>  /*
>   * Install a file pointer in the fd array.
>   *
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 358f97be9c7b..4a69e1daf3fe 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
>  	if (ret)
>  		goto err;
>  
> -	ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
> +	ret = __get_specific_unused_fd_flags(req->open.how.fd,
> +			req->open.how.flags, req->open.nofile);
>  	if (ret < 0)
>  		goto err;
>  
> diff --git a/fs/open.c b/fs/open.c
> index 719b320ede52..c1c2dd2d408d 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -962,6 +962,8 @@ inline struct open_how build_open_how(int flags, umode_t mode)
>  		.mode = mode & S_IALLUGO,
>  	};
>  
> +	/* O_SPECIFIC_FD doesn't work in open calls that use build_open_how. */
> +	how.flags &= ~O_SPECIFIC_FD;
>  	/* O_PATH beats everything else. */
>  	if (how.flags & O_PATH)
>  		how.flags &= O_PATH_FLAGS;
> @@ -989,6 +991,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>  		return -EINVAL;
>  	if (how->resolve & ~VALID_RESOLVE_FLAGS)
>  		return -EINVAL;
> +	if (how->pad != 0)
> +		return -EINVAL;
> +	if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
> +		return -EINVAL;
>  
>  	/* Deal with the mode. */
>  	if (WILL_CREATE(flags)) {
> @@ -1143,7 +1149,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
>  	if (IS_ERR(tmp))
>  		return PTR_ERR(tmp);
>  
> -	fd = get_unused_fd_flags(how->flags);
> +	fd = get_specific_unused_fd_flags(how->fd, how->flags);
>  	if (fd >= 0) {
>  		struct file *f = do_filp_open(dfd, tmp, &op);
>  		if (IS_ERR(f)) {
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 7bcdcf4f6ab2..728849bcd8fa 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -10,7 +10,7 @@
>  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
>  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)
>  
>  /* List of all valid flags for the how->upgrade_mask argument: */
>  #define VALID_UPGRADE_FLAGS \
> @@ -23,7 +23,8 @@
>  
>  /* List of all open_how "versions". */
>  #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
> -#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
> +#define OPEN_HOW_SIZE_VER1	32 /* added fd and pad */
> +#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER1
>  
>  #ifndef force_o_largefile
>  #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> diff --git a/include/linux/file.h b/include/linux/file.h
> index b67986f818d2..a63301864a36 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
>  extern bool get_close_on_exec(unsigned int fd);
>  extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
>  extern int get_unused_fd_flags(unsigned flags);
> +extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> +	unsigned long nofile);
> +extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
>  extern void put_unused_fd(unsigned int fd);
>  extern unsigned int increase_min_fd(unsigned int num);
>  
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..d3de5b8b3955 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -89,6 +89,10 @@
>  #define __O_TMPFILE	020000000
>  #endif
>  
> +#ifndef O_SPECIFIC_FD
> +#define O_SPECIFIC_FD	01000000000	/* open as specified fd */
> +#endif
> +
>  /* a horrid kludge trying to make sure that this will fail on old kernels */
>  #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>  #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> index 58b1eb711360..63974a276fb2 100644
> --- a/include/uapi/linux/openat2.h
> +++ b/include/uapi/linux/openat2.h
> @@ -15,11 +15,14 @@
>   * @flags: O_* flags.
>   * @mode: O_CREAT/O_TMPFILE file mode.
>   * @resolve: RESOLVE_* flags.
> + * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
>   */
>  struct open_how {
>  	__u64 flags;
>  	__u64 mode;
>  	__u64 resolve;
> +	__u32 fd;
> +	__u32 pad; /* Must be 0 in the current version */
>  };
>  
>  /* how->resolve flags for openat2(2). */
> diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
> index 5074681ffdc9..b6533f0b1124 100644
> --- a/tools/testing/selftests/openat2/helpers.c
> +++ b/tools/testing/selftests/openat2/helpers.c
> @@ -98,7 +98,7 @@ void __attribute__((constructor)) init(void)
>  	struct open_how how = {};
>  	int fd;
>  
> -	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
> +	BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER1);
>  
>  	/* Check openat2(2) support. */
>  	fd = sys_openat2(AT_FDCWD, ".", &how);
> diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
> index a6ea27344db2..b7dea87c17b9 100644
> --- a/tools/testing/selftests/openat2/helpers.h
> +++ b/tools/testing/selftests/openat2/helpers.h
> @@ -24,28 +24,37 @@
>  #endif /* SYS_openat2 */
>  
>  /*
> - * Arguments for how openat2(2) should open the target path. If @resolve is
> - * zero, then openat2(2) operates very similarly to openat(2).
> + * Arguments for how openat2(2) should open the target path. If only @flags and
> + * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
>   *
> - * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> - * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> - * O_TMPFILE} are set.
> + * However, unlike openat(2), unknown or invalid bits in @flags result in
> + * -EINVAL rather than being silently ignored. @mode must be zero unless one of
> + * {O_CREAT, O_TMPFILE} are set.
>   *
>   * @flags: O_* flags.
>   * @mode: O_CREAT/O_TMPFILE file mode.
>   * @resolve: RESOLVE_* flags.
> + * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
>   */
>  struct open_how {
>  	__u64 flags;
>  	__u64 mode;
>  	__u64 resolve;
> +	__u32 fd;
> +	__u32 pad; /* Must be 0 in the current version */

Small nit: This field should be called __padding to make it more
explicit it's something internal and shouldn't be looked at by
userspace. And the comment should just be "must be zeroed".

>  };
>  
> +/* List of all open_how "versions". */
>  #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
> -#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER0
> +#define OPEN_HOW_SIZE_VER1	32 /* added fd and pad */
> +#define OPEN_HOW_SIZE_LATEST	OPEN_HOW_SIZE_VER1
>  
>  bool needs_openat2(const struct open_how *how);
>  
> +#ifndef O_SPECIFIC_FD
> +#define O_SPECIFIC_FD  01000000000
> +#endif
> +
>  #ifndef RESOLVE_IN_ROOT
>  /* how->resolve flags for openat2(2). */
>  #define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
> diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
> index b386367c606b..cf21a83c70c9 100644
> --- a/tools/testing/selftests/openat2/openat2_test.c
> +++ b/tools/testing/selftests/openat2/openat2_test.c
> @@ -40,7 +40,7 @@ struct struct_test {
>  	int err;
>  };
>  
> -#define NUM_OPENAT2_STRUCT_TESTS 7
> +#define NUM_OPENAT2_STRUCT_TESTS 8
>  #define NUM_OPENAT2_STRUCT_VARIATIONS 13
>  
>  void test_openat2_struct(void)
> @@ -52,6 +52,9 @@ void test_openat2_struct(void)
>  		{ .name = "normal struct",
>  		  .arg.inner.flags = O_RDONLY,
>  		  .size = sizeof(struct open_how) },
> +		{ .name = "v0 struct",
> +		  .arg.inner.flags = O_RDONLY,
> +		  .size = OPEN_HOW_SIZE_VER0 },
>  		/* Bigger struct, with zeroed out end. */
>  		{ .name = "bigger struct (zeroed out)",
>  		  .arg.inner.flags = O_RDONLY,
> @@ -155,7 +158,7 @@ struct flag_test {
>  	int err;
>  };
>  
> -#define NUM_OPENAT2_FLAG_TESTS 23
> +#define NUM_OPENAT2_FLAG_TESTS 29
>  
>  void test_openat2_flags(void)
>  {
> @@ -223,6 +226,24 @@ void test_openat2_flags(void)
>  		{ .name = "invalid how.resolve and O_PATH",
>  		  .how.flags = O_PATH,
>  		  .how.resolve = 0x1337, .err = -EINVAL },
> +
> +		/* O_SPECIFIC_FD tests */
> +		{ .name = "O_SPECIFIC_FD",
> +		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42 },
> +		{ .name = "O_SPECIFIC_FD if fd exists",
> +		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 2,
> +		  .err = -EBUSY },
> +		{ .name = "O_SPECIFIC_FD with fd -1",
> +		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = -1 },
> +		{ .name = "fd without O_SPECIFIC_FD",
> +		  .how.flags = O_RDONLY, .how.fd = 42,
> +		  .err = -EINVAL },
> +		{ .name = "fd -1 without O_SPECIFIC_FD",
> +		  .how.flags = O_RDONLY, .how.fd = -1,
> +		  .err = -EINVAL },
> +		{ .name = "existing fd without O_SPECIFIC_FD",
> +		  .how.flags = O_RDONLY, .how.fd = 2,
> +		  .err = -EINVAL },

It would be good to add a test to make sure that a non-zero value of
how->__padding also gives -EINVAL.

>  	};
>  
>  	BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
> @@ -268,6 +289,10 @@ void test_openat2_flags(void)
>  			if (!(test->how.flags & O_LARGEFILE))
>  				fdflags &= ~O_LARGEFILE;
>  			failed |= (fdflags != test->how.flags);
> +
> +			if (test->how.flags & O_SPECIFIC_FD
> +			    && test->how.fd != -1)
> +				failed |= (fd != test->how.fd);
>  		}
>  
>  		if (failed) {
> -- 
> 2.26.0
> 


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds
  2020-04-19 10:44   ` Aleksa Sarai
@ 2020-04-19 21:18     ` David Laight
  2020-04-19 22:22     ` Jens Axboe
  2020-04-20 21:14     ` Josh Triplett
  2 siblings, 0 replies; 13+ messages in thread
From: David Laight @ 2020-04-19 21:18 UTC (permalink / raw)
  To: 'Aleksa Sarai', Josh Triplett
  Cc: linux-fsdevel, linux-kernel, io-uring, linux-arch,
	Alexander Viro, Arnd Bergmann, Jens Axboe

From: Aleksa Sarai
> Sent: 19 April 2020 11:44
> 
> On 2020-04-13, Josh Triplett <josh@joshtriplett.org> wrote:
> > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > the file descriptor opened by openat2, so that it can use the resulting
> > file descriptor in subsequent system calls without waiting for the
> > response to openat2.
> >
> > In io_uring, this allows sequences like openat2/read/close without
> > waiting for the openat2 to complete. Multiple such sequences can
> > overlap, as long as each uses a distinct file descriptor.
> 
> I'm not sure I understand this explanation -- how can you trigger a
> syscall with an fd that hasn't yet been registered (unless you're just
> hoping the race goes in your favour)?

I suspect (there are no comments in the io_uring code to say what it does)
that the io_uring code uses a thread of the user process to sequentially
execute IO requests that the main application has added to a queue.

So it might make sense to queue up open/read/close.
But that ought to be within the io_uring code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds
  2020-04-19 10:44   ` Aleksa Sarai
  2020-04-19 21:18     ` David Laight
@ 2020-04-19 22:22     ` Jens Axboe
  2020-04-20  2:06       ` Aleksa Sarai
  2020-04-20 21:14     ` Josh Triplett
  2 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2020-04-19 22:22 UTC (permalink / raw)
  To: Aleksa Sarai, Josh Triplett
  Cc: linux-fsdevel, linux-kernel, io-uring, linux-arch,
	Alexander Viro, Arnd Bergmann

On 4/19/20 4:44 AM, Aleksa Sarai wrote:
> On 2020-04-13, Josh Triplett <josh@joshtriplett.org> wrote:
>> Inspired by the X protocol's handling of XIDs, allow userspace to select
>> the file descriptor opened by openat2, so that it can use the resulting
>> file descriptor in subsequent system calls without waiting for the
>> response to openat2.
>>
>> In io_uring, this allows sequences like openat2/read/close without
>> waiting for the openat2 to complete. Multiple such sequences can
>> overlap, as long as each uses a distinct file descriptor.
> 
> I'm not sure I understand this explanation -- how can you trigger a
> syscall with an fd that hasn't yet been registered (unless you're just
> hoping the race goes in your favour)?

io_uring can do chains of requests, where each link in the chain isn't
started until the previous one has completed. Hence if you know what fd
that openat2 will return, you can submit a chain ala:

<open file X, give me fd Y><read from fd Y><close fd Y>

as a single submission. This isn't possible to do currently, as the read
will depend on the output of the open, and we have no good way of
knowing what that fd will be.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds
  2020-04-19 22:22     ` Jens Axboe
@ 2020-04-20  2:06       ` Aleksa Sarai
  0 siblings, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2020-04-20  2:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Josh Triplett, linux-fsdevel, linux-kernel, io-uring, linux-arch,
	Alexander Viro, Arnd Bergmann

[-- Attachment #1: Type: text/plain, Size: 1545 bytes --]

On 2020-04-19, Jens Axboe <axboe@kernel.dk> wrote:
> On 4/19/20 4:44 AM, Aleksa Sarai wrote:
> > On 2020-04-13, Josh Triplett <josh@joshtriplett.org> wrote:
> >> Inspired by the X protocol's handling of XIDs, allow userspace to select
> >> the file descriptor opened by openat2, so that it can use the resulting
> >> file descriptor in subsequent system calls without waiting for the
> >> response to openat2.
> >>
> >> In io_uring, this allows sequences like openat2/read/close without
> >> waiting for the openat2 to complete. Multiple such sequences can
> >> overlap, as long as each uses a distinct file descriptor.
> > 
> > I'm not sure I understand this explanation -- how can you trigger a
> > syscall with an fd that hasn't yet been registered (unless you're just
> > hoping the race goes in your favour)?
> 
> io_uring can do chains of requests, where each link in the chain isn't
> started until the previous one has completed. Hence if you know what fd
> that openat2 will return, you can submit a chain ala:
> 
> <open file X, give me fd Y><read from fd Y><close fd Y>
> 
> as a single submission. This isn't possible to do currently, as the read
> will depend on the output of the open, and we have no good way of
> knowing what that fd will be.

Ah! I was aware of io_uring's chaining feature but thought it had access
to the return of the previous stage -- now this makes much more sense.
Thanks.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds
  2020-04-19 10:44   ` Aleksa Sarai
  2020-04-19 21:18     ` David Laight
  2020-04-19 22:22     ` Jens Axboe
@ 2020-04-20 21:14     ` Josh Triplett
  2 siblings, 0 replies; 13+ messages in thread
From: Josh Triplett @ 2020-04-20 21:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: linux-fsdevel, linux-kernel, io-uring, linux-arch,
	Alexander Viro, Arnd Bergmann, Jens Axboe

On Sun, Apr 19, 2020 at 08:44:04PM +1000, Aleksa Sarai wrote:
> On 2020-04-13, Josh Triplett <josh@joshtriplett.org> wrote:
> > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > the file descriptor opened by openat2, so that it can use the resulting
> > file descriptor in subsequent system calls without waiting for the
> > response to openat2.
> > 
> > In io_uring, this allows sequences like openat2/read/close without
> > waiting for the openat2 to complete. Multiple such sequences can
> > overlap, as long as each uses a distinct file descriptor.
> 
> I'm not sure I understand this explanation -- how can you trigger a
> syscall with an fd that hasn't yet been registered (unless you're just
> hoping the race goes in your favour)?

See the response from Jens for an explanation of how this works in
io_uring.

> > Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
> > by openat2 for now (ignored by open/openat like all unknown flags). Add
> > an fd field to struct open_how (along with appropriate padding, and
> > verify that the padding is 0 to allow replacing the padding with a field
> > in the future).
> > 
> > The file table has a corresponding new function
> > get_specific_unused_fd_flags, which gets the specified file descriptor
> > if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
> > to get_unused_fd_flags, to simplify callers.
> > 
> > The specified file descriptor must not already be open; if it is,
> > get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
> > userspace errors.
> > 
> > When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
> > specified file descriptor rather than finding the lowest available one.
> 
> I still don't like that you can enable this feature with O_SPECIFIC_FD
> but then disable it by specifying fd as -1. I understand why this is
> needed for pipe2() and socketpair() and that's totally fine, but I don't
> think it makes sense for openat2() or other interfaces where there's
> only one fd being returned -- what does it mean to say "give me a
> specific fd, but actually I don't care what it is"?
> 
> I know this is a trade-off between consistency of O_SPECIFIC_FD
> interfaces and having wart-less interfaces for each syscall, but I don't
> think it breaks consistency to say "syscalls that only give you one fd
> don't have a second way of disabling the feature -- just don't pass
> O_SPECIFIC_FD".

I think there's value in the orthogonality, and -1 can never be a valid
file descriptor. If this becomes a sticking point, it could certainly be
changed (just modify pipe2 to remove the O_SPECIFIC_FD flag if passed
-1), but at the same time, I'd rather have this logic implemented once
with a uniform semantic no matter what syscall uses it.

> >  struct open_how {
> >  	__u64 flags;
> >  	__u64 mode;
> >  	__u64 resolve;
> > +	__u32 fd;
> > +	__u32 pad; /* Must be 0 in the current version */
> 
> Small nit: This field should be called __padding to make it more
> explicit it's something internal and shouldn't be looked at by
> userspace. And the comment should just be "must be zeroed".

Good point. Done in v5.

> > --- a/tools/testing/selftests/openat2/openat2_test.c
> > +++ b/tools/testing/selftests/openat2/openat2_test.c
> > @@ -40,7 +40,7 @@ struct struct_test {
> >  	int err;
> >  };
> >  
> > -#define NUM_OPENAT2_STRUCT_TESTS 7
> > +#define NUM_OPENAT2_STRUCT_TESTS 8
> >  #define NUM_OPENAT2_STRUCT_VARIATIONS 13
> >  
> >  void test_openat2_struct(void)
> > @@ -52,6 +52,9 @@ void test_openat2_struct(void)
> >  		{ .name = "normal struct",
> >  		  .arg.inner.flags = O_RDONLY,
> >  		  .size = sizeof(struct open_how) },
> > +		{ .name = "v0 struct",
> > +		  .arg.inner.flags = O_RDONLY,
> > +		  .size = OPEN_HOW_SIZE_VER0 },
> >  		/* Bigger struct, with zeroed out end. */
> >  		{ .name = "bigger struct (zeroed out)",
> >  		  .arg.inner.flags = O_RDONLY,
> > @@ -155,7 +158,7 @@ struct flag_test {
> >  	int err;
> >  };
> >  
> > -#define NUM_OPENAT2_FLAG_TESTS 23
> > +#define NUM_OPENAT2_FLAG_TESTS 29
> >  
> >  void test_openat2_flags(void)
> >  {
> > @@ -223,6 +226,24 @@ void test_openat2_flags(void)
> >  		{ .name = "invalid how.resolve and O_PATH",
> >  		  .how.flags = O_PATH,
> >  		  .how.resolve = 0x1337, .err = -EINVAL },
> > +
> > +		/* O_SPECIFIC_FD tests */
> > +		{ .name = "O_SPECIFIC_FD",
> > +		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42 },
> > +		{ .name = "O_SPECIFIC_FD if fd exists",
> > +		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 2,
> > +		  .err = -EBUSY },
> > +		{ .name = "O_SPECIFIC_FD with fd -1",
> > +		  .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = -1 },
> > +		{ .name = "fd without O_SPECIFIC_FD",
> > +		  .how.flags = O_RDONLY, .how.fd = 42,
> > +		  .err = -EINVAL },
> > +		{ .name = "fd -1 without O_SPECIFIC_FD",
> > +		  .how.flags = O_RDONLY, .how.fd = -1,
> > +		  .err = -EINVAL },
> > +		{ .name = "existing fd without O_SPECIFIC_FD",
> > +		  .how.flags = O_RDONLY, .how.fd = 2,
> > +		  .err = -EINVAL },
> 
> It would be good to add a test to make sure that a non-zero value of
> how->__padding also gives -EINVAL.

Done in v5.

- Josh Triplett

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LTP] [fs] ce436509a8: ltp.openat203.fail
       [not found]   ` <20200427135210.GB5770@shao2-debian>
@ 2020-04-27 14:27     ` Cyril Hrubis
  2020-04-28  0:51       ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Cyril Hrubis @ 2020-04-27 14:27 UTC (permalink / raw)
  To: kernel test robot
  Cc: Josh Triplett, linux-arch, Jens Axboe, Arnd Bergmann,
	linux-kernel, lkp, Aleksa Sarai, Alexander Viro, linux-fsdevel,
	io-uring, ltp

Hi!
> commit: ce436509a8e109330c56bb4d8ec87d258788f5f4 ("[PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds")
> url: https://github.com/0day-ci/linux/commits/Josh-Triplett/Support-userspace-selected-fds/20200414-102939
> base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next

This commit adds fd parameter to the how structure where LTP test was
previously passing garbage, which obviously causes the difference in
errno.

This could be safely ignored for now, if the patch gets merged the test
needs to be updated.

-- 
chrubis@suse.cz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LTP] [fs] ce436509a8: ltp.openat203.fail
  2020-04-27 14:27     ` [LTP] [fs] ce436509a8: ltp.openat203.fail Cyril Hrubis
@ 2020-04-28  0:51       ` Aleksa Sarai
  2020-04-28 15:30         ` Cyril Hrubis
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2020-04-28  0:51 UTC (permalink / raw)
  To: Cyril Hrubis
  Cc: kernel test robot, Josh Triplett, linux-arch, Jens Axboe,
	Arnd Bergmann, linux-kernel, lkp, Alexander Viro, linux-fsdevel,
	io-uring, ltp

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]

On 2020-04-27, Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
> > commit: ce436509a8e109330c56bb4d8ec87d258788f5f4 ("[PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds")
> > url: https://github.com/0day-ci/linux/commits/Josh-Triplett/Support-userspace-selected-fds/20200414-102939
> > base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
> 
> This commit adds fd parameter to the how structure where LTP test was
> previously passing garbage, which obviously causes the difference in
> errno.
> 
> This could be safely ignored for now, if the patch gets merged the test
> needs to be updated.

It wouldn't be a bad idea to switch the test to figure out the ksize of
the struct, so that you only add bad padding after that. But then again,
this would be a bit ugly -- having CHECK_FIELDS would make this simpler.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LTP] [fs] ce436509a8: ltp.openat203.fail
  2020-04-28  0:51       ` Aleksa Sarai
@ 2020-04-28 15:30         ` Cyril Hrubis
  2020-04-28 15:35           ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Cyril Hrubis @ 2020-04-28 15:30 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: kernel test robot, Josh Triplett, linux-arch, Jens Axboe,
	Arnd Bergmann, linux-kernel, lkp, Alexander Viro, linux-fsdevel,
	io-uring, ltp

Hi!
> > > commit: ce436509a8e109330c56bb4d8ec87d258788f5f4 ("[PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds")
> > > url: https://github.com/0day-ci/linux/commits/Josh-Triplett/Support-userspace-selected-fds/20200414-102939
> > > base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
> > 
> > This commit adds fd parameter to the how structure where LTP test was
> > previously passing garbage, which obviously causes the difference in
> > errno.
> > 
> > This could be safely ignored for now, if the patch gets merged the test
> > needs to be updated.
> 
> It wouldn't be a bad idea to switch the test to figure out the ksize of
> the struct, so that you only add bad padding after that. But then again,
> this would be a bit ugly -- having CHECK_FIELDS would make this simpler.

Any pointers how can be the size figured out without relying on the
E2BIG we are testing for? Does the kernel export it somewhere?

-- 
Cyril Hrubis
chrubis@suse.cz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [LTP] [fs] ce436509a8: ltp.openat203.fail
  2020-04-28 15:30         ` Cyril Hrubis
@ 2020-04-28 15:35           ` Aleksa Sarai
  0 siblings, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2020-04-28 15:35 UTC (permalink / raw)
  To: Cyril Hrubis
  Cc: kernel test robot, Josh Triplett, linux-arch, Jens Axboe,
	Arnd Bergmann, linux-kernel, lkp, Alexander Viro, linux-fsdevel,
	io-uring, ltp

[-- Attachment #1: Type: text/plain, Size: 1501 bytes --]

On 2020-04-28, Cyril Hrubis <chrubis@suse.cz> wrote:
> Hi!
> > > > commit: ce436509a8e109330c56bb4d8ec87d258788f5f4 ("[PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds")
> > > > url: https://github.com/0day-ci/linux/commits/Josh-Triplett/Support-userspace-selected-fds/20200414-102939
> > > > base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
> > > 
> > > This commit adds fd parameter to the how structure where LTP test was
> > > previously passing garbage, which obviously causes the difference in
> > > errno.
> > > 
> > > This could be safely ignored for now, if the patch gets merged the test
> > > needs to be updated.
> > 
> > It wouldn't be a bad idea to switch the test to figure out the ksize of
> > the struct, so that you only add bad padding after that. But then again,
> > this would be a bit ugly -- having CHECK_FIELDS would make this simpler.
> 
> Any pointers how can be the size figured out without relying on the
> E2BIG we are testing for? Does the kernel export it somewhere?

No, you would have to effectively binary search on -E2BIG at the moment.
CHECK_FIELDS is a proposal I have which would allow you to get get the
size of the in-kernel struct, but it's still a proposal.

In theory you could get the size through BTF, but it's probably more
effort than it's worth to implement that.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-04-28 15:36 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-14  2:14 [PATCH v4 0/3] Support userspace-selected fds Josh Triplett
2020-04-14  2:15 ` [PATCH v4 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation Josh Triplett
2020-04-14  2:15 ` [PATCH v4 2/3] fs: openat2: Extend open_how to allow userspace-selected fds Josh Triplett
2020-04-19 10:44   ` Aleksa Sarai
2020-04-19 21:18     ` David Laight
2020-04-19 22:22     ` Jens Axboe
2020-04-20  2:06       ` Aleksa Sarai
2020-04-20 21:14     ` Josh Triplett
     [not found]   ` <20200427135210.GB5770@shao2-debian>
2020-04-27 14:27     ` [LTP] [fs] ce436509a8: ltp.openat203.fail Cyril Hrubis
2020-04-28  0:51       ` Aleksa Sarai
2020-04-28 15:30         ` Cyril Hrubis
2020-04-28 15:35           ` Aleksa Sarai
2020-04-14  2:16 ` [PATCH v4 3/3] fs: pipe2: Support O_SPECIFIC_FD Josh Triplett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).