linux-block.vger.kernel.org archive mirror
* [PATCHSET v2 0/4] io_uring: add support for accept(4)
@ 2019-10-25 17:30 Jens Axboe
  2019-10-25 17:30 ` [PATCH 1/4] io_uring: reorder struct sqe_submit Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Jens Axboe @ 2019-10-25 17:30 UTC (permalink / raw)
  To: linux-block; +Cc: davem, netdev, jannh

This series adds support for applications doing async accept()
through io_uring.

Patches 1+2 are just prep patches, adding support for inheriting a
process file table for commands.

Patch 3 abstracts out __sys_accept4_file(), which is the same as
__sys_accept4(), except it takes a struct file and extra file flags.
It should not introduce any functional changes.

And finally, patch 4 adds support for IORING_OP_ACCEPT. sqe->fd is
the file descriptor, sqe->addr holds a pointer to struct sockaddr,
sqe->addr2 holds a pointer to socklen_t, and sqe->accept_flags
holds the flags for accept4(2).
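
To illustrate the intended usage from userspace, here is a minimal
(untested) sketch of queueing one such accept, with liburing doing the
ring setup. The queue_accept() helper is made up for this example, and
the addr2/accept_flags fields of course need the uapi header from this
series:

  #include <liburing.h>
  #include <string.h>
  #include <sys/socket.h>

  /* Illustrative only: queue one IORING_OP_ACCEPT on a listening socket. */
  static int queue_accept(struct io_uring *ring, int listen_fd,
                          struct sockaddr_storage *addr, socklen_t *addrlen)
  {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

          if (!sqe)
                  return -1;                      /* SQ ring is full */
          memset(sqe, 0, sizeof(*sqe));
          sqe->opcode = IORING_OP_ACCEPT;
          sqe->fd = listen_fd;                    /* listening socket */
          sqe->addr = (unsigned long) addr;       /* struct sockaddr * */
          sqe->addr2 = (unsigned long) addrlen;   /* socklen_t * */
          sqe->accept_flags = SOCK_CLOEXEC;       /* accept4(2) flags */
          sqe->user_data = 0x1234;                /* passed back in the cqe */
          return io_uring_submit(ring);
  }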

The series is against my for-5.5/io_uring-wq tree, and also exists
as a for-5.5/io_uring-test branch. liburing has a basic test case for
this (test/accept.c).

Changes since v1:
- Respin on top of for-5.5/io_uring-wq branch, as io-wq is needed to
  support cancellation of requests.
- Rework files_struct inheritance to avoid cyclic dependencies. Thanks
  to Jann Horn for some great suggestions there.


 fs/io-wq.c                    |  14 +++
 fs/io-wq.h                    |   3 +
 fs/io_uring.c                 | 157 ++++++++++++++++++++++++++++++++--
 include/linux/socket.h        |   3 +
 include/uapi/linux/io_uring.h |   7 +-
 net/socket.c                  |  65 ++++++++------
 6 files changed, 217 insertions(+), 32 deletions(-)

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/4] io_uring: reorder struct sqe_submit
  2019-10-25 17:30 [PATCHSET v2 0/4] io_uring: add support for accept(4) Jens Axboe
@ 2019-10-25 17:30 ` Jens Axboe
  2019-10-25 17:30 ` [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2019-10-25 17:30 UTC (permalink / raw)
  To: linux-block; +Cc: davem, netdev, jannh, Jens Axboe

Reorder it to pack it better: this takes it from 24 bytes down to
16 bytes, and shrinks io_kiocb from 192 to 184 bytes.

No functional changes in this patch.
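
For reference, the size difference is easy to reproduce with a
stand-alone sketch on a 64-bit build (illustrative only; a void pointer
stands in for the real const struct io_uring_sqe pointer, and the struct
names are made up):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  struct sqe_submit_old {
          const void      *sqe;                   /* 8 bytes */
          unsigned short  index;                  /* 2 bytes + 2 padding */
          uint32_t        sequence;               /* 4 bytes */
          bool            has_user;               /* 3 bytes of bools... */
          bool            in_async;
          bool            needs_fixed_file;       /* ...+ 5 bytes tail pad */
  };

  struct sqe_submit_new {
          const void      *sqe;                   /* 8 bytes */
          unsigned short  index;                  /* 2 bytes */
          bool            has_user : 1;           /* three flags share */
          bool            in_async : 1;           /* a single byte,    */
          bool            needs_fixed_file : 1;   /* + 1 byte padding  */
          uint32_t        sequence;               /* 4 bytes */
  };

  int main(void)
  {
          /* prints "old: 24 new: 16" on x86-64 */
          printf("old: %zu new: %zu\n", sizeof(struct sqe_submit_old),
                 sizeof(struct sqe_submit_new));
          return 0;
  }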

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 13c1ebf96626..effa385ebe72 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -260,10 +260,10 @@ struct io_ring_ctx {
 struct sqe_submit {
 	const struct io_uring_sqe	*sqe;
 	unsigned short			index;
+	bool				has_user : 1;
+	bool				in_async : 1;
+	bool				needs_fixed_file : 1;
 	u32				sequence;
-	bool				has_user;
-	bool				in_async;
-	bool				needs_fixed_file;
 };
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2019-10-25 17:30 [PATCHSET v2 0/4] io_uring: add support for accept(4) Jens Axboe
  2019-10-25 17:30 ` [PATCH 1/4] io_uring: reorder struct sqe_submit Jens Axboe
@ 2019-10-25 17:30 ` Jens Axboe
  2019-10-25 21:31   ` Pavel Begunkov
  2020-01-26 10:12   ` Andres Freund
  2019-10-25 17:30 ` [PATCH 3/4] net: add __sys_accept4_file() helper Jens Axboe
  2019-10-25 17:30 ` [PATCH 4/4] io_uring: add support for IORING_OP_ACCEPT Jens Axboe
  3 siblings, 2 replies; 11+ messages in thread
From: Jens Axboe @ 2019-10-25 17:30 UTC (permalink / raw)
  To: linux-block; +Cc: davem, netdev, jannh, Jens Axboe

This is in preparation for adding opcodes that need to add new files
in a process file table, system calls like open(2) or accept4(2).

If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
item. If work that needs to get punted to async context has this
set, the async worker will assume the original task's file table before
executing the work.

Note that opcodes that need access to the current files of an
application cannot be done through IORING_SETUP_SQPOLL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io-wq.c    |  14 ++++++
 fs/io-wq.h    |   3 ++
 fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 129 insertions(+), 4 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 99ac5e338d99..134c4632c0be 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -52,6 +52,7 @@ struct io_worker {
 
 	struct rcu_head rcu;
 	struct mm_struct *mm;
+	struct files_struct *restore_files;
 };
 
 struct io_wq_nulls_list {
@@ -128,6 +129,12 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
 	__must_hold(wqe->lock)
 	__releases(wqe->lock)
 {
+	if (current->files != worker->restore_files) {
+		task_lock(current);
+		current->files = worker->restore_files;
+		task_unlock(current);
+	}
+
 	/*
 	 * If we have an active mm, we need to drop the wq lock before unusing
 	 * it. If we do, return true and let the caller retry the idle loop.
@@ -188,6 +195,7 @@ static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker)
 	current->flags |= PF_IO_WORKER;
 
 	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
+	worker->restore_files = current->files;
 	atomic_inc(&wqe->nr_running);
 }
 
@@ -278,6 +286,12 @@ static void io_worker_handle_work(struct io_worker *worker)
 		if (!work)
 			break;
 next:
+		if ((work->flags & IO_WQ_WORK_NEEDS_FILES) &&
+		    current->files != work->files) {
+			task_lock(current);
+			current->files = work->files;
+			task_unlock(current);
+		}
 		if ((work->flags & IO_WQ_WORK_NEEDS_USER) && !worker->mm &&
 		    wq->mm && mmget_not_zero(wq->mm)) {
 			use_mm(wq->mm);
diff --git a/fs/io-wq.h b/fs/io-wq.h
index be8f22c8937b..e93f764b1fa4 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -8,6 +8,7 @@ enum {
 	IO_WQ_WORK_HAS_MM	= 2,
 	IO_WQ_WORK_HASHED	= 4,
 	IO_WQ_WORK_NEEDS_USER	= 8,
+	IO_WQ_WORK_NEEDS_FILES	= 16,
 
 	IO_WQ_HASH_SHIFT	= 24,	/* upper 8 bits are used for hash key */
 };
@@ -22,12 +23,14 @@ struct io_wq_work {
 	struct list_head list;
 	void (*func)(struct io_wq_work **);
 	unsigned flags;
+	struct files_struct *files;
 };
 
 #define INIT_IO_WORK(work, _func)			\
 	do {						\
 		(work)->func = _func;			\
 		(work)->flags = 0;			\
+		(work)->files = NULL;			\
 	} while (0)					\
 
 struct io_wq *io_wq_create(unsigned concurrency, struct mm_struct *mm);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index effa385ebe72..5a6f8e1dc718 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -196,6 +196,8 @@ struct io_ring_ctx {
 
 		struct list_head	defer_list;
 		struct list_head	timeout_list;
+
+		wait_queue_head_t	inflight_wait;
 	} ____cacheline_aligned_in_smp;
 
 	/* IO offload */
@@ -250,6 +252,9 @@ struct io_ring_ctx {
 		 */
 		struct list_head	poll_list;
 		struct list_head	cancel_list;
+
+		spinlock_t		inflight_lock;
+		struct list_head	inflight_list;
 	} ____cacheline_aligned_in_smp;
 
 #if defined(CONFIG_UNIX)
@@ -259,11 +264,13 @@ struct io_ring_ctx {
 
 struct sqe_submit {
 	const struct io_uring_sqe	*sqe;
+	struct file			*ring_file;
 	unsigned short			index;
 	bool				has_user : 1;
 	bool				in_async : 1;
 	bool				needs_fixed_file : 1;
 	u32				sequence;
+	int				ring_fd;
 };
 
 /*
@@ -318,10 +325,13 @@ struct io_kiocb {
 #define REQ_F_TIMEOUT		1024	/* timeout request */
 #define REQ_F_ISREG		2048	/* regular file */
 #define REQ_F_MUST_PUNT		4096	/* must be punted even for NONBLOCK */
+#define REQ_F_INFLIGHT		8192	/* on inflight list */
 	u64			user_data;
 	u32			result;
 	u32			sequence;
 
+	struct list_head	inflight_entry;
+
 	struct io_wq_work	work;
 };
 
@@ -402,6 +412,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->cancel_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
+	init_waitqueue_head(&ctx->inflight_wait);
+	spin_lock_init(&ctx->inflight_lock);
+	INIT_LIST_HEAD(&ctx->inflight_list);
 	return ctx;
 }
 
@@ -671,9 +684,20 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
 
 static void __io_free_req(struct io_kiocb *req)
 {
+	struct io_ring_ctx *ctx = req->ctx;
+
 	if (req->file && !(req->flags & REQ_F_FIXED_FILE))
 		fput(req->file);
-	percpu_ref_put(&req->ctx->refs);
+	if (req->flags & REQ_F_INFLIGHT) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&ctx->inflight_lock, flags);
+		list_del(&req->inflight_entry);
+		if (waitqueue_active(&ctx->inflight_wait))
+			wake_up(&ctx->inflight_wait);
+		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
+	}
+	percpu_ref_put(&ctx->refs);
 	kmem_cache_free(req_cachep, req);
 }
 
@@ -2277,6 +2301,30 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	return 0;
 }
 
+static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	int ret = -EBADF;
+
+	rcu_read_lock();
+	spin_lock_irq(&ctx->inflight_lock);
+	/*
+	 * We use the f_ops->flush() handler to ensure that we can flush
+	 * out work accessing these files if the fd is closed. Check if
+	 * the fd has changed since we started down this path, and disallow
+	 * this operation if it has.
+	 */
+	if (fcheck(req->submit.ring_fd) == req->submit.ring_file) {
+		list_add(&req->inflight_entry, &ctx->inflight_list);
+		req->flags |= REQ_F_INFLIGHT;
+		req->work.files = current->files;
+		ret = 0;
+	}
+	spin_unlock_irq(&ctx->inflight_lock);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			struct sqe_submit *s)
 {
@@ -2296,17 +2344,25 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		if (sqe_copy) {
 			s->sqe = sqe_copy;
 			memcpy(&req->submit, s, sizeof(*s));
-			io_queue_async_work(ctx, req);
+			if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) {
+				ret = io_grab_files(ctx, req);
+				if (ret) {
+					kfree(sqe_copy);
+					goto err;
+				}
+			}
 
 			/*
 			 * Queued up for async execution, worker will release
 			 * submit reference when the iocb is actually submitted.
 			 */
+			io_queue_async_work(ctx, req);
 			return 0;
 		}
 	}
 
 	/* drop submission reference */
+err:
 	io_put_req(req, NULL);
 
 	/* and drop final reference, if we failed */
@@ -2509,6 +2565,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 
 	head = READ_ONCE(sq_array[head & ctx->sq_mask]);
 	if (head < ctx->sq_entries) {
+		s->ring_file = NULL;
 		s->index = head;
 		s->sqe = &ctx->sq_sqes[head];
 		s->sequence = ctx->cached_sq_head;
@@ -2716,7 +2773,8 @@ static int io_sq_thread(void *data)
 	return 0;
 }
 
-static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
+			  struct file *ring_file, int ring_fd)
 {
 	struct io_submit_state state, *statep = NULL;
 	struct io_kiocb *link = NULL;
@@ -2758,9 +2816,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		}
 
 out:
+		s.ring_file = ring_file;
 		s.has_user = true;
 		s.in_async = false;
 		s.needs_fixed_file = false;
+		s.ring_fd = ring_fd;
 		submit++;
 		trace_io_uring_submit_sqe(ctx, true, false);
 		io_submit_sqe(ctx, &s, statep, &link);
@@ -3722,6 +3782,53 @@ static int io_uring_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static void io_uring_cancel_files(struct io_ring_ctx *ctx,
+				  struct files_struct *files)
+{
+	struct io_kiocb *req;
+	DEFINE_WAIT(wait);
+
+	while (!list_empty_careful(&ctx->inflight_list)) {
+		enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
+
+		spin_lock_irq(&ctx->inflight_lock);
+		list_for_each_entry(req, &ctx->inflight_list, inflight_entry) {
+			if (req->work.files == files) {
+				ret = io_wq_cancel_work(ctx->io_wq, &req->work);
+				break;
+			}
+		}
+		if (ret == IO_WQ_CANCEL_RUNNING)
+			prepare_to_wait(&ctx->inflight_wait, &wait,
+					TASK_UNINTERRUPTIBLE);
+
+		spin_unlock_irq(&ctx->inflight_lock);
+
+		/*
+		 * We need to keep going until we get NOTFOUND. We only cancel
+		 * one work at the time.
+		 *
+		 * If we get CANCEL_RUNNING, then wait for a work to complete
+		 * before continuing.
+		 */
+		if (ret == IO_WQ_CANCEL_OK)
+			continue;
+		else if (ret != IO_WQ_CANCEL_RUNNING)
+			break;
+		schedule();
+	}
+}
+
+static int io_uring_flush(struct file *file, void *data)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	io_uring_cancel_files(ctx, data);
+	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
+		io_wq_cancel_all(ctx->io_wq);
+	return 0;
+}
+
 static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
@@ -3790,7 +3897,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		to_submit = min(to_submit, ctx->sq_entries);
 
 		mutex_lock(&ctx->uring_lock);
-		submitted = io_ring_submit(ctx, to_submit);
+		submitted = io_ring_submit(ctx, to_submit, f.file, fd);
 		mutex_unlock(&ctx->uring_lock);
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
@@ -3813,6 +3920,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 
 static const struct file_operations io_uring_fops = {
 	.release	= io_uring_release,
+	.flush		= io_uring_flush,
 	.mmap		= io_uring_mmap,
 	.poll		= io_uring_poll,
 	.fasync		= io_uring_fasync,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/4] net: add __sys_accept4_file() helper
  2019-10-25 17:30 [PATCHSET v2 0/4] io_uring: add support for accept(4) Jens Axboe
  2019-10-25 17:30 ` [PATCH 1/4] io_uring: reorder struct sqe_submit Jens Axboe
  2019-10-25 17:30 ` [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files Jens Axboe
@ 2019-10-25 17:30 ` Jens Axboe
  2019-10-25 17:30 ` [PATCH 4/4] io_uring: add support for IORING_OP_ACCEPT Jens Axboe
  3 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2019-10-25 17:30 UTC (permalink / raw)
  To: linux-block; +Cc: davem, netdev, jannh, Jens Axboe

This is identical to __sys_accept4(), except it takes a struct file
instead of an fd, and it also allows passing in extra file flags that
get OR'ed in with file->f_flags for the accept itself. The latter is
done to support masking in O_NONBLOCK without manipulating the original
file flags.

No functional changes in this patch.

Cc: David Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/socket.h |  3 ++
 net/socket.c           | 65 ++++++++++++++++++++++++++----------------
 2 files changed, 44 insertions(+), 24 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index fc0bed59fc84..dd061f741bc1 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -392,6 +392,9 @@ extern int __sys_recvfrom(int fd, void __user *ubuf, size_t size,
 extern int __sys_sendto(int fd, void __user *buff, size_t len,
 			unsigned int flags, struct sockaddr __user *addr,
 			int addr_len);
+extern int __sys_accept4_file(struct file *file, unsigned file_flags,
+			struct sockaddr __user *upeer_sockaddr,
+			 int __user *upeer_addrlen, int flags);
 extern int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
 			 int __user *upeer_addrlen, int flags);
 extern int __sys_socket(int family, int type, int protocol);
diff --git a/net/socket.c b/net/socket.c
index 6a9ab7a8b1d2..40ab39f6c5d8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1690,24 +1690,13 @@ SYSCALL_DEFINE2(listen, int, fd, int, backlog)
 	return __sys_listen(fd, backlog);
 }
 
-/*
- *	For accept, we attempt to create a new socket, set up the link
- *	with the client, wake up the client, then return the new
- *	connected fd. We collect the address of the connector in kernel
- *	space and move it to user at the very end. This is unclean because
- *	we open the socket then return an error.
- *
- *	1003.1g adds the ability to recvmsg() to query connection pending
- *	status to recvmsg. We need to add that support in a way thats
- *	clean when we restructure accept also.
- */
-
-int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
-		  int __user *upeer_addrlen, int flags)
+int __sys_accept4_file(struct file *file, unsigned file_flags,
+		       struct sockaddr __user *upeer_sockaddr,
+		       int __user *upeer_addrlen, int flags)
 {
 	struct socket *sock, *newsock;
 	struct file *newfile;
-	int err, len, newfd, fput_needed;
+	int err, len, newfd;
 	struct sockaddr_storage address;
 
 	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
@@ -1716,14 +1705,14 @@ int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
 		flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
 
-	sock = sockfd_lookup_light(fd, &err, &fput_needed);
+	sock = sock_from_file(file, &err);
 	if (!sock)
 		goto out;
 
 	err = -ENFILE;
 	newsock = sock_alloc();
 	if (!newsock)
-		goto out_put;
+		goto out;
 
 	newsock->type = sock->type;
 	newsock->ops = sock->ops;
@@ -1738,20 +1727,21 @@ int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
 	if (unlikely(newfd < 0)) {
 		err = newfd;
 		sock_release(newsock);
-		goto out_put;
+		goto out;
 	}
 	newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name);
 	if (IS_ERR(newfile)) {
 		err = PTR_ERR(newfile);
 		put_unused_fd(newfd);
-		goto out_put;
+		goto out;
 	}
 
 	err = security_socket_accept(sock, newsock);
 	if (err)
 		goto out_fd;
 
-	err = sock->ops->accept(sock, newsock, sock->file->f_flags, false);
+	err = sock->ops->accept(sock, newsock, sock->file->f_flags | file_flags,
+					false);
 	if (err < 0)
 		goto out_fd;
 
@@ -1772,15 +1762,42 @@ int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
 
 	fd_install(newfd, newfile);
 	err = newfd;
-
-out_put:
-	fput_light(sock->file, fput_needed);
 out:
 	return err;
 out_fd:
 	fput(newfile);
 	put_unused_fd(newfd);
-	goto out_put;
+	goto out;
+
+}
+
+/*
+ *	For accept, we attempt to create a new socket, set up the link
+ *	with the client, wake up the client, then return the new
+ *	connected fd. We collect the address of the connector in kernel
+ *	space and move it to user at the very end. This is unclean because
+ *	we open the socket then return an error.
+ *
+ *	1003.1g adds the ability to recvmsg() to query connection pending
+ *	status to recvmsg. We need to add that support in a way thats
+ *	clean when we restructure accept also.
+ */
+
+int __sys_accept4(int fd, struct sockaddr __user *upeer_sockaddr,
+		  int __user *upeer_addrlen, int flags)
+{
+	int ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (f.file) {
+		ret = __sys_accept4_file(f.file, 0, upeer_sockaddr,
+						upeer_addrlen, flags);
+		if (f.flags)
+			fput(f.file);
+	}
+
+	return ret;
 }
 
 SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/4] io_uring: add support for IORING_OP_ACCEPT
  2019-10-25 17:30 [PATCHSET v2 0/4] io_uring: add support for accept(4) Jens Axboe
                   ` (2 preceding siblings ...)
  2019-10-25 17:30 ` [PATCH 3/4] net: add __sys_accept4_file() helper Jens Axboe
@ 2019-10-25 17:30 ` Jens Axboe
  3 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2019-10-25 17:30 UTC (permalink / raw)
  To: linux-block; +Cc: davem, netdev, jannh, Jens Axboe

This allows an application to call accept4() in an async fashion. Like
other opcodes, we first try a non-blocking accept, then punt to async
context if we have to.
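
On the completion side nothing new is needed: the cqe res field carries
what accept4(2) would have returned, i.e. the new socket fd or a negative
error. As a rough liburing-based sketch (the reap_one() helper is made up
for illustration):

  #include <liburing.h>

  /* Illustrative only: wait for one completion and return its result. */
  static int reap_one(struct io_uring *ring)
  {
          struct io_uring_cqe *cqe;
          int ret = io_uring_wait_cqe(ring, &cqe);

          if (ret < 0)
                  return ret;
          ret = cqe->res;                 /* accepted socket fd, or -errno */
          io_uring_cqe_seen(ring, cqe);
          return ret;
  }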

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 35 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  7 ++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5a6f8e1dc718..4402485f0879 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1687,6 +1687,38 @@ static int io_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 #endif
 }
 
+static int io_accept(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		     struct io_kiocb **nxt, bool force_nonblock)
+{
+#if defined(CONFIG_NET)
+	struct sockaddr __user *addr;
+	int __user *addr_len;
+	unsigned file_flags;
+	int flags, ret;
+
+	if (unlikely(req->ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL)))
+		return -EINVAL;
+
+	addr = (struct sockaddr __user *) (unsigned long) READ_ONCE(sqe->addr);
+	addr_len = (int __user *) (unsigned long) READ_ONCE(sqe->addr2);
+	flags = READ_ONCE(sqe->accept_flags);
+	file_flags = force_nonblock ? O_NONBLOCK : 0;
+
+	ret = __sys_accept4_file(req->file, file_flags, addr, addr_len, flags);
+	if (ret == -EAGAIN && force_nonblock) {
+		req->work.flags |= IO_WQ_WORK_NEEDS_FILES;
+		return -EAGAIN;
+	}
+	if (ret < 0 && (req->flags & REQ_F_LINK))
+		req->flags |= REQ_F_FAIL_LINK;
+	io_cqring_add_event(req->ctx, sqe->user_data, ret);
+	io_put_req(req, nxt);
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
 static void io_poll_remove_one(struct io_kiocb *req)
 {
 	struct io_poll_iocb *poll = &req->poll;
@@ -2174,6 +2206,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_TIMEOUT_REMOVE:
 		ret = io_timeout_remove(req, s->sqe);
 		break;
+	case IORING_OP_ACCEPT:
+		ret = io_accept(req, s->sqe, nxt, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6dc5ced1c37a..f82d90e617a6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -19,7 +19,10 @@ struct io_uring_sqe {
 	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
-	__u64	off;		/* offset into file */
+	union {
+		__u64	off;	/* offset into file */
+		__u64	addr2;
+	};
 	__u64	addr;		/* pointer to buffer or iovecs */
 	__u32	len;		/* buffer size or number of iovecs */
 	union {
@@ -29,6 +32,7 @@ struct io_uring_sqe {
 		__u32		sync_range_flags;
 		__u32		msg_flags;
 		__u32		timeout_flags;
+		__u32		accept_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -65,6 +69,7 @@ struct io_uring_sqe {
 #define IORING_OP_RECVMSG	10
 #define IORING_OP_TIMEOUT	11
 #define IORING_OP_TIMEOUT_REMOVE	12
+#define IORING_OP_ACCEPT	13
 
 /*
  * sqe->fsync_flags
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2019-10-25 17:30 ` [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files Jens Axboe
@ 2019-10-25 21:31   ` Pavel Begunkov
  2019-10-25 21:45     ` Jens Axboe
  2020-01-26 10:12   ` Andres Freund
  1 sibling, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2019-10-25 21:31 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: davem, netdev, jannh


On 25/10/2019 20:30, Jens Axboe wrote:
> This is in preparation for adding opcodes that need to add new files
> in a process file table, system calls like open(2) or accept4(2).
> 
> If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
> item. If work that needs to get punted to async context have this
> set, the async worker will assume the original task file table before
> executing the work.
> 
> Note that opcodes that need access to the current files of an
> application cannot be done through IORING_SETUP_SQPOLL.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io-wq.c    |  14 ++++++
>  fs/io-wq.h    |   3 ++
>  fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 129 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 99ac5e338d99..134c4632c0be 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -52,6 +52,7 @@ struct io_worker {
>  
>  	struct rcu_head rcu;
>  	struct mm_struct *mm;
> +	struct files_struct *restore_files;
>  };
>  
>  struct io_wq_nulls_list {
> @@ -128,6 +129,12 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
>  	__must_hold(wqe->lock)
>  	__releases(wqe->lock)
>  {
> +	if (current->files != worker->restore_files) {
> +		task_lock(current);
> +		current->files = worker->restore_files;
> +		task_unlock(current);
> +	}
> +
>  	/*
>  	 * If we have an active mm, we need to drop the wq lock before unusing
>  	 * it. If we do, return true and let the caller retry the idle loop.
> @@ -188,6 +195,7 @@ static void io_worker_start(struct io_wqe *wqe, struct io_worker *worker)
>  	current->flags |= PF_IO_WORKER;
>  
>  	worker->flags |= (IO_WORKER_F_UP | IO_WORKER_F_RUNNING);
> +	worker->restore_files = current->files;
>  	atomic_inc(&wqe->nr_running);
>  }
>  
> @@ -278,6 +286,12 @@ static void io_worker_handle_work(struct io_worker *worker)
>  		if (!work)
>  			break;
>  next:
> +		if ((work->flags & IO_WQ_WORK_NEEDS_FILES) &&
> +		    current->files != work->files) {
> +			task_lock(current);
> +			current->files = work->files;
> +			task_unlock(current);
> +		}
>  		if ((work->flags & IO_WQ_WORK_NEEDS_USER) && !worker->mm &&
>  		    wq->mm && mmget_not_zero(wq->mm)) {
>  			use_mm(wq->mm);
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index be8f22c8937b..e93f764b1fa4 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -8,6 +8,7 @@ enum {
>  	IO_WQ_WORK_HAS_MM	= 2,
>  	IO_WQ_WORK_HASHED	= 4,
>  	IO_WQ_WORK_NEEDS_USER	= 8,
> +	IO_WQ_WORK_NEEDS_FILES	= 16,
>  
>  	IO_WQ_HASH_SHIFT	= 24,	/* upper 8 bits are used for hash key */
>  };
> @@ -22,12 +23,14 @@ struct io_wq_work {
>  	struct list_head list;
>  	void (*func)(struct io_wq_work **);
>  	unsigned flags;
> +	struct files_struct *files;
>  };
>  
>  #define INIT_IO_WORK(work, _func)			\
>  	do {						\
>  		(work)->func = _func;			\
>  		(work)->flags = 0;			\
> +		(work)->files = NULL;			\
>  	} while (0)					\
>  
>  struct io_wq *io_wq_create(unsigned concurrency, struct mm_struct *mm);
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index effa385ebe72..5a6f8e1dc718 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -196,6 +196,8 @@ struct io_ring_ctx {
>  
>  		struct list_head	defer_list;
>  		struct list_head	timeout_list;
> +
> +		wait_queue_head_t	inflight_wait;
>  	} ____cacheline_aligned_in_smp;
>  
>  	/* IO offload */
> @@ -250,6 +252,9 @@ struct io_ring_ctx {
>  		 */
>  		struct list_head	poll_list;
>  		struct list_head	cancel_list;
> +
> +		spinlock_t		inflight_lock;
> +		struct list_head	inflight_list;
>  	} ____cacheline_aligned_in_smp;
>  
>  #if defined(CONFIG_UNIX)
> @@ -259,11 +264,13 @@ struct io_ring_ctx {
>  
>  struct sqe_submit {
>  	const struct io_uring_sqe	*sqe;
> +	struct file			*ring_file;
>  	unsigned short			index;
>  	bool				has_user : 1;
>  	bool				in_async : 1;
>  	bool				needs_fixed_file : 1;
>  	u32				sequence;
> +	int				ring_fd;
>  };
>  
>  /*
> @@ -318,10 +325,13 @@ struct io_kiocb {
>  #define REQ_F_TIMEOUT		1024	/* timeout request */
>  #define REQ_F_ISREG		2048	/* regular file */
>  #define REQ_F_MUST_PUNT		4096	/* must be punted even for NONBLOCK */
> +#define REQ_F_INFLIGHT		8192	/* on inflight list */
>  	u64			user_data;
>  	u32			result;
>  	u32			sequence;
>  
> +	struct list_head	inflight_entry;
> +
>  	struct io_wq_work	work;
>  };
>  
> @@ -402,6 +412,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>  	INIT_LIST_HEAD(&ctx->cancel_list);
>  	INIT_LIST_HEAD(&ctx->defer_list);
>  	INIT_LIST_HEAD(&ctx->timeout_list);
> +	init_waitqueue_head(&ctx->inflight_wait);
> +	spin_lock_init(&ctx->inflight_lock);
> +	INIT_LIST_HEAD(&ctx->inflight_list);
>  	return ctx;
>  }
>  
> @@ -671,9 +684,20 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>  
>  static void __io_free_req(struct io_kiocb *req)
>  {
> +	struct io_ring_ctx *ctx = req->ctx;
> +
>  	if (req->file && !(req->flags & REQ_F_FIXED_FILE))
>  		fput(req->file);
> -	percpu_ref_put(&req->ctx->refs);
> +	if (req->flags & REQ_F_INFLIGHT) {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&ctx->inflight_lock, flags);
> +		list_del(&req->inflight_entry);
> +		if (waitqueue_active(&ctx->inflight_wait))
> +			wake_up(&ctx->inflight_wait);
> +		spin_unlock_irqrestore(&ctx->inflight_lock, flags);
> +	}
> +	percpu_ref_put(&ctx->refs);
>  	kmem_cache_free(req_cachep, req);
>  }
>  
> @@ -2277,6 +2301,30 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
>  	return 0;
>  }
>  
> +static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req)
> +{
> +	int ret = -EBADF;
> +
> +	rcu_read_lock();
> +	spin_lock_irq(&ctx->inflight_lock);
> +	/*
> +	 * We use the f_ops->flush() handler to ensure that we can flush
> +	 * out work accessing these files if the fd is closed. Check if
> +	 * the fd has changed since we started down this path, and disallow
> +	 * this operation if it has.
> +	 */
> +	if (fcheck(req->submit.ring_fd) == req->submit.ring_file) {
Can we get here from io_submit_sqes()?
ring_fd will be uninitialised in this case.


> +		list_add(&req->inflight_entry, &ctx->inflight_list);
> +		req->flags |= REQ_F_INFLIGHT;
> +		req->work.files = current->files;
> +		ret = 0;
> +	}
> +	spin_unlock_irq(&ctx->inflight_lock);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>  			struct sqe_submit *s)
>  {
> @@ -2296,17 +2344,25 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>  		if (sqe_copy) {
>  			s->sqe = sqe_copy;
>  			memcpy(&req->submit, s, sizeof(*s));
> -			io_queue_async_work(ctx, req);
> +			if (req->work.flags & IO_WQ_WORK_NEEDS_FILES) {
> +				ret = io_grab_files(ctx, req);
> +				if (ret) {
> +					kfree(sqe_copy);
> +					goto err;
> +				}
> +			}
>  
>  			/*
>  			 * Queued up for async execution, worker will release
>  			 * submit reference when the iocb is actually submitted.
>  			 */
> +			io_queue_async_work(ctx, req);
>  			return 0;
>  		}
>  	}
>  
>  	/* drop submission reference */
> +err:
>  	io_put_req(req, NULL);
>  
>  	/* and drop final reference, if we failed */
> @@ -2509,6 +2565,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
>  
>  	head = READ_ONCE(sq_array[head & ctx->sq_mask]);
>  	if (head < ctx->sq_entries) {
> +		s->ring_file = NULL;
>  		s->index = head;
>  		s->sqe = &ctx->sq_sqes[head];
>  		s->sequence = ctx->cached_sq_head;
> @@ -2716,7 +2773,8 @@ static int io_sq_thread(void *data)
>  	return 0;
>  }
>  
> -static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
> +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> +			  struct file *ring_file, int ring_fd)
>  {
>  	struct io_submit_state state, *statep = NULL;
>  	struct io_kiocb *link = NULL;
> @@ -2758,9 +2816,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
>  		}
>  
>  out:
> +		s.ring_file = ring_file;
>  		s.has_user = true;
>  		s.in_async = false;
>  		s.needs_fixed_file = false;
> +		s.ring_fd = ring_fd;
>  		submit++;
>  		trace_io_uring_submit_sqe(ctx, true, false);
>  		io_submit_sqe(ctx, &s, statep, &link);
> @@ -3722,6 +3782,53 @@ static int io_uring_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> +static void io_uring_cancel_files(struct io_ring_ctx *ctx,
> +				  struct files_struct *files)
> +{
> +	struct io_kiocb *req;
> +	DEFINE_WAIT(wait);
> +
> +	while (!list_empty_careful(&ctx->inflight_list)) {
> +		enum io_wq_cancel ret = IO_WQ_CANCEL_NOTFOUND;
> +
> +		spin_lock_irq(&ctx->inflight_lock);
> +		list_for_each_entry(req, &ctx->inflight_list, inflight_entry) {
> +			if (req->work.files == files) {
> +				ret = io_wq_cancel_work(ctx->io_wq, &req->work);
> +				break;
> +			}
> +		}
> +		if (ret == IO_WQ_CANCEL_RUNNING)
> +			prepare_to_wait(&ctx->inflight_wait, &wait,
> +					TASK_UNINTERRUPTIBLE);
> +
> +		spin_unlock_irq(&ctx->inflight_lock);
> +
> +		/*
> +		 * We need to keep going until we get NOTFOUND. We only cancel
> +		 * one work at the time.
> +		 *
> +		 * If we get CANCEL_RUNNING, then wait for a work to complete
> +		 * before continuing.
> +		 */
> +		if (ret == IO_WQ_CANCEL_OK)
> +			continue;
> +		else if (ret != IO_WQ_CANCEL_RUNNING)
> +			break;
> +		schedule();
> +	}
> +}
> +
> +static int io_uring_flush(struct file *file, void *data)
> +{
> +	struct io_ring_ctx *ctx = file->private_data;
> +
> +	io_uring_cancel_files(ctx, data);
> +	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
> +		io_wq_cancel_all(ctx->io_wq);
> +	return 0;
> +}
> +
>  static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
> @@ -3790,7 +3897,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>  		to_submit = min(to_submit, ctx->sq_entries);
>  
>  		mutex_lock(&ctx->uring_lock);
> -		submitted = io_ring_submit(ctx, to_submit);
> +		submitted = io_ring_submit(ctx, to_submit, f.file, fd);
>  		mutex_unlock(&ctx->uring_lock);
>  	}
>  	if (flags & IORING_ENTER_GETEVENTS) {
> @@ -3813,6 +3920,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>  
>  static const struct file_operations io_uring_fops = {
>  	.release	= io_uring_release,
> +	.flush		= io_uring_flush,
>  	.mmap		= io_uring_mmap,
>  	.poll		= io_uring_poll,
>  	.fasync		= io_uring_fasync,
> 

-- 
Yours sincerely,
Pavel Begunkov



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2019-10-25 21:31   ` Pavel Begunkov
@ 2019-10-25 21:45     ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2019-10-25 21:45 UTC (permalink / raw)
  To: Pavel Begunkov, linux-block; +Cc: davem, netdev, jannh

On 10/25/19 3:31 PM, Pavel Begunkov wrote:
> On 25/10/2019 20:30, Jens Axboe wrote:
>> +static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req)
>> +{
>> +	int ret = -EBADF;
>> +
>> +	rcu_read_lock();
>> +	spin_lock_irq(&ctx->inflight_lock);
>> +	/*
>> +	 * We use the f_ops->flush() handler to ensure that we can flush
>> +	 * out work accessing these files if the fd is closed. Check if
>> +	 * the fd has changed since we started down this path, and disallow
>> +	 * this operation if it has.
>> +	 */
>> +	if (fcheck(req->submit.ring_fd) == req->submit.ring_file) {
> Can we get here from io_submit_sqes()?
> ring_fd will be uninitialised in this case.

We can't; we disallow submission from the sq thread of any opcode
(for now just accept) that needs a file table.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2019-10-25 17:30 ` [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files Jens Axboe
  2019-10-25 21:31   ` Pavel Begunkov
@ 2020-01-26 10:12   ` Andres Freund
  2020-01-26 17:10     ` Jens Axboe
  1 sibling, 1 reply; 11+ messages in thread
From: Andres Freund @ 2020-01-26 10:12 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, io-uring, davem, netdev, jannh

Hi,

On 2019-10-25 11:30:35 -0600, Jens Axboe wrote:
> This is in preparation for adding opcodes that need to add new files
> in a process file table, system calls like open(2) or accept4(2).
> 
> If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
> item. If work that needs to get punted to async context have this
> set, the async worker will assume the original task file table before
> executing the work.
> 
> Note that opcodes that need access to the current files of an
> application cannot be done through IORING_SETUP_SQPOLL.


Unfortunately this partially breaks sharing a uring with forked-off
processes, even though it initially appears to work:


> +static int io_uring_flush(struct file *file, void *data)
> +{
> +	struct io_ring_ctx *ctx = file->private_data;
> +
> +	io_uring_cancel_files(ctx, data);
> +	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
> +		io_wq_cancel_all(ctx->io_wq);
> +	return 0;
> +}

Once one process that has the uring fd open (even if it is just a fork
that never touches the uring, I believe) exits, the uring stops being
usable for any async tasks. The process exiting closes the fd, which
triggers flush. io_wq_cancel_all() sets IO_WQ_BIT_CANCEL, which never
gets unset, and that in turn causes all future async sqes to be
immediately returned as -ECANCELED by the worker, via io_req_cancelled.
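
To make the scenario concrete, boiled down it is roughly the following
(hypothetical sketch, not the actual postgres code):

  #include <liburing.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          struct io_uring ring;

          io_uring_queue_init(8, &ring, 0);

          if (fork() == 0)        /* child inherits the ring fd...      */
                  _exit(0);       /* ...its exit closes it -> ->flush() */
          wait(NULL);

          /*
           * From here on, any sqe that gets punted to async context comes
           * back as -ECANCELED, because the child's exit went through
           * io_uring_flush() -> io_wq_cancel_all(), which set
           * IO_WQ_BIT_CANCEL, and nothing ever clears that bit again.
           */
          return 0;
  }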

It's not clear to me why a close() should cancel the wq (nor clear
the entire backlog, after 1d7bb1d50fb4)? Couldn't that even just be a
dup()ed fd? Or a fork that immediately exec()s?

After rudely ifdefing out the above if, and reverting 44d282796f81, my
WIP io_uring-using version of postgres appears to pass its tests - which
are very sparse at this point - again with 5.5-rc7.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2020-01-26 10:12   ` Andres Freund
@ 2020-01-26 17:10     ` Jens Axboe
  2020-01-26 17:17       ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2020-01-26 17:10 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-block, io-uring, davem, netdev, jannh

On 1/26/20 3:12 AM, Andres Freund wrote:
> Hi,
> 
> On 2019-10-25 11:30:35 -0600, Jens Axboe wrote:
>> This is in preparation for adding opcodes that need to add new files
>> in a process file table, system calls like open(2) or accept4(2).
>>
>> If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
>> item. If work that needs to get punted to async context have this
>> set, the async worker will assume the original task file table before
>> executing the work.
>>
>> Note that opcodes that need access to the current files of an
>> application cannot be done through IORING_SETUP_SQPOLL.
> 
> 
> Unfortunately this partially breaks sharing a uring across with forked
> off processes, even though it initially appears to work:
> 
> 
>> +static int io_uring_flush(struct file *file, void *data)
>> +{
>> +	struct io_ring_ctx *ctx = file->private_data;
>> +
>> +	io_uring_cancel_files(ctx, data);
>> +	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
>> +		io_wq_cancel_all(ctx->io_wq);
>> +	return 0;
>> +}
> 
> Once one process having the uring fd open (even if it were just a fork
> never touching the uring, I believe) exits, this prevents the uring from
> being usable for any async tasks. The process exiting closes the fd,
> which triggers flush. io_wq_cancel_all() sets IO_WQ_BIT_CANCEL, which
> never gets unset, which causes all future async sqes to be be
> immediately returned as -ECANCELLED by the worker, via io_req_cancelled.
> 
> It's not clear to me why a close() should cancel the the wq (nor clear
> the entire backlog, after 1d7bb1d50fb4)? Couldn't that even just be a
> dup()ed fd? Or a fork that immediately exec()s?
> 
> After rudely ifdefing out the above if, and reverting 44d282796f81, my
> WIP io_uring using version of postgres appears to pass its tests - which
> are very sparse at this point - again with 5.5-rc7.

We need to cancel work items using the files from this process if it
exits, but I think we should be fine not canceling all work. Especially
since the setting of IO_WQ_BIT_CANCEL is a one-way street...  I'm assuming
the below works for you?

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e5b502091804..e3ac2a6ff195 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5044,10 +5044,8 @@ static int io_uring_flush(struct file *file, void *data)
 	struct io_ring_ctx *ctx = file->private_data;
 
 	io_uring_cancel_files(ctx, data);
-	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
+	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
 		io_cqring_overflow_flush(ctx, true);
-		io_wq_cancel_all(ctx->io_wq);
-	}
 	return 0;
 }
 
-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2020-01-26 17:10     ` Jens Axboe
@ 2020-01-26 17:17       ` Jens Axboe
  2020-01-26 20:07         ` Andres Freund
  0 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2020-01-26 17:17 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-block, io-uring, davem, netdev, jannh

On 1/26/20 10:10 AM, Jens Axboe wrote:
> On 1/26/20 3:12 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2019-10-25 11:30:35 -0600, Jens Axboe wrote:
>>> This is in preparation for adding opcodes that need to add new files
>>> in a process file table, system calls like open(2) or accept4(2).
>>>
>>> If an opcode needs this, it must set IO_WQ_WORK_NEEDS_FILES in the work
>>> item. If work that needs to get punted to async context have this
>>> set, the async worker will assume the original task file table before
>>> executing the work.
>>>
>>> Note that opcodes that need access to the current files of an
>>> application cannot be done through IORING_SETUP_SQPOLL.
>>
>>
>> Unfortunately this partially breaks sharing a uring across with forked
>> off processes, even though it initially appears to work:
>>
>>
>>> +static int io_uring_flush(struct file *file, void *data)
>>> +{
>>> +	struct io_ring_ctx *ctx = file->private_data;
>>> +
>>> +	io_uring_cancel_files(ctx, data);
>>> +	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
>>> +		io_wq_cancel_all(ctx->io_wq);
>>> +	return 0;
>>> +}
>>
>> Once one process having the uring fd open (even if it were just a fork
>> never touching the uring, I believe) exits, this prevents the uring from
>> being usable for any async tasks. The process exiting closes the fd,
>> which triggers flush. io_wq_cancel_all() sets IO_WQ_BIT_CANCEL, which
>> never gets unset, which causes all future async sqes to be be
>> immediately returned as -ECANCELLED by the worker, via io_req_cancelled.
>>
>> It's not clear to me why a close() should cancel the the wq (nor clear
>> the entire backlog, after 1d7bb1d50fb4)? Couldn't that even just be a
>> dup()ed fd? Or a fork that immediately exec()s?
>>
>> After rudely ifdefing out the above if, and reverting 44d282796f81, my
>> WIP io_uring using version of postgres appears to pass its tests - which
>> are very sparse at this point - again with 5.5-rc7.
> 
> We need to cancel work items using the files from this process if it
> exits, but I think we should be fine not canceling all work. Especially
> since thet setting of IO_WQ_BIT_CANCEL is a one way street...  I'm assuming
> the below works for you?

This could be even simpler: for a shared ring setup, it also doesn't make
any sense to flush the cq ring on exit.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e5b502091804..e54556b0fcc6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5044,10 +5044,6 @@ static int io_uring_flush(struct file *file, void *data)
 	struct io_ring_ctx *ctx = file->private_data;
 
 	io_uring_cancel_files(ctx, data);
-	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
-		io_cqring_overflow_flush(ctx, true);
-		io_wq_cancel_all(ctx->io_wq);
-	}
 	return 0;
 }
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files
  2020-01-26 17:17       ` Jens Axboe
@ 2020-01-26 20:07         ` Andres Freund
  0 siblings, 0 replies; 11+ messages in thread
From: Andres Freund @ 2020-01-26 20:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, io-uring, davem, netdev, jannh

Hi,

On 2020-01-26 10:17:00 -0700, Jens Axboe wrote:
> On 1/26/20 10:10 AM, Jens Axboe wrote:
> >> Unfortunately this partially breaks sharing a uring across with forked
> >> off processes, even though it initially appears to work:
> >>
> >>
> >>> +static int io_uring_flush(struct file *file, void *data)
> >>> +{
> >>> +	struct io_ring_ctx *ctx = file->private_data;
> >>> +
> >>> +	io_uring_cancel_files(ctx, data);
> >>> +	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
> >>> +		io_wq_cancel_all(ctx->io_wq);
> >>> +	return 0;
> >>> +}
> >>
> >> Once one process having the uring fd open (even if it were just a fork
> >> never touching the uring, I believe) exits, this prevents the uring from
> >> being usable for any async tasks. The process exiting closes the fd,
> >> which triggers flush. io_wq_cancel_all() sets IO_WQ_BIT_CANCEL, which
> >> never gets unset, which causes all future async sqes to be be
> >> immediately returned as -ECANCELLED by the worker, via io_req_cancelled.
> >>
> >> It's not clear to me why a close() should cancel the the wq (nor clear
> >> the entire backlog, after 1d7bb1d50fb4)? Couldn't that even just be a
> >> dup()ed fd? Or a fork that immediately exec()s?
> >>
> >> After rudely ifdefing out the above if, and reverting 44d282796f81, my
> >> WIP io_uring using version of postgres appears to pass its tests - which
> >> are very sparse at this point - again with 5.5-rc7.
> > 
> > We need to cancel work items using the files from this process if it
> > exits, but I think we should be fine not canceling all work. Especially
> > since thet setting of IO_WQ_BIT_CANCEL is a one way street...  I'm assuming
> > the below works for you?
> 
> Could be even simpler, for shared ring setup, it also doesn't make any sense
> to flush the cq ring on exit.
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e5b502091804..e54556b0fcc6 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -5044,10 +5044,6 @@ static int io_uring_flush(struct file *file, void *data)
>  	struct io_ring_ctx *ctx = file->private_data;
>  
>  	io_uring_cancel_files(ctx, data);
> -	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
> -		io_cqring_overflow_flush(ctx, true);
> -		io_wq_cancel_all(ctx->io_wq);
> -	}
>  	return 0;
>  }

Yea, that's what I basically did locally, and it seems to work for my
uses.


It took me quite a while to understand why I wasn't seeing anything
actively cancelling requests inside the workqueue, while submissions to
it continued to succeed.  I wonder if it's a good idea to continue
enqueueing new jobs if the WQ is already cancelled. E.g. during release,
afaict the sqpoll thread is still running, and could just continue
pumping work into the wq?  Perhaps the sqthread (and the enter syscall?
Not sure if possible) needs to stop submitting once the ring is being
closed.


Btw, one of the reasons it took me a while to figure out what was
happening, besides not being familiar with the code, is that the flags
for individual wqes and for the whole wq are named so similarly.
Generally, for the casual reader, the non-distinctive naming of the
different parts makes it somewhat hard to follow along: which part is
about sq offload (sqo_mm, one might think, but it's also used for sync
work I believe), which is about the user-facing sq, which about the
workqueue, etc.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-01-26 20:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-25 17:30 [PATCHSET v2 0/4] io_uring: add support for accept(4) Jens Axboe
2019-10-25 17:30 ` [PATCH 1/4] io_uring: reorder struct sqe_submit Jens Axboe
2019-10-25 17:30 ` [PATCH 2/4] io_uring: io_uring: add support for async work inheriting files Jens Axboe
2019-10-25 21:31   ` Pavel Begunkov
2019-10-25 21:45     ` Jens Axboe
2020-01-26 10:12   ` Andres Freund
2020-01-26 17:10     ` Jens Axboe
2020-01-26 17:17       ` Jens Axboe
2020-01-26 20:07         ` Andres Freund
2019-10-25 17:30 ` [PATCH 3/4] net: add __sys_accept4_file() helper Jens Axboe
2019-10-25 17:30 ` [PATCH 4/4] io_uring: add support for IORING_OP_ACCEPT Jens Axboe
