* [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register
@ 2022-02-03 23:34 Usama Arif
  2022-02-03 23:34 ` [PATCH v5 1/4] io_uring: remove trace for eventfd Usama Arif
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Usama Arif @ 2022-02-03 23:34 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

For opcodes relating to registering/unregistering eventfds, ring quiesce is
avoided by creating a new RCU data structure (io_ev_fd) as part of io_ring_ctx
that holds the eventfd_ctx, with reads to the structure protected by
rcu_read_lock and writes (register/unregister calls) protected by a mutex.

With the above approach ring quiesce, which is much more expensive than an
RCU lock, can be avoided. On the system tested, io_uring_register with
IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
15ms before with ring quiesce.
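
For context, the userspace side of what is being optimized here looks
roughly like the below (a minimal sketch assuming liburing, whose helpers
wrap the io_uring_register syscall with the eventfd opcodes):

#include <liburing.h>
#include <sys/eventfd.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	int efd, ret;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;
	efd = eventfd(0, 0);
	if (efd < 0)
		return 1;

	/* IORING_REGISTER_EVENTFD: efd is signalled when CQEs are posted */
	ret = io_uring_register_eventfd(&ring, efd);
	printf("register: %d\n", ret);

	/* IORING_UNREGISTER_EVENTFD */
	ret = io_uring_unregister_eventfd(&ring);
	printf("unregister: %d\n", ret);

	io_uring_queue_exit(&ring);
	return 0;
}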

The second patch creates the RCU protected data structure and removes ring
quiesce for IORING_REGISTER_EVENTFD and IORING_UNREGISTER_EVENTFD.

The third patch builds on top of the second patch and removes ring quiesce
for IORING_REGISTER_EVENTFD_ASYNC.

The fourth patch completely removes ring quiesce from io_uring_register,
as IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS don't
need it.

---
v4->v5:
- Remove ring quiesce completely from io_uring_register (Pavel Begunkov)
- Replaced rcu_barrier with unregistering flag (Jens Axboe)
- Created a faster check for ctx->io_ev_fd in io_eventfd_signal and cleaned up
io_eventfd_unregister (Jens Axboe)

v3->v4:
- Switch back to call_rcu and use rcu_barrier in case io_eventfd_register fails
to make sure all rcu callbacks have finished.

v2->v3:
- Switched to using synchronize_rcu from call_rcu in io_eventfd_unregister.

v1->v2:
- Added patch to remove eventfd from tracepoint (Patch 1) (Jens Axboe)
- Folded the code of io_should_trigger_evfd into io_eventfd_signal (Jens Axboe)

Usama Arif (4):
  io_uring: remove trace for eventfd
  io_uring: avoid ring quiesce while registering/unregistering eventfd
  io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC
  io_uring: remove ring quiesce for io_uring_register

 fs/io_uring.c                   | 202 ++++++++++++++++----------------
 include/trace/events/io_uring.h |  13 +-
 2 files changed, 107 insertions(+), 108 deletions(-)

-- 
2.25.1



* [PATCH v5 1/4] io_uring: remove trace for eventfd
  2022-02-03 23:34 [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Usama Arif
@ 2022-02-03 23:34 ` Usama Arif
  2022-02-03 23:34 ` [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2022-02-03 23:34 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

The information on whether an eventfd is registered is not very useful,
and keeping it would result in the tracepoint being enclosed in an
rcu_read_lock section in a later patch that avoids ring quiesce for
registering eventfd.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c                   |  3 +--
 include/trace/events/io_uring.h | 13 +++++--------
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e04f718319d..21531609a9c6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11171,8 +11171,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
 	mutex_unlock(&ctx->uring_lock);
-	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs,
-							ctx->cq_ev_fd != NULL, ret);
+	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs, ret);
 out_fput:
 	fdput(f);
 	return ret;
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 7346f0164cf4..098beda7601a 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -57,10 +57,9 @@ TRACE_EVENT(io_uring_create,
  * @opcode:		describes which operation to perform
  * @nr_user_files:	number of registered files
  * @nr_user_bufs:	number of registered buffers
- * @cq_ev_fd:		whether eventfs registered or not
  * @ret:		return code
  *
- * Allows to trace fixed files/buffers/eventfds, that could be registered to
+ * Allows to trace fixed files/buffers, that could be registered to
  * avoid an overhead of getting references to them for every operation. This
  * event, together with io_uring_file_get, can provide a full picture of how
  * much overhead one can reduce via fixing.
@@ -68,16 +67,15 @@ TRACE_EVENT(io_uring_create,
 TRACE_EVENT(io_uring_register,
 
 	TP_PROTO(void *ctx, unsigned opcode, unsigned nr_files,
-			 unsigned nr_bufs, bool eventfd, long ret),
+			 unsigned nr_bufs, long ret),
 
-	TP_ARGS(ctx, opcode, nr_files, nr_bufs, eventfd, ret),
+	TP_ARGS(ctx, opcode, nr_files, nr_bufs, ret),
 
 	TP_STRUCT__entry (
 		__field(  void *,	ctx			)
 		__field(  unsigned,	opcode		)
 		__field(  unsigned,	nr_files	)
 		__field(  unsigned,	nr_bufs		)
-		__field(  bool,		eventfd		)
 		__field(  long,		ret			)
 	),
 
@@ -86,14 +84,13 @@ TRACE_EVENT(io_uring_register,
 		__entry->opcode		= opcode;
 		__entry->nr_files	= nr_files;
 		__entry->nr_bufs	= nr_bufs;
-		__entry->eventfd	= eventfd;
 		__entry->ret		= ret;
 	),
 
 	TP_printk("ring %p, opcode %d, nr_user_files %d, nr_user_bufs %d, "
-			  "eventfd %d, ret %ld",
+			  "ret %ld",
 			  __entry->ctx, __entry->opcode, __entry->nr_files,
-			  __entry->nr_bufs, __entry->eventfd, __entry->ret)
+			  __entry->nr_bufs, __entry->ret)
 );
 
 /**
-- 
2.25.1



* [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 23:34 [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Usama Arif
  2022-02-03 23:34 ` [PATCH v5 1/4] io_uring: remove trace for eventfd Usama Arif
@ 2022-02-03 23:34 ` Usama Arif
  2022-02-03 23:46   ` Pavel Begunkov
  2022-02-03 23:34 ` [PATCH v5 3/4] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC Usama Arif
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2022-02-03 23:34 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

Ring quiesce is avoided by creating a new RCU data structure (io_ev_fd) as
part of io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

The process of registering/unregistering eventfd is done under a lock
so multiple threads don't enter a race condition while
registering/unregistering eventfd.

With the above approach ring quiesce, which is much more expensive than an
RCU lock, can be avoided. On the system tested, io_uring_register with
IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
15ms before with ring quiesce.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 93 insertions(+), 23 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 21531609a9c6..51602bddb9a8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -326,6 +326,13 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
+struct io_ev_fd {
+	struct eventfd_ctx	*cq_ev_fd;
+	struct io_ring_ctx	*ctx;
+	struct rcu_head		rcu;
+	bool 			unregistering;
+};
+
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
@@ -399,7 +406,8 @@ struct io_ring_ctx {
 	struct {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
-		struct eventfd_ctx	*cq_ev_fd;
+		struct io_ev_fd	__rcu	*io_ev_fd;
+		struct mutex		ev_fd_lock;
 		struct wait_queue_head	cq_wait;
 		unsigned		cq_extra;
 		atomic_t		cq_timeouts;
@@ -1448,6 +1456,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
+	mutex_init(&ctx->ev_fd_lock);
 	init_waitqueue_head(&ctx->cq_wait);
 	spin_lock_init(&ctx->completion_lock);
 	spin_lock_init(&ctx->timeout_lock);
@@ -1726,13 +1735,32 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
 	return &rings->cqes[tail & mask];
 }
 
-static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
+static void io_eventfd_signal(struct io_ring_ctx *ctx)
 {
-	if (likely(!ctx->cq_ev_fd))
-		return false;
+	struct io_ev_fd *ev_fd;
+
+	/* Return quickly if ctx->io_ev_fd doesn't exist */
+	if (likely(!rcu_dereference_raw(ctx->io_ev_fd)))
+		return;
+
+	rcu_read_lock();
+	/* rcu_dereference ctx->io_ev_fd once and use it for both for checking and eventfd_signal */
+	ev_fd = rcu_dereference(ctx->io_ev_fd);
+
+	/*
+	 * Check again if ev_fd exists incase an io_eventfd_unregister call completed between
+	 * the NULL check of ctx->io_ev_fd at the start of the function and rcu_read_lock.
+	 */
+	if (unlikely(!ev_fd))
+		goto out;
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
-		return false;
-	return !ctx->eventfd_async || io_wq_current_is_worker();
+		goto out;
+
+	if (!ctx->eventfd_async || io_wq_current_is_worker())
+		eventfd_signal(ev_fd->cq_ev_fd, 1);
+
+out:
+	rcu_read_unlock();
 }
 
 /*
@@ -1751,8 +1779,7 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 	 */
 	if (wq_has_sleeper(&ctx->cq_wait))
 		wake_up_all(&ctx->cq_wait);
-	if (io_should_trigger_evfd(ctx))
-		eventfd_signal(ctx->cq_ev_fd, 1);
+	io_eventfd_signal(ctx);
 }
 
 static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
@@ -1764,8 +1791,7 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
 		if (waitqueue_active(&ctx->cq_wait))
 			wake_up_all(&ctx->cq_wait);
 	}
-	if (io_should_trigger_evfd(ctx))
-		eventfd_signal(ctx->cq_ev_fd, 1);
+	io_eventfd_signal(ctx);
 }
 
 /* Returns true if there are no backlogged entries after the flush */
@@ -9353,34 +9379,76 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 {
+	struct io_ev_fd *ev_fd;
 	__s32 __user *fds = arg;
-	int fd;
+	int fd, ret;
 
-	if (ctx->cq_ev_fd)
-		return -EBUSY;
+	mutex_lock(&ctx->ev_fd_lock);
+	ret = -EBUSY;
+	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
+	if (ev_fd) {
+		/*
+		 * If ev_fd exists, there are 2 possibilities:
+		 * - The rcu_callback to io_eventfd_put hasn't finished while unregistering
+		 * (hence ev_fd->unregistering is true) and io_eventfd_register
+		 * can continue and overwrite ctx->io_ev_fd with the new eventfd.
+		 * - Or io_eventfd_register has been called on an io_uring that has
+		 * already registered a valid eventfd in which case return -EBUSY.
+		 */
+		if(!ev_fd->unregistering)
+			goto out;
+	}
 
+	ret = -EFAULT;
 	if (copy_from_user(&fd, fds, sizeof(*fds)))
-		return -EFAULT;
+		goto out;
 
-	ctx->cq_ev_fd = eventfd_ctx_fdget(fd);
-	if (IS_ERR(ctx->cq_ev_fd)) {
-		int ret = PTR_ERR(ctx->cq_ev_fd);
+	ret = -ENOMEM;
+	ev_fd = kmalloc(sizeof(*ev_fd), GFP_KERNEL);
+	if (!ev_fd)
+		goto out;
 
-		ctx->cq_ev_fd = NULL;
-		return ret;
+	ev_fd->cq_ev_fd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(ev_fd->cq_ev_fd)) {
+		ret = PTR_ERR(ev_fd->cq_ev_fd);
+		kfree(ev_fd);
+		goto out;
 	}
+	ev_fd->ctx = ctx;
+	ev_fd->unregistering = false;
 
-	return 0;
+	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
+	ret = 0;
+
+out:
+	mutex_unlock(&ctx->ev_fd_lock);
+	return ret;
+}
+
+static void io_eventfd_put(struct rcu_head *rcu)
+{
+	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
+	struct io_ring_ctx *ctx = ev_fd->ctx;
+
+	eventfd_ctx_put(ev_fd->cq_ev_fd);
+	kfree(ev_fd);
+	rcu_assign_pointer(ctx->io_ev_fd, NULL);
 }
 
 static int io_eventfd_unregister(struct io_ring_ctx *ctx)
 {
-	if (ctx->cq_ev_fd) {
-		eventfd_ctx_put(ctx->cq_ev_fd);
-		ctx->cq_ev_fd = NULL;
+	struct io_ev_fd *ev_fd;
+
+	mutex_lock(&ctx->ev_fd_lock);
+	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
+	if (ev_fd) {
+		ev_fd->unregistering = true;
+		call_rcu(&ev_fd->rcu, io_eventfd_put);
+		mutex_unlock(&ctx->ev_fd_lock);
 		return 0;
 	}
 
+	mutex_unlock(&ctx->ev_fd_lock);
 	return -ENXIO;
 }
 
@@ -10960,6 +11028,8 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_REGISTER_FILES:
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
+	case IORING_REGISTER_EVENTFD:
+	case IORING_UNREGISTER_EVENTFD:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
-- 
2.25.1



* [PATCH v5 3/4] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC
  2022-02-03 23:34 [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Usama Arif
  2022-02-03 23:34 ` [PATCH v5 1/4] io_uring: remove trace for eventfd Usama Arif
  2022-02-03 23:34 ` [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
@ 2022-02-03 23:34 ` Usama Arif
  2022-02-03 23:34 ` [PATCH v5 4/4] io_uring: remove ring quiesce for io_uring_register Usama Arif
  2022-02-04  0:02 ` [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Pavel Begunkov
  4 siblings, 0 replies; 13+ messages in thread
From: Usama Arif @ 2022-02-03 23:34 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

Ring quiesce is avoided here using the RCU data structure (io_ev_fd)
introduced in the previous patch: eventfd_async is moved from io_ring_ctx
to io_ev_fd, which is RCU protected, hence avoiding ring quiesce, which is
much more expensive than an RCU lock. The check formerly in
io_should_trigger_evfd already runs under rcu_read_lock in
io_eventfd_signal, so no extra RCU read-side critical section is needed.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 51602bddb9a8..5ae51ea12f0f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -329,6 +329,7 @@ struct io_submit_state {
 struct io_ev_fd {
 	struct eventfd_ctx	*cq_ev_fd;
 	struct io_ring_ctx	*ctx;
+	unsigned int		eventfd_async: 1;
 	struct rcu_head		rcu;
 	bool 			unregistering;
 };
@@ -342,7 +343,6 @@ struct io_ring_ctx {
 		unsigned int		flags;
 		unsigned int		compat: 1;
 		unsigned int		drain_next: 1;
-		unsigned int		eventfd_async: 1;
 		unsigned int		restricted: 1;
 		unsigned int		off_timeout_used: 1;
 		unsigned int		drain_active: 1;
@@ -1756,7 +1756,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
 		goto out;
 
-	if (!ctx->eventfd_async || io_wq_current_is_worker())
+	if (!ev_fd->eventfd_async || io_wq_current_is_worker())
 		eventfd_signal(ev_fd->cq_ev_fd, 1);
 
 out:
@@ -9377,7 +9377,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	return done ? done : err;
 }
 
-static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
+static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg,
+			       unsigned int eventfd_async)
 {
 	struct io_ev_fd *ev_fd;
 	__s32 __user *fds = arg;
@@ -9416,6 +9417,7 @@ static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 	}
 	ev_fd->ctx = ctx;
 	ev_fd->unregistering = false;
+	ev_fd->eventfd_async = eventfd_async;
 
 	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
 	ret = 0;
@@ -11029,6 +11031,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
 	case IORING_REGISTER_EVENTFD:
+	case IORING_REGISTER_EVENTFD_ASYNC:
 	case IORING_UNREGISTER_EVENTFD:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
@@ -11129,17 +11132,16 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 		ret = io_register_files_update(ctx, arg, nr_args);
 		break;
 	case IORING_REGISTER_EVENTFD:
-	case IORING_REGISTER_EVENTFD_ASYNC:
 		ret = -EINVAL;
 		if (nr_args != 1)
 			break;
-		ret = io_eventfd_register(ctx, arg);
-		if (ret)
+		ret = io_eventfd_register(ctx, arg, 0);
+		break;
+	case IORING_REGISTER_EVENTFD_ASYNC:
+		ret = -EINVAL;
+		if (nr_args != 1)
 			break;
-		if (opcode == IORING_REGISTER_EVENTFD_ASYNC)
-			ctx->eventfd_async = 1;
-		else
-			ctx->eventfd_async = 0;
+		ret = io_eventfd_register(ctx, arg, 1);
 		break;
 	case IORING_UNREGISTER_EVENTFD:
 		ret = -EINVAL;
-- 
2.25.1



* [PATCH v5 4/4] io_uring: remove ring quiesce for io_uring_register
  2022-02-03 23:34 [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Usama Arif
                   ` (2 preceding siblings ...)
  2022-02-03 23:34 ` [PATCH v5 3/4] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC Usama Arif
@ 2022-02-03 23:34 ` Usama Arif
  2022-02-03 23:47   ` Pavel Begunkov
  2022-02-04  0:02 ` [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Pavel Begunkov
  4 siblings, 1 reply; 13+ messages in thread
From: Usama Arif @ 2022-02-03 23:34 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

Ring quiesce is currently only used for 2 opcodes,
IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS.
IORING_SETUP_R_DISABLED prevents submitting requests, so there
will be no requests until IORING_REGISTER_ENABLE_RINGS is called.
And IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not
needed for these opcodes, and therefore not needed in
io_uring_register at all.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c | 69 ---------------------------------------------------
 1 file changed, 69 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5ae51ea12f0f..89e4dd7e8995 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11022,64 +11022,6 @@ static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
 	return ret;
 }
 
-static bool io_register_op_must_quiesce(int op)
-{
-	switch (op) {
-	case IORING_REGISTER_BUFFERS:
-	case IORING_UNREGISTER_BUFFERS:
-	case IORING_REGISTER_FILES:
-	case IORING_UNREGISTER_FILES:
-	case IORING_REGISTER_FILES_UPDATE:
-	case IORING_REGISTER_EVENTFD:
-	case IORING_REGISTER_EVENTFD_ASYNC:
-	case IORING_UNREGISTER_EVENTFD:
-	case IORING_REGISTER_PROBE:
-	case IORING_REGISTER_PERSONALITY:
-	case IORING_UNREGISTER_PERSONALITY:
-	case IORING_REGISTER_FILES2:
-	case IORING_REGISTER_FILES_UPDATE2:
-	case IORING_REGISTER_BUFFERS2:
-	case IORING_REGISTER_BUFFERS_UPDATE:
-	case IORING_REGISTER_IOWQ_AFF:
-	case IORING_UNREGISTER_IOWQ_AFF:
-	case IORING_REGISTER_IOWQ_MAX_WORKERS:
-		return false;
-	default:
-		return true;
-	}
-}
-
-static __cold int io_ctx_quiesce(struct io_ring_ctx *ctx)
-{
-	long ret;
-
-	percpu_ref_kill(&ctx->refs);
-
-	/*
-	 * Drop uring mutex before waiting for references to exit. If another
-	 * thread is currently inside io_uring_enter() it might need to grab the
-	 * uring_lock to make progress. If we hold it here across the drain
-	 * wait, then we can deadlock. It's safe to drop the mutex here, since
-	 * no new references will come in after we've killed the percpu ref.
-	 */
-	mutex_unlock(&ctx->uring_lock);
-	do {
-		ret = wait_for_completion_interruptible_timeout(&ctx->ref_comp, HZ);
-		if (ret) {
-			ret = min(0L, ret);
-			break;
-		}
-
-		ret = io_run_task_work_sig();
-		io_req_caches_free(ctx);
-	} while (ret >= 0);
-	mutex_lock(&ctx->uring_lock);
-
-	if (ret)
-		io_refs_resurrect(&ctx->refs, &ctx->ref_comp);
-	return ret;
-}
-
 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			       void __user *arg, unsigned nr_args)
 	__releases(ctx->uring_lock)
@@ -11103,12 +11045,6 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			return -EACCES;
 	}
 
-	if (io_register_op_must_quiesce(opcode)) {
-		ret = io_ctx_quiesce(ctx);
-		if (ret)
-			return ret;
-	}
-
 	switch (opcode) {
 	case IORING_REGISTER_BUFFERS:
 		ret = io_sqe_buffers_register(ctx, arg, nr_args, NULL);
@@ -11213,11 +11149,6 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 		break;
 	}
 
-	if (io_register_op_must_quiesce(opcode)) {
-		/* bring the ctx back to life */
-		percpu_ref_reinit(&ctx->refs);
-		reinit_completion(&ctx->ref_comp);
-	}
 	return ret;
 }
 
-- 
2.25.1



* Re: [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 23:34 ` [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
@ 2022-02-03 23:46   ` Pavel Begunkov
  2022-02-03 23:54     ` Pavel Begunkov
  2022-02-04  0:12     ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-03 23:46 UTC (permalink / raw)
  To: Usama Arif, io-uring, axboe, linux-kernel; +Cc: fam.zheng

On 2/3/22 23:34, Usama Arif wrote:
> Ring quiesce is avoided by creating a new RCU data structure (io_ev_fd) as
> part of io_ring_ctx that holds the eventfd_ctx.
> 
> The function io_eventfd_signal is executed under rcu_read_lock with a
> single rcu_dereference to io_ev_fd so that if another thread unregisters
> the eventfd while io_eventfd_signal is still being executed, the
> eventfd_signal for which io_eventfd_signal was called completes
> successfully.
> 
> The process of registering/unregistering eventfd is done under a lock
> so multiple threads don't enter a race condition while
> registering/unregistering eventfd.
> 
> With the above approach ring quiesce, which is much more expensive than an
> RCU lock, can be avoided. On the system tested, io_uring_register with
> IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
> 15ms before with ring quiesce.
> 
> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> ---
>   fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++----------
>   1 file changed, 93 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 21531609a9c6..51602bddb9a8 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -326,6 +326,13 @@ struct io_submit_state {
>   	struct blk_plug		plug;
>   };
>   
> +struct io_ev_fd {
> +	struct eventfd_ctx	*cq_ev_fd;
> +	struct io_ring_ctx	*ctx;
> +	struct rcu_head		rcu;
> +	bool 			unregistering;
> +};
> +
>   struct io_ring_ctx {
>   	/* const or read-mostly hot data */
>   	struct {
> @@ -399,7 +406,8 @@ struct io_ring_ctx {
>   	struct {
>   		unsigned		cached_cq_tail;
>   		unsigned		cq_entries;
> -		struct eventfd_ctx	*cq_ev_fd;
> +		struct io_ev_fd	__rcu	*io_ev_fd;
> +		struct mutex		ev_fd_lock;
>   		struct wait_queue_head	cq_wait;
>   		unsigned		cq_extra;
>   		atomic_t		cq_timeouts;
> @@ -1448,6 +1456,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>   	xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
>   	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
>   	mutex_init(&ctx->uring_lock);
> +	mutex_init(&ctx->ev_fd_lock);
>   	init_waitqueue_head(&ctx->cq_wait);
>   	spin_lock_init(&ctx->completion_lock);
>   	spin_lock_init(&ctx->timeout_lock);
> @@ -1726,13 +1735,32 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
>   	return &rings->cqes[tail & mask];
>   }
>   
> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>   {
> -	if (likely(!ctx->cq_ev_fd))
> -		return false;
> +	struct io_ev_fd *ev_fd;
> +
> +	/* Return quickly if ctx->io_ev_fd doesn't exist */
> +	if (likely(!rcu_dereference_raw(ctx->io_ev_fd)))
> +		return;
> +
> +	rcu_read_lock();
> +	/* rcu_dereference ctx->io_ev_fd once and use it for both for checking and eventfd_signal */
> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
> +
> +	/*
> +	 * Check again if ev_fd exists incase an io_eventfd_unregister call completed between
> +	 * the NULL check of ctx->io_ev_fd at the start of the function and rcu_read_lock.
> +	 */
> +	if (unlikely(!ev_fd))
> +		goto out;
>   	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
> -		return false;
> -	return !ctx->eventfd_async || io_wq_current_is_worker();
> +		goto out;
> +
> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
> +
> +out:
> +	rcu_read_unlock();
>   }
>   
>   /*
> @@ -1751,8 +1779,7 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
>   	 */
>   	if (wq_has_sleeper(&ctx->cq_wait))
>   		wake_up_all(&ctx->cq_wait);
> -	if (io_should_trigger_evfd(ctx))
> -		eventfd_signal(ctx->cq_ev_fd, 1);
> +	io_eventfd_signal(ctx);
>   }
>   
>   static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
> @@ -1764,8 +1791,7 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
>   		if (waitqueue_active(&ctx->cq_wait))
>   			wake_up_all(&ctx->cq_wait);
>   	}
> -	if (io_should_trigger_evfd(ctx))
> -		eventfd_signal(ctx->cq_ev_fd, 1);
> +	io_eventfd_signal(ctx);
>   }
>   
>   /* Returns true if there are no backlogged entries after the flush */
> @@ -9353,34 +9379,76 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>   
>   static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
>   {
> +	struct io_ev_fd *ev_fd;
>   	__s32 __user *fds = arg;
> -	int fd;
> +	int fd, ret;
>   
> -	if (ctx->cq_ev_fd)
> -		return -EBUSY;
> +	mutex_lock(&ctx->ev_fd_lock);
> +	ret = -EBUSY;
> +	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
> +	if (ev_fd) {
> +		/*
> +		 * If ev_fd exists, there are 2 possibilities:
> +		 * - The rcu_callback to io_eventfd_put hasn't finished while unregistering
> +		 * (hence ev_fd->unregistering is true) and io_eventfd_register
> +		 * can continue and overwrite ctx->io_ev_fd with the new eventfd.
> +		 * - Or io_eventfd_register has been called on an io_uring that has
> +		 * already registered a valid eventfd in which case return -EBUSY.
> +		 */
> +		if(!ev_fd->unregistering)
> +			goto out;
> +	}
>   
> +	ret = -EFAULT;
>   	if (copy_from_user(&fd, fds, sizeof(*fds)))
> -		return -EFAULT;
> +		goto out;
>   
> -	ctx->cq_ev_fd = eventfd_ctx_fdget(fd);
> -	if (IS_ERR(ctx->cq_ev_fd)) {
> -		int ret = PTR_ERR(ctx->cq_ev_fd);
> +	ret = -ENOMEM;
> +	ev_fd = kmalloc(sizeof(*ev_fd), GFP_KERNEL);
> +	if (!ev_fd)
> +		goto out;
>   
> -		ctx->cq_ev_fd = NULL;
> -		return ret;
> +	ev_fd->cq_ev_fd = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(ev_fd->cq_ev_fd)) {
> +		ret = PTR_ERR(ev_fd->cq_ev_fd);
> +		kfree(ev_fd);
> +		goto out;
>   	}
> +	ev_fd->ctx = ctx;
> +	ev_fd->unregistering = false;
>   
> -	return 0;
> +	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
> +	ret = 0;
> +
> +out:
> +	mutex_unlock(&ctx->ev_fd_lock);
> +	return ret;
> +}
> +
> +static void io_eventfd_put(struct rcu_head *rcu)
> +{
> +	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
> +	struct io_ring_ctx *ctx = ev_fd->ctx;
> +
> +	eventfd_ctx_put(ev_fd->cq_ev_fd);
> +	kfree(ev_fd);
> +	rcu_assign_pointer(ctx->io_ev_fd, NULL);
>   }

Emm, it happens after the grace period, so you have a gap where a
request may read a freed eventfd... What I think you wanted to do
is more like below:


io_eventfd_put() {
	evfd = ...;
	eventfd_ctx_put(evfd->evfd);
	kfree(io_ev_fd);
}

register() {
	mutex_lock();
	ev_fd = rcu_deref();
	if (ev_fd) {
		rcu_assign_pointer(ctx->evfd, NULL);
		call_rcu(&ev_fd->evfd, io_eventfd_put);
	}
	mutex_unlock();
}


Note, there's no need for ->unregistering. I also doubt you need
->ev_fd_lock; how about just using the already taken ->uring_lock?

-- 
Pavel Begunkov


* Re: [PATCH v5 4/4] io_uring: remove ring quiesce for io_uring_register
  2022-02-03 23:34 ` [PATCH v5 4/4] io_uring: remove ring quiesce for io_uring_register Usama Arif
@ 2022-02-03 23:47   ` Pavel Begunkov
  2022-02-04  0:28     ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-03 23:47 UTC (permalink / raw)
  To: Usama Arif, io-uring, axboe, linux-kernel; +Cc: fam.zheng

On 2/3/22 23:34, Usama Arif wrote:
> Ring quiesce is currently only used for 2 opcodes,
> IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS.
> IORING_SETUP_R_DISABLED prevents submitting requests, so there
> will be no requests until IORING_REGISTER_ENABLE_RINGS is called.
> And IORING_REGISTER_RESTRICTIONS works only before
> IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not
> needed for these opcodes, and therefore not needed in
> io_uring_register at all.

I think I'd prefer to retain the quiesce code rather than reverting this
patch later.



> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> ---
>   fs/io_uring.c | 69 ---------------------------------------------------
>   1 file changed, 69 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5ae51ea12f0f..89e4dd7e8995 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -11022,64 +11022,6 @@ static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
>   	return ret;
>   }
>   
> -static bool io_register_op_must_quiesce(int op)
> -{
> -	switch (op) {
> -	case IORING_REGISTER_BUFFERS:
> -	case IORING_UNREGISTER_BUFFERS:
> -	case IORING_REGISTER_FILES:
> -	case IORING_UNREGISTER_FILES:
> -	case IORING_REGISTER_FILES_UPDATE:
> -	case IORING_REGISTER_EVENTFD:
> -	case IORING_REGISTER_EVENTFD_ASYNC:
> -	case IORING_UNREGISTER_EVENTFD:
> -	case IORING_REGISTER_PROBE:
> -	case IORING_REGISTER_PERSONALITY:
> -	case IORING_UNREGISTER_PERSONALITY:
> -	case IORING_REGISTER_FILES2:
> -	case IORING_REGISTER_FILES_UPDATE2:
> -	case IORING_REGISTER_BUFFERS2:
> -	case IORING_REGISTER_BUFFERS_UPDATE:
> -	case IORING_REGISTER_IOWQ_AFF:
> -	case IORING_UNREGISTER_IOWQ_AFF:
> -	case IORING_REGISTER_IOWQ_MAX_WORKERS:
> -		return false;
> -	default:
> -		return true;
> -	}
> -}
> -
> -static __cold int io_ctx_quiesce(struct io_ring_ctx *ctx)
> -{
> -	long ret;
> -
> -	percpu_ref_kill(&ctx->refs);
> -
> -	/*
> -	 * Drop uring mutex before waiting for references to exit. If another
> -	 * thread is currently inside io_uring_enter() it might need to grab the
> -	 * uring_lock to make progress. If we hold it here across the drain
> -	 * wait, then we can deadlock. It's safe to drop the mutex here, since
> -	 * no new references will come in after we've killed the percpu ref.
> -	 */
> -	mutex_unlock(&ctx->uring_lock);
> -	do {
> -		ret = wait_for_completion_interruptible_timeout(&ctx->ref_comp, HZ);
> -		if (ret) {
> -			ret = min(0L, ret);
> -			break;
> -		}
> -
> -		ret = io_run_task_work_sig();
> -		io_req_caches_free(ctx);
> -	} while (ret >= 0);
> -	mutex_lock(&ctx->uring_lock);
> -
> -	if (ret)
> -		io_refs_resurrect(&ctx->refs, &ctx->ref_comp);
> -	return ret;
> -}
> -
>   static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>   			       void __user *arg, unsigned nr_args)
>   	__releases(ctx->uring_lock)
> @@ -11103,12 +11045,6 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>   			return -EACCES;
>   	}
>   
> -	if (io_register_op_must_quiesce(opcode)) {
> -		ret = io_ctx_quiesce(ctx);
> -		if (ret)
> -			return ret;
> -	}
> -
>   	switch (opcode) {
>   	case IORING_REGISTER_BUFFERS:
>   		ret = io_sqe_buffers_register(ctx, arg, nr_args, NULL);
> @@ -11213,11 +11149,6 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>   		break;
>   	}
>   
> -	if (io_register_op_must_quiesce(opcode)) {
> -		/* bring the ctx back to life */
> -		percpu_ref_reinit(&ctx->refs);
> -		reinit_completion(&ctx->ref_comp);
> -	}
>   	return ret;
>   }
>   

-- 
Pavel Begunkov


* Re: [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 23:46   ` Pavel Begunkov
@ 2022-02-03 23:54     ` Pavel Begunkov
  2022-02-04  0:12     ` Jens Axboe
  1 sibling, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-03 23:54 UTC (permalink / raw)
  To: Usama Arif, io-uring, axboe, linux-kernel; +Cc: fam.zheng

On 2/3/22 23:46, Pavel Begunkov wrote:
> On 2/3/22 23:34, Usama Arif wrote:
>> Ring quiesce is avoided by creating a new RCU data structure (io_ev_fd) as
>> part of io_ring_ctx that holds the eventfd_ctx.
>>
>> The function io_eventfd_signal is executed under rcu_read_lock with a
>> single rcu_dereference to io_ev_fd so that if another thread unregisters
>> the eventfd while io_eventfd_signal is still being executed, the
>> eventfd_signal for which io_eventfd_signal was called completes
>> successfully.
>>
>> The process of registering/unregistering eventfd is done under a lock
>> so multiple threads don't enter a race condition while
>> registering/unregistering eventfd.
>>
>> With the above approach ring quiesce, which is much more expensive than an
>> RCU lock, can be avoided. On the system tested, io_uring_register with
>> IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
>> 15ms before with ring quiesce.
>>
>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>> ---
>>   fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 93 insertions(+), 23 deletions(-)
>>
[...]
>> +
>> +static void io_eventfd_put(struct rcu_head *rcu)
>> +{
>> +    struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
>> +    struct io_ring_ctx *ctx = ev_fd->ctx;
>> +
>> +    eventfd_ctx_put(ev_fd->cq_ev_fd);
>> +    kfree(ev_fd);
>> +    rcu_assign_pointer(ctx->io_ev_fd, NULL);
>>   }
> 
> Emm, it happens after the grace period, so you have a gap where a
> request may read a freed eventfd... What I think you wanted to do
> is more like below:
> 
> 
> io_eventfd_put() {
>      evfd = ...;
>      eventfd_ctx_put(evfd->evfd);
>      kfree(io_ev_fd);
> }
> 
> register() {

s/register/unregister/

>      mutex_lock();
>      ev_fd = rcu_deref();
>      if (ev_fd) {
>          rcu_assign_pointer(ctx->evfd, NULL);
>          call_rcu(&ev_fd->evfd, io_eventfd_put);
>      }
>      mutex_unlock();
> }
> 
> 
> Note, there's no need for ->unregistering. I also doubt you need
> ->ev_fd_lock; how about just using the already taken ->uring_lock?

-- 
Pavel Begunkov


* Re: [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register
  2022-02-03 23:34 [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Usama Arif
                   ` (3 preceding siblings ...)
  2022-02-03 23:34 ` [PATCH v5 4/4] io_uring: remove ring quiesce for io_uring_register Usama Arif
@ 2022-02-04  0:02 ` Pavel Begunkov
  2022-02-04  0:15   ` Jens Axboe
  4 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-04  0:02 UTC (permalink / raw)
  To: Usama Arif, io-uring, axboe, linux-kernel; +Cc: fam.zheng

On 2/3/22 23:34, Usama Arif wrote:
> For opcodes relating to registering/unregistering eventfds, ring quiesce is
> avoided by creating a new RCU data structure (io_ev_fd) as part of io_ring_ctx
> that holds the eventfd_ctx, with reads to the structure protected by
> rcu_read_lock and writes (register/unregister calls) protected by a mutex.
> 
> With the above approach ring quiesce, which is much more expensive than an
> RCU lock, can be avoided. On the system tested, io_uring_register with
> IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
> 15ms before with ring quiesce.
> 
> The second patch creates the RCU protected data structure and removes ring
> quiesce for IORING_REGISTER_EVENTFD and IORING_UNREGISTER_EVENTFD.
> 
> The third patch builds on top of the second patch and removes ring quiesce
> for IORING_REGISTER_EVENTFD_ASYNC.
> 
> The fourth patch completely removes ring quiesce from io_uring_register,
> as IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS don't
> need it.

Let me leave it just for history: I strongly dislike it considering
there is no one who uses it or is going to use it. Even more, I can't find a
single user of io_uring_unregister_eventfd() in liburing tests, so most
probably the paths are not tested at all.



> ---
> v4->v5:
> - Remove ring quiesce completely from io_uring_register (Pavel Begunkov)
> - Replaced rcu_barrier with unregistering flag (Jens Axboe)
> - Created a faster check for ctx->io_ev_fd in io_eventfd_signal and cleaned up
> io_eventfd_unregister (Jens Axboe)
> 
> v3->v4:
> - Switch back to call_rcu and use rcu_barrier in case io_eventfd_register fails
> to make sure all rcu callbacks have finished.
> 
> v2->v3:
> - Switched to using synchronize_rcu from call_rcu in io_eventfd_unregister.
> 
> v1->v2:
> - Added patch to remove eventfd from tracepoint (Patch 1) (Jens Axboe)
> - Folded the code of io_should_trigger_evfd into io_eventfd_signal (Jens Axboe)
> 
> Usama Arif (4):
>    io_uring: remove trace for eventfd
>    io_uring: avoid ring quiesce while registering/unregistering eventfd
>    io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC
>    io_uring: remove ring quiesce for io_uring_register
> 
>   fs/io_uring.c                   | 202 ++++++++++++++++----------------
>   include/trace/events/io_uring.h |  13 +-
>   2 files changed, 107 insertions(+), 108 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [PATCH v5 2/4] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 23:46   ` Pavel Begunkov
  2022-02-03 23:54     ` Pavel Begunkov
@ 2022-02-04  0:12     ` Jens Axboe
  1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2022-02-04  0:12 UTC (permalink / raw)
  To: Pavel Begunkov, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 4:46 PM, Pavel Begunkov wrote:
> On 2/3/22 23:34, Usama Arif wrote:
>> Ring quiesce is avoided by creating a new RCU data structure (io_ev_fd) as
>> part of io_ring_ctx that holds the eventfd_ctx.
>>
>> The function io_eventfd_signal is executed under rcu_read_lock with a
>> single rcu_dereference to io_ev_fd so that if another thread unregisters
>> the eventfd while io_eventfd_signal is still being executed, the
>> eventfd_signal for which io_eventfd_signal was called completes
>> successfully.
>>
>> The process of registering/unregistering eventfd is done under a lock
>> so multiple threads don't enter a race condition while
>> registering/unregistering eventfd.
>>
>> With the above approach ring quiesce, which is much more expensive than an
>> RCU lock, can be avoided. On the system tested, io_uring_register with
>> IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
>> 15ms before with ring quiesce.
>>
>> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
>> ---
>>   fs/io_uring.c | 116 ++++++++++++++++++++++++++++++++++++++++----------
>>   1 file changed, 93 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 21531609a9c6..51602bddb9a8 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -326,6 +326,13 @@ struct io_submit_state {
>>   	struct blk_plug		plug;
>>   };
>>   
>> +struct io_ev_fd {
>> +	struct eventfd_ctx	*cq_ev_fd;
>> +	struct io_ring_ctx	*ctx;
>> +	struct rcu_head		rcu;
>> +	bool 			unregistering;
>> +};
>> +
>>   struct io_ring_ctx {
>>   	/* const or read-mostly hot data */
>>   	struct {
>> @@ -399,7 +406,8 @@ struct io_ring_ctx {
>>   	struct {
>>   		unsigned		cached_cq_tail;
>>   		unsigned		cq_entries;
>> -		struct eventfd_ctx	*cq_ev_fd;
>> +		struct io_ev_fd	__rcu	*io_ev_fd;
>> +		struct mutex		ev_fd_lock;
>>   		struct wait_queue_head	cq_wait;
>>   		unsigned		cq_extra;
>>   		atomic_t		cq_timeouts;
>> @@ -1448,6 +1456,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>>   	xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
>>   	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
>>   	mutex_init(&ctx->uring_lock);
>> +	mutex_init(&ctx->ev_fd_lock);
>>   	init_waitqueue_head(&ctx->cq_wait);
>>   	spin_lock_init(&ctx->completion_lock);
>>   	spin_lock_init(&ctx->timeout_lock);
>> @@ -1726,13 +1735,32 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
>>   	return &rings->cqes[tail & mask];
>>   }
>>   
>> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
>> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>>   {
>> -	if (likely(!ctx->cq_ev_fd))
>> -		return false;
>> +	struct io_ev_fd *ev_fd;
>> +
>> +	/* Return quickly if ctx->io_ev_fd doesn't exist */
>> +	if (likely(!rcu_dereference_raw(ctx->io_ev_fd)))
>> +		return;
>> +
>> +	rcu_read_lock();
>> +	/* rcu_dereference ctx->io_ev_fd once and use it for both for checking and eventfd_signal */
>> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
>> +
>> +	/*
>> +	 * Check again if ev_fd exists incase an io_eventfd_unregister call completed between
>> +	 * the NULL check of ctx->io_ev_fd at the start of the function and rcu_read_lock.
>> +	 */
>> +	if (unlikely(!ev_fd))
>> +		goto out;
>>   	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
>> -		return false;
>> -	return !ctx->eventfd_async || io_wq_current_is_worker();
>> +		goto out;
>> +
>> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
>> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
>> +
>> +out:
>> +	rcu_read_unlock();
>>   }
>>   
>>   /*
>> @@ -1751,8 +1779,7 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
>>   	 */
>>   	if (wq_has_sleeper(&ctx->cq_wait))
>>   		wake_up_all(&ctx->cq_wait);
>> -	if (io_should_trigger_evfd(ctx))
>> -		eventfd_signal(ctx->cq_ev_fd, 1);
>> +	io_eventfd_signal(ctx);
>>   }
>>   
>>   static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
>> @@ -1764,8 +1791,7 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
>>   		if (waitqueue_active(&ctx->cq_wait))
>>   			wake_up_all(&ctx->cq_wait);
>>   	}
>> -	if (io_should_trigger_evfd(ctx))
>> -		eventfd_signal(ctx->cq_ev_fd, 1);
>> +	io_eventfd_signal(ctx);
>>   }
>>   
>>   /* Returns true if there are no backlogged entries after the flush */
>> @@ -9353,34 +9379,76 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>>   
>>   static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
>>   {
>> +	struct io_ev_fd *ev_fd;
>>   	__s32 __user *fds = arg;
>> -	int fd;
>> +	int fd, ret;
>>   
>> -	if (ctx->cq_ev_fd)
>> -		return -EBUSY;
>> +	mutex_lock(&ctx->ev_fd_lock);
>> +	ret = -EBUSY;
>> +	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
>> +	if (ev_fd) {
>> +		/*
>> +		 * If ev_fd exists, there are 2 possibilities:
>> +		 * - The rcu_callback to io_eventfd_put hasn't finished while unregistering
>> +		 * (hence ev_fd->unregistering is true) and io_eventfd_register
>> +		 * can continue and overwrite ctx->io_ev_fd with the new eventfd.
>> +		 * - Or io_eventfd_register has been called on an io_uring that has
>> +		 * already registered a valid eventfd in which case return -EBUSY.
>> +		 */
>> +		if(!ev_fd->unregistering)
>> +			goto out;
>> +	}
>>   
>> +	ret = -EFAULT;
>>   	if (copy_from_user(&fd, fds, sizeof(*fds)))
>> -		return -EFAULT;
>> +		goto out;
>>   
>> -	ctx->cq_ev_fd = eventfd_ctx_fdget(fd);
>> -	if (IS_ERR(ctx->cq_ev_fd)) {
>> -		int ret = PTR_ERR(ctx->cq_ev_fd);
>> +	ret = -ENOMEM;
>> +	ev_fd = kmalloc(sizeof(*ev_fd), GFP_KERNEL);
>> +	if (!ev_fd)
>> +		goto out;
>>   
>> -		ctx->cq_ev_fd = NULL;
>> -		return ret;
>> +	ev_fd->cq_ev_fd = eventfd_ctx_fdget(fd);
>> +	if (IS_ERR(ev_fd->cq_ev_fd)) {
>> +		ret = PTR_ERR(ev_fd->cq_ev_fd);
>> +		kfree(ev_fd);
>> +		goto out;
>>   	}
>> +	ev_fd->ctx = ctx;
>> +	ev_fd->unregistering = false;
>>   
>> -	return 0;
>> +	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
>> +	ret = 0;
>> +
>> +out:
>> +	mutex_unlock(&ctx->ev_fd_lock);
>> +	return ret;
>> +}
>> +
>> +static void io_eventfd_put(struct rcu_head *rcu)
>> +{
>> +	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);
>> +	struct io_ring_ctx *ctx = ev_fd->ctx;
>> +
>> +	eventfd_ctx_put(ev_fd->cq_ev_fd);
>> +	kfree(ev_fd);
>> +	rcu_assign_pointer(ctx->io_ev_fd, NULL);
>>   }
> 
> Emm, it happens after the grace period, so you have a gap where a
> request may read a freed eventfd... What I think you wanted to do
> is more like below:
> 
> 
> io_eventfd_put() {
> 	evfd = ...;
> 	eventfd_ctx_put(evfd->evfd);
> 	kfree(io_ev_fd);
> }
> 
> register() {
> 	mutex_lock();
> 	ev_fd = rcu_deref();
> 	if (ev_fd) {
> 		rcu_assign_pointer(ctx->evfd, NULL);
> 		call_rcu(&ev_fd->evfd, io_eventfd_put);
> 	}
> 	mutex_unlock();
> }
> 
> 
> Note, there's no need for ->unregistering. I also doubt you need
> ->ev_fd_lock; how about just using the already taken ->uring_lock?

Agree on both, for this scheme to work it's also critical that we assign
evfd = NULL before call_rcu() is done, not in the callback. Ditto on the
lock, the register side is under uring_lock anyway, so this one doesn't
add anything.
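
So roughly like the below (just a sketch using this patch's names, not the
final code): clear the pointer under ->uring_lock before call_rcu(), so no
reader can observe a freed ev_fd.

static void io_eventfd_put(struct rcu_head *rcu)
{
	struct io_ev_fd *ev_fd = container_of(rcu, struct io_ev_fd, rcu);

	eventfd_ctx_put(ev_fd->cq_ev_fd);
	kfree(ev_fd);
}

static int io_eventfd_unregister(struct io_ring_ctx *ctx)
{
	struct io_ev_fd *ev_fd;

	ev_fd = rcu_dereference_protected(ctx->io_ev_fd,
					lockdep_is_held(&ctx->uring_lock));
	if (!ev_fd)
		return -ENXIO;

	/* hide ev_fd from new readers first... */
	rcu_assign_pointer(ctx->io_ev_fd, NULL);
	/* ...and free it only after the grace period */
	call_rcu(&ev_fd->rcu, io_eventfd_put);
	return 0;
}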


-- 
Jens Axboe



* Re: [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register
  2022-02-04  0:02 ` [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register Pavel Begunkov
@ 2022-02-04  0:15   ` Jens Axboe
  2022-02-04  0:24     ` Pavel Begunkov
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2022-02-04  0:15 UTC (permalink / raw)
  To: Pavel Begunkov, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 5:02 PM, Pavel Begunkov wrote:
> On 2/3/22 23:34, Usama Arif wrote:
>> For opcodes relating to registering/unregistering eventfds, ring quiesce is
>> avoided by creating a new RCU data structure (io_ev_fd) as part of io_ring_ctx
>> that holds the eventfd_ctx, with reads to the structure protected by
>> rcu_read_lock and writes (register/unregister calls) protected by a mutex.
>>
>> With the above approach ring quiesce, which is much more expensive than an
>> RCU lock, can be avoided. On the system tested, io_uring_register with
>> IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
>> 15ms before with ring quiesce.
>>
>> The second patch creates the RCU protected data structure and removes ring
>> quiesce for IORING_REGISTER_EVENTFD and IORING_UNREGISTER_EVENTFD.
>>
>> The third patch builds on top of the second patch and removes ring quiesce
>> for IORING_REGISTER_EVENTFD_ASYNC.
>>
>> The fourth patch completely removes ring quiesce from io_uring_register,
>> as IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS don't
>> need it.
> 
> Let me leave it just for history: I strongly dislike it considering
> there is no one who uses it or is going to use it.

Are you referring to the 4th patch? Or the patchset as a whole? Not clear
to me, because eventfd registration is most certainly used by folks
today.

> Even more, I can't find a single user of io_uring_unregister_eventfd()
> in liburing tests, so most probably the paths are not tested at all.

That's definitely a general issue, not related to this patchset.
Something that most certainly should get added! Ring exit will use the
same unregister path for eventfd, however, so it does get exercised from
there with existing tests too.

But for this change, we definitely need a test that exercises both
register and unregister, trying to trigger something funky there.
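
Something along these lines as a starting point (rough sketch against the
liburing API; a proper test would also drive CQEs from another thread to
race against io_eventfd_signal):

#include <liburing.h>
#include <sys/eventfd.h>
#include <errno.h>
#include <assert.h>

int main(void)
{
	struct io_uring ring;
	int efd, i;

	assert(io_uring_queue_init(8, &ring, 0) == 0);
	efd = eventfd(0, 0);
	assert(efd >= 0);

	for (i = 0; i < 1000; i++) {
		assert(io_uring_register_eventfd(&ring, efd) == 0);
		/* double registration must fail */
		assert(io_uring_register_eventfd(&ring, efd) == -EBUSY);
		assert(io_uring_unregister_eventfd(&ring) == 0);
	}
	/* nothing registered anymore */
	assert(io_uring_unregister_eventfd(&ring) == -ENXIO);

	io_uring_queue_exit(&ring);
	return 0;
}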

-- 
Jens Axboe



* Re: [PATCH v5 0/4] io_uring: remove ring quiesce in io_uring_register
  2022-02-04  0:15   ` Jens Axboe
@ 2022-02-04  0:24     ` Pavel Begunkov
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-04  0:24 UTC (permalink / raw)
  To: Jens Axboe, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/4/22 00:15, Jens Axboe wrote:
> On 2/3/22 5:02 PM, Pavel Begunkov wrote:
>> On 2/3/22 23:34, Usama Arif wrote:
>>> For opcodes relating to registering/unregistering eventfds, ring quiesce is
>>> avoided by creating a new RCU data structure (io_ev_fd) as part of io_ring_ctx
>>> that holds the eventfd_ctx, with reads to the structure protected by
>>> rcu_read_lock and writes (register/unregister calls) protected by a mutex.
>>>
>>> With the above approach ring quiesce, which is much more expensive than an
>>> RCU lock, can be avoided. On the system tested, io_uring_register with
>>> IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock, compared to
>>> 15ms before with ring quiesce.
>>>
>>> The second patch creates the RCU protected data structure and removes ring
>>> quiesce for IORING_REGISTER_EVENTFD and IORING_UNREGISTER_EVENTFD.
>>>
>>> The third patch builds on top of the second patch and removes ring quiesce
>>> for IORING_REGISTER_EVENTFD_ASYNC.
>>>
>>> The fourth patch completely removes ring quiesce from io_uring_register,
>>> as IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS don't
>>> need it.
>>
>> Let me leave it just for history: I strongly dislike it considering
>> there is no one who uses it or is going to use it.
> 
> Are you referring to the 4th patch? Or the patchset as a whole? Not clear
> to me, because eventfd registration is most certainly used by folks
> today.

I refer to optimising eventfd unregister with no users of it, which
led to the RCU approach.

1/4 is good, taking ENABLE_RINGS and RESTRICTIONS out of quiesce is
also great. 4/4 per se is not a problem, even if I'd need to revert
it later.

>> Even more, I can't find a single user of io_uring_unregister_eventfd()
>> in liburing tests, so most probably the paths are not tested at all.
> 
> That's definitely a general issue, not related to this patchset.
> Something that most certainly should get added! Ring exit will use the
> same unregister path for eventfd, however, so it does get exercised from
> there with existing tests too.

io_ring_ctx_free()
  -> io_eventfd_unregister()

It's called after full quiesce in io_ring_exit_work() + even more
extra sync, so it doesn't exercise the path completely.

> 
> But for this change, we definitely need a test that exercises both
> register and unregister, trying to trigger something funky there.
> 

-- 
Pavel Begunkov


* Re: [PATCH v5 4/4] io_uring: remove ring quiesce for io_uring_register
  2022-02-03 23:47   ` Pavel Begunkov
@ 2022-02-04  0:28     ` Pavel Begunkov
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-04  0:28 UTC (permalink / raw)
  To: Usama Arif, io-uring, axboe, linux-kernel; +Cc: fam.zheng

On 2/3/22 23:47, Pavel Begunkov wrote:
> On 2/3/22 23:34, Usama Arif wrote:
>> Ring quiesce is currently only used for 2 opcodes,
>> IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS.
>> IORING_SETUP_R_DISABLED prevents submitting requests, so there
>> will be no requests until IORING_REGISTER_ENABLE_RINGS is called.
>> And IORING_REGISTER_RESTRICTIONS works only before
>> IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not
>> needed for these opcodes, and therefore not needed in
>> io_uring_register at all.
> 
> I think I'd prefer to retain the quiesce code rather than reverting this
> patch later.

btw, if it gets to reverting, it'll be easier if this patch
is split in 2. The first puts these 2 opcodes into
io_register_op_must_quiesce(); we definitely want to keep
that. And the other does the rest of the cleanup.

-- 
Pavel Begunkov

