All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] io_uring: avoid ring quiesce in io_uring_register for eventfd opcodes
@ 2022-02-03 17:41 Usama Arif
  2022-02-03 17:41 ` [PATCH v3 1/3] io_uring: remove trace for eventfd Usama Arif
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Usama Arif @ 2022-02-03 17:41 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx, with reads to the structure protected
by rcu_read_lock and writes (register/unregister calls) protected by a mutex.

With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_reigster with
IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared to 15ms
before with ring quiesce.

The first patch creates the RCU protected data structure and removes ring
quiesce for IORING_REGISTER_EVENTFD and IORING_UNREGISTER_EVENTFD.

The second patch builds on top of the first patch and removes ring quiesce
for IORING_REGISTER_EVENTFD_ASYNC.

---
v2->v3:
- Switched to using synchronize_rcu from call_rcu in io_eventfd_unregister

v1->v2:
- Added patch to remove eventfd from tracepoint (Patch 1) (Jens Axboe)
- Made the code of io_should_trigger_evfd as part of io_eventfd_signal (Jens Axboe)

Usama Arif (3):
  io_uring: remove trace for eventfd
  io_uring: avoid ring quiesce while registering/unregistering eventfd
  io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC

 fs/io_uring.c                   | 114 ++++++++++++++++++++++----------
 include/trace/events/io_uring.h |  13 ++--
 2 files changed, 83 insertions(+), 44 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v3 1/3] io_uring: remove trace for eventfd
  2022-02-03 17:41 [PATCH v3 0/3] io_uring: avoid ring quiesce in io_uring_register for eventfd opcodes Usama Arif
@ 2022-02-03 17:41 ` Usama Arif
  2022-02-03 17:41 ` [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
  2022-02-03 17:41 ` [PATCH v3 3/3] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC Usama Arif
  2 siblings, 0 replies; 16+ messages in thread
From: Usama Arif @ 2022-02-03 17:41 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

The information on whether eventfd is registered is not very useful
and would result in the tracepoint being enclosed in an rcu_readlock
in a later patch that tries to avoid ring quiesce for registering
eventfd.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c                   |  3 +--
 include/trace/events/io_uring.h | 13 +++++--------
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e04f718319d..21531609a9c6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11171,8 +11171,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 	mutex_lock(&ctx->uring_lock);
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
 	mutex_unlock(&ctx->uring_lock);
-	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs,
-							ctx->cq_ev_fd != NULL, ret);
+	trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs, ret);
 out_fput:
 	fdput(f);
 	return ret;
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 7346f0164cf4..098beda7601a 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -57,10 +57,9 @@ TRACE_EVENT(io_uring_create,
  * @opcode:		describes which operation to perform
  * @nr_user_files:	number of registered files
  * @nr_user_bufs:	number of registered buffers
- * @cq_ev_fd:		whether eventfs registered or not
  * @ret:		return code
  *
- * Allows to trace fixed files/buffers/eventfds, that could be registered to
+ * Allows to trace fixed files/buffers, that could be registered to
  * avoid an overhead of getting references to them for every operation. This
  * event, together with io_uring_file_get, can provide a full picture of how
  * much overhead one can reduce via fixing.
@@ -68,16 +67,15 @@ TRACE_EVENT(io_uring_create,
 TRACE_EVENT(io_uring_register,
 
 	TP_PROTO(void *ctx, unsigned opcode, unsigned nr_files,
-			 unsigned nr_bufs, bool eventfd, long ret),
+			 unsigned nr_bufs, long ret),
 
-	TP_ARGS(ctx, opcode, nr_files, nr_bufs, eventfd, ret),
+	TP_ARGS(ctx, opcode, nr_files, nr_bufs, ret),
 
 	TP_STRUCT__entry (
 		__field(  void *,	ctx			)
 		__field(  unsigned,	opcode		)
 		__field(  unsigned,	nr_files	)
 		__field(  unsigned,	nr_bufs		)
-		__field(  bool,		eventfd		)
 		__field(  long,		ret			)
 	),
 
@@ -86,14 +84,13 @@ TRACE_EVENT(io_uring_register,
 		__entry->opcode		= opcode;
 		__entry->nr_files	= nr_files;
 		__entry->nr_bufs	= nr_bufs;
-		__entry->eventfd	= eventfd;
 		__entry->ret		= ret;
 	),
 
 	TP_printk("ring %p, opcode %d, nr_user_files %d, nr_user_bufs %d, "
-			  "eventfd %d, ret %ld",
+			  "ret %ld",
 			  __entry->ctx, __entry->opcode, __entry->nr_files,
-			  __entry->nr_bufs, __entry->eventfd, __entry->ret)
+			  __entry->nr_bufs, __entry->ret)
 );
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 17:41 [PATCH v3 0/3] io_uring: avoid ring quiesce in io_uring_register for eventfd opcodes Usama Arif
  2022-02-03 17:41 ` [PATCH v3 1/3] io_uring: remove trace for eventfd Usama Arif
@ 2022-02-03 17:41 ` Usama Arif
  2022-02-03 17:56   ` Jens Axboe
  2022-02-03 17:41 ` [PATCH v3 3/3] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC Usama Arif
  2 siblings, 1 reply; 16+ messages in thread
From: Usama Arif @ 2022-02-03 17:41 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

The process of registering/unregistering eventfd is done under a lock
so multiple threads don't enter a race condition while
registering/unregistering eventfd.

With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_reigster with
IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared to 15ms
before with ring quiesce.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c | 91 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 21531609a9c6..47d48020ae27 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -326,6 +326,10 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
+struct io_ev_fd {
+	struct eventfd_ctx	*cq_ev_fd;
+};
+
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
@@ -399,7 +403,8 @@ struct io_ring_ctx {
 	struct {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
-		struct eventfd_ctx	*cq_ev_fd;
+		struct io_ev_fd	__rcu	*io_ev_fd;
+		struct mutex		ev_fd_lock;
 		struct wait_queue_head	cq_wait;
 		unsigned		cq_extra;
 		atomic_t		cq_timeouts;
@@ -1448,6 +1453,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
+	mutex_init(&ctx->ev_fd_lock);
 	init_waitqueue_head(&ctx->cq_wait);
 	spin_lock_init(&ctx->completion_lock);
 	spin_lock_init(&ctx->timeout_lock);
@@ -1726,13 +1732,24 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
 	return &rings->cqes[tail & mask];
 }
 
-static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
+static void io_eventfd_signal(struct io_ring_ctx *ctx)
 {
-	if (likely(!ctx->cq_ev_fd))
-		return false;
+	struct io_ev_fd *ev_fd;
+
+	rcu_read_lock();
+	/* rcu_dereference ctx->io_ev_fd once and use it for both for checking and eventfd_signal */
+	ev_fd = rcu_dereference(ctx->io_ev_fd);
+
+	if (likely(!ev_fd))
+		goto out;
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
-		return false;
-	return !ctx->eventfd_async || io_wq_current_is_worker();
+		goto out;
+
+	if (!ctx->eventfd_async || io_wq_current_is_worker())
+		eventfd_signal(ev_fd->cq_ev_fd, 1);
+
+out:
+	rcu_read_unlock();
 }
 
 /*
@@ -1751,8 +1768,7 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 	 */
 	if (wq_has_sleeper(&ctx->cq_wait))
 		wake_up_all(&ctx->cq_wait);
-	if (io_should_trigger_evfd(ctx))
-		eventfd_signal(ctx->cq_ev_fd, 1);
+	io_eventfd_signal(ctx);
 }
 
 static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
@@ -1764,8 +1780,7 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
 		if (waitqueue_active(&ctx->cq_wait))
 			wake_up_all(&ctx->cq_wait);
 	}
-	if (io_should_trigger_evfd(ctx))
-		eventfd_signal(ctx->cq_ev_fd, 1);
+	io_eventfd_signal(ctx);
 }
 
 /* Returns true if there are no backlogged entries after the flush */
@@ -9353,35 +9368,59 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 
 static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 {
+	struct io_ev_fd *ev_fd;
 	__s32 __user *fds = arg;
-	int fd;
+	int fd, ret;
 
-	if (ctx->cq_ev_fd)
-		return -EBUSY;
+	mutex_lock(&ctx->ev_fd_lock);
+	ret = -EBUSY;
+	if (rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock)))
+		goto out;
 
+	ret = -EFAULT;
 	if (copy_from_user(&fd, fds, sizeof(*fds)))
-		return -EFAULT;
+		goto out;
 
-	ctx->cq_ev_fd = eventfd_ctx_fdget(fd);
-	if (IS_ERR(ctx->cq_ev_fd)) {
-		int ret = PTR_ERR(ctx->cq_ev_fd);
+	ret = -ENOMEM;
+	ev_fd = kmalloc(sizeof(*ev_fd), GFP_KERNEL);
+	if (!ev_fd)
+		goto out;
 
-		ctx->cq_ev_fd = NULL;
-		return ret;
+	ev_fd->cq_ev_fd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(ev_fd->cq_ev_fd)) {
+		ret = PTR_ERR(ev_fd->cq_ev_fd);
+		kfree(ev_fd);
+		goto out;
 	}
 
-	return 0;
+	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
+	ret = 0;
+
+out:
+	mutex_unlock(&ctx->ev_fd_lock);
+	return ret;
 }
 
 static int io_eventfd_unregister(struct io_ring_ctx *ctx)
 {
-	if (ctx->cq_ev_fd) {
-		eventfd_ctx_put(ctx->cq_ev_fd);
-		ctx->cq_ev_fd = NULL;
-		return 0;
+	struct io_ev_fd *ev_fd;
+	int ret;
+
+	mutex_lock(&ctx->ev_fd_lock);
+	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
+	if (!ev_fd) {
+		ret = -ENXIO;
+		goto out;
 	}
+	synchronize_rcu();
+	eventfd_ctx_put(ev_fd->cq_ev_fd);
+	kfree(ev_fd);
+	rcu_assign_pointer(ctx->io_ev_fd, NULL);
+	ret = 0;
 
-	return -ENXIO;
+out:
+	mutex_unlock(&ctx->ev_fd_lock);
+	return ret;
 }
 
 static void io_destroy_buffers(struct io_ring_ctx *ctx)
@@ -10960,6 +10999,8 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_REGISTER_FILES:
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
+	case IORING_REGISTER_EVENTFD:
+	case IORING_UNREGISTER_EVENTFD:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
 	case IORING_UNREGISTER_PERSONALITY:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v3 3/3] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC
  2022-02-03 17:41 [PATCH v3 0/3] io_uring: avoid ring quiesce in io_uring_register for eventfd opcodes Usama Arif
  2022-02-03 17:41 ` [PATCH v3 1/3] io_uring: remove trace for eventfd Usama Arif
  2022-02-03 17:41 ` [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
@ 2022-02-03 17:41 ` Usama Arif
  2 siblings, 0 replies; 16+ messages in thread
From: Usama Arif @ 2022-02-03 17:41 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence, linux-kernel; +Cc: fam.zheng, Usama Arif

This is done using the RCU data structure (io_ev_fd). eventfd_async
is moved from io_ring_ctx to io_ev_fd which is RCU protected hence
avoiding ring quiesce which is much more expensive than an RCU lock.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
---
 fs/io_uring.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 47d48020ae27..25ed86533910 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -328,6 +328,7 @@ struct io_submit_state {
 
 struct io_ev_fd {
 	struct eventfd_ctx	*cq_ev_fd;
+	unsigned int		eventfd_async: 1;
 };
 
 struct io_ring_ctx {
@@ -339,7 +340,6 @@ struct io_ring_ctx {
 		unsigned int		flags;
 		unsigned int		compat: 1;
 		unsigned int		drain_next: 1;
-		unsigned int		eventfd_async: 1;
 		unsigned int		restricted: 1;
 		unsigned int		off_timeout_used: 1;
 		unsigned int		drain_active: 1;
@@ -1745,7 +1745,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
 	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
 		goto out;
 
-	if (!ctx->eventfd_async || io_wq_current_is_worker())
+	if (!ev_fd->eventfd_async || io_wq_current_is_worker())
 		eventfd_signal(ev_fd->cq_ev_fd, 1);
 
 out:
@@ -9366,7 +9366,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	return done ? done : err;
 }
 
-static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
+static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg,
+			       unsigned int eventfd_async)
 {
 	struct io_ev_fd *ev_fd;
 	__s32 __user *fds = arg;
@@ -9392,6 +9393,7 @@ static int io_eventfd_register(struct io_ring_ctx *ctx, void __user *arg)
 		kfree(ev_fd);
 		goto out;
 	}
+	ev_fd->eventfd_async = eventfd_async;
 
 	rcu_assign_pointer(ctx->io_ev_fd, ev_fd);
 	ret = 0;
@@ -11000,6 +11002,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_UNREGISTER_FILES:
 	case IORING_REGISTER_FILES_UPDATE:
 	case IORING_REGISTER_EVENTFD:
+	case IORING_REGISTER_EVENTFD_ASYNC:
 	case IORING_UNREGISTER_EVENTFD:
 	case IORING_REGISTER_PROBE:
 	case IORING_REGISTER_PERSONALITY:
@@ -11100,17 +11103,16 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 		ret = io_register_files_update(ctx, arg, nr_args);
 		break;
 	case IORING_REGISTER_EVENTFD:
-	case IORING_REGISTER_EVENTFD_ASYNC:
 		ret = -EINVAL;
 		if (nr_args != 1)
 			break;
-		ret = io_eventfd_register(ctx, arg);
-		if (ret)
+		ret = io_eventfd_register(ctx, arg, 0);
+		break;
+	case IORING_REGISTER_EVENTFD_ASYNC:
+		ret = -EINVAL;
+		if (nr_args != 1)
 			break;
-		if (opcode == IORING_REGISTER_EVENTFD_ASYNC)
-			ctx->eventfd_async = 1;
-		else
-			ctx->eventfd_async = 0;
+		ret = io_eventfd_register(ctx, arg, 1);
 		break;
 	case IORING_UNREGISTER_EVENTFD:
 		ret = -EINVAL;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 17:41 ` [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
@ 2022-02-03 17:56   ` Jens Axboe
  2022-02-03 18:26     ` [External] " Usama Arif
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2022-02-03 17:56 UTC (permalink / raw)
  To: Usama Arif, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng

On 2/3/22 10:41 AM, Usama Arif wrote:
> @@ -1726,13 +1732,24 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
>  	return &rings->cqes[tail & mask];
>  }
>  
> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>  {
> -	if (likely(!ctx->cq_ev_fd))
> -		return false;
> +	struct io_ev_fd *ev_fd;
> +
> +	rcu_read_lock();
> +	/* rcu_dereference ctx->io_ev_fd once and use it for both for checking and eventfd_signal */
> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
> +
> +	if (likely(!ev_fd))
> +		goto out;
>  	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
> -		return false;
> -	return !ctx->eventfd_async || io_wq_current_is_worker();
> +		goto out;
> +
> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
> +
> +out:
> +	rcu_read_unlock();
>  }

Like Pavel pointed out, we still need the fast path (of not having an
event fd registered at all) to just do the cheap check and not need rcu
lock/unlock. Outside of that, I think this looks fine.

>  static int io_eventfd_unregister(struct io_ring_ctx *ctx)
>  {
> -	if (ctx->cq_ev_fd) {
> -		eventfd_ctx_put(ctx->cq_ev_fd);
> -		ctx->cq_ev_fd = NULL;
> -		return 0;
> +	struct io_ev_fd *ev_fd;
> +	int ret;
> +
> +	mutex_lock(&ctx->ev_fd_lock);
> +	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
> +	if (!ev_fd) {
> +		ret = -ENXIO;
> +		goto out;
>  	}
> +	synchronize_rcu();
> +	eventfd_ctx_put(ev_fd->cq_ev_fd);
> +	kfree(ev_fd);
> +	rcu_assign_pointer(ctx->io_ev_fd, NULL);
> +	ret = 0;
>  
> -	return -ENXIO;
> +out:
> +	mutex_unlock(&ctx->ev_fd_lock);
> +	return ret;
>  }

synchronize_rcu() can take a long time, and I think this is in the wrong
spot. It should be on the register side, IFF we need to expedite the
completion of a previous event fd unregistration. If we do it that way,
at least it'll only happen if it's necessary. What do you think?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 17:56   ` Jens Axboe
@ 2022-02-03 18:26     ` Usama Arif
  2022-02-03 18:29       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Usama Arif @ 2022-02-03 18:26 UTC (permalink / raw)
  To: Jens Axboe, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng



On 03/02/2022 17:56, Jens Axboe wrote:
> On 2/3/22 10:41 AM, Usama Arif wrote:
>> @@ -1726,13 +1732,24 @@ static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
>>   	return &rings->cqes[tail & mask];
>>   }
>>   
>> -static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
>> +static void io_eventfd_signal(struct io_ring_ctx *ctx)
>>   {
>> -	if (likely(!ctx->cq_ev_fd))
>> -		return false;
>> +	struct io_ev_fd *ev_fd;
>> +
>> +	rcu_read_lock();
>> +	/* rcu_dereference ctx->io_ev_fd once and use it for both for checking and eventfd_signal */
>> +	ev_fd = rcu_dereference(ctx->io_ev_fd);
>> +
>> +	if (likely(!ev_fd))
>> +		goto out;
>>   	if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
>> -		return false;
>> -	return !ctx->eventfd_async || io_wq_current_is_worker();
>> +		goto out;
>> +
>> +	if (!ctx->eventfd_async || io_wq_current_is_worker())
>> +		eventfd_signal(ev_fd->cq_ev_fd, 1);
>> +
>> +out:
>> +	rcu_read_unlock();
>>   }
> 
> Like Pavel pointed out, we still need the fast path (of not having an
> event fd registered at all) to just do the cheap check and not need rcu
> lock/unlock. Outside of that, I think this looks fine.
> 

Hmm, maybe i didn't understand you and Pavel correctly. Are you 
suggesting to do the below diff over patch 3? I dont think that would be 
correct, as it is possible that just after checking if ctx->io_ev_fd is 
present unregister can be called by another thread and set ctx->io_ev_fd 
to NULL that would cause a NULL pointer exception later? In the current 
patch, the check of whether ev_fd exists happens as the first thing 
after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 25ed86533910..0cf282fba14d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1736,12 +1736,13 @@ static void io_eventfd_signal(struct io_ring_ctx 
*ctx)
  {
         struct io_ev_fd *ev_fd;

+       if (likely(!ctx->io_ev_fd))
+               return;
+
         rcu_read_lock();
         /* rcu_dereference ctx->io_ev_fd once and use it for both for 
checking and eventfd_signal */
         ev_fd = rcu_dereference(ctx->io_ev_fd);

-       if (likely(!ev_fd))
-               goto out;
         if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
                 goto out;




>>   static int io_eventfd_unregister(struct io_ring_ctx *ctx)
>>   {
>> -	if (ctx->cq_ev_fd) {
>> -		eventfd_ctx_put(ctx->cq_ev_fd);
>> -		ctx->cq_ev_fd = NULL;
>> -		return 0;
>> +	struct io_ev_fd *ev_fd;
>> +	int ret;
>> +
>> +	mutex_lock(&ctx->ev_fd_lock);
>> +	ev_fd = rcu_dereference_protected(ctx->io_ev_fd, lockdep_is_held(&ctx->ev_fd_lock));
>> +	if (!ev_fd) {
>> +		ret = -ENXIO;
>> +		goto out;
>>   	}
>> +	synchronize_rcu();
>> +	eventfd_ctx_put(ev_fd->cq_ev_fd);
>> +	kfree(ev_fd);
>> +	rcu_assign_pointer(ctx->io_ev_fd, NULL);
>> +	ret = 0;
>>   
>> -	return -ENXIO;
>> +out:
>> +	mutex_unlock(&ctx->ev_fd_lock);
>> +	return ret;
>>   }
> 
> synchronize_rcu() can take a long time, and I think this is in the wrong
> spot. It should be on the register side, IFF we need to expedite the
> completion of a previous event fd unregistration. If we do it that way,
> at least it'll only happen if it's necessary. What do you think?
> 


How about the approach in v4? so switching back to call_rcu as in v2 and 
if ctx->io_ev_fd is NULL then we call rcu_barrier to make sure all rcu 
callbacks are finished and check for NULL again.

Thanks!
Usama

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 18:26     ` [External] " Usama Arif
@ 2022-02-03 18:29       ` Jens Axboe
  2022-02-03 19:00         ` Pavel Begunkov
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2022-02-03 18:29 UTC (permalink / raw)
  To: Usama Arif, io-uring, asml.silence, linux-kernel; +Cc: fam.zheng

On 2/3/22 11:26 AM, Usama Arif wrote:
> Hmm, maybe i didn't understand you and Pavel correctly. Are you 
> suggesting to do the below diff over patch 3? I dont think that would be 
> correct, as it is possible that just after checking if ctx->io_ev_fd is 
> present unregister can be called by another thread and set ctx->io_ev_fd 
> to NULL that would cause a NULL pointer exception later? In the current 
> patch, the check of whether ev_fd exists happens as the first thing 
> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.

They are cheap, but they are still noticeable at high requests/sec
rates. So would be best to avoid them.

And yes it's obviously racy, there's the potential to miss an eventfd
notification if it races with registering an eventfd descriptor. But
that's not really a concern, as if you register with inflight IO
pending, then that always exists just depending on timing. The only
thing I care about here is that it's always _safe_. Hence something ala
what you did below is totally fine, as we're re-evaluating under rcu
protection.

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 25ed86533910..0cf282fba14d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1736,12 +1736,13 @@ static void io_eventfd_signal(struct io_ring_ctx 
> *ctx)
>   {
>          struct io_ev_fd *ev_fd;
> 
> +       if (likely(!ctx->io_ev_fd))
> +               return;
> +
>          rcu_read_lock();
>          /* rcu_dereference ctx->io_ev_fd once and use it for both for 
> checking and eventfd_signal */
>          ev_fd = rcu_dereference(ctx->io_ev_fd);
> 
> -       if (likely(!ev_fd))
> -               goto out;
>          if (READ_ONCE(ctx->rings->cq_flags) & IORING_CQ_EVENTFD_DISABLED)
>                  goto out;
> 
> 
>> synchronize_rcu() can take a long time, and I think this is in the wrong
>> spot. It should be on the register side, IFF we need to expedite the
>> completion of a previous event fd unregistration. If we do it that way,
>> at least it'll only happen if it's necessary. What do you think?
>>
> 
> 
> How about the approach in v4? so switching back to call_rcu as in v2 and 
> if ctx->io_ev_fd is NULL then we call rcu_barrier to make sure all rcu 
> callbacks are finished and check for NULL again.

I'll check, haven't looked at v4 yet!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 18:29       ` Jens Axboe
@ 2022-02-03 19:00         ` Pavel Begunkov
  2022-02-03 19:06           ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2022-02-03 19:00 UTC (permalink / raw)
  To: Jens Axboe, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 18:29, Jens Axboe wrote:
> On 2/3/22 11:26 AM, Usama Arif wrote:
>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>> suggesting to do the below diff over patch 3? I dont think that would be
>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>> present unregister can be called by another thread and set ctx->io_ev_fd
>> to NULL that would cause a NULL pointer exception later? In the current
>> patch, the check of whether ev_fd exists happens as the first thing
>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
> 
> They are cheap, but they are still noticeable at high requests/sec
> rates. So would be best to avoid them.
> 
> And yes it's obviously racy, there's the potential to miss an eventfd
> notification if it races with registering an eventfd descriptor. But
> that's not really a concern, as if you register with inflight IO
> pending, then that always exists just depending on timing. The only
> thing I care about here is that it's always _safe_. Hence something ala
> what you did below is totally fine, as we're re-evaluating under rcu
> protection.

Indeed, the patch doesn't have any formal guarantees for propagation
to already inflight requests, so this extra unsynchronised check
doesn't change anything.

I'm still more сurious why we need RCU and extra complexity when
apparently there is no use case for that. If it's only about
initial initialisation, then as I described there is a much
simpler approach.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 19:00         ` Pavel Begunkov
@ 2022-02-03 19:06           ` Jens Axboe
  2022-02-03 19:43             ` Pavel Begunkov
  2022-02-03 19:54             ` Usama Arif
  0 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2022-02-03 19:06 UTC (permalink / raw)
  To: Pavel Begunkov, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 12:00 PM, Pavel Begunkov wrote:
> On 2/3/22 18:29, Jens Axboe wrote:
>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>> suggesting to do the below diff over patch 3? I dont think that would be
>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>> to NULL that would cause a NULL pointer exception later? In the current
>>> patch, the check of whether ev_fd exists happens as the first thing
>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>
>> They are cheap, but they are still noticeable at high requests/sec
>> rates. So would be best to avoid them.
>>
>> And yes it's obviously racy, there's the potential to miss an eventfd
>> notification if it races with registering an eventfd descriptor. But
>> that's not really a concern, as if you register with inflight IO
>> pending, then that always exists just depending on timing. The only
>> thing I care about here is that it's always _safe_. Hence something ala
>> what you did below is totally fine, as we're re-evaluating under rcu
>> protection.
> 
> Indeed, the patch doesn't have any formal guarantees for propagation
> to already inflight requests, so this extra unsynchronised check
> doesn't change anything.
> 
> I'm still more сurious why we need RCU and extra complexity when
> apparently there is no use case for that. If it's only about
> initial initialisation, then as I described there is a much
> simpler approach.

Would be nice if we could get rid of the quiesce code in general, but I
haven't done a check to see what'd be missing after this...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 19:06           ` Jens Axboe
@ 2022-02-03 19:43             ` Pavel Begunkov
  2022-02-03 22:18               ` Jens Axboe
  2022-02-03 19:54             ` Usama Arif
  1 sibling, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2022-02-03 19:43 UTC (permalink / raw)
  To: Jens Axboe, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 19:06, Jens Axboe wrote:
> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>> On 2/3/22 18:29, Jens Axboe wrote:
>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>
>>> They are cheap, but they are still noticeable at high requests/sec
>>> rates. So would be best to avoid them.
>>>
>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>> notification if it races with registering an eventfd descriptor. But
>>> that's not really a concern, as if you register with inflight IO
>>> pending, then that always exists just depending on timing. The only
>>> thing I care about here is that it's always _safe_. Hence something ala
>>> what you did below is totally fine, as we're re-evaluating under rcu
>>> protection.
>>
>> Indeed, the patch doesn't have any formal guarantees for propagation
>> to already inflight requests, so this extra unsynchronised check
>> doesn't change anything.
>>
>> I'm still more сurious why we need RCU and extra complexity when
>> apparently there is no use case for that. If it's only about
>> initial initialisation, then as I described there is a much
>> simpler approach.
> 
> Would be nice if we could get rid of the quiesce code in general, but I
> haven't done a check to see what'd be missing after this...

Ok, I do think full quiesce is worth keeping as don't think all
registered parts need dynamic update. E.g. zc notification dynamic
reregistation doesn't make sense and I'd rather rely on existing
straightforward mechanisms than adding extra bits, even if it's
rsrc_nodes. That's not mentioning unnecessary extra overhead.

btw, I wouldn't say this eventfd specific sync is much simpler than
the whole full quiesce.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 19:06           ` Jens Axboe
  2022-02-03 19:43             ` Pavel Begunkov
@ 2022-02-03 19:54             ` Usama Arif
  2022-02-03 21:47               ` Pavel Begunkov
  2022-02-03 22:02               ` Pavel Begunkov
  1 sibling, 2 replies; 16+ messages in thread
From: Usama Arif @ 2022-02-03 19:54 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel; +Cc: fam.zheng



On 03/02/2022 19:06, Jens Axboe wrote:
> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>> On 2/3/22 18:29, Jens Axboe wrote:
>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>
>>> They are cheap, but they are still noticeable at high requests/sec
>>> rates. So would be best to avoid them.
>>>
>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>> notification if it races with registering an eventfd descriptor. But
>>> that's not really a concern, as if you register with inflight IO
>>> pending, then that always exists just depending on timing. The only
>>> thing I care about here is that it's always _safe_. Hence something ala
>>> what you did below is totally fine, as we're re-evaluating under rcu
>>> protection.
>>
>> Indeed, the patch doesn't have any formal guarantees for propagation
>> to already inflight requests, so this extra unsynchronised check
>> doesn't change anything.
>>
>> I'm still more сurious why we need RCU and extra complexity when
>> apparently there is no use case for that. If it's only about
>> initial initialisation, then as I described there is a much
>> simpler approach.
> 
> Would be nice if we could get rid of the quiesce code in general, but I
> haven't done a check to see what'd be missing after this...
> 

I had checked! I had posted below in in reply to v1 
(https://lore.kernel.org/io-uring/02fb0bc3-fc38-b8f0-3067-edd2a525ef29@gmail.com/T/#m5ac7867ac61d86fe62c099be793ffe5a9a334976), 
but i think it got missed! Copy-pasting here for reference:

"
I see that if we remove ring quiesce from the the above 3 opcodes, then
only IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS is
left for ring quiesce. I just had a quick look at those, and from what i
see we might not need to enter ring quiesce in
IORING_REGISTER_ENABLE_RINGS as the ring is already disabled at that point?
And for IORING_REGISTER_RESTRICTIONS if we do a similar approach to
IORING_REGISTER_EVENTFD, i.e. wrap ctx->restrictions inside an RCU
protected data structure, use spin_lock to prevent multiple
io_register_restrictions calls at the same time, and use read_rcu_lock
in io_check_restriction, then we can remove ring quiesce from
io_uring_register altogether?

My usecase only uses IORING_REGISTER_EVENTFD, but i think entering ring
quiesce costs similar in other opcodes. If the above sounds reasonable,
please let me know and i can send patches for removing ring quiesce for
io_uring_register.
"

Let me know if above makes sense, i can add patches on top of the 
current patchset, or we can do it after they get merged.

As for why, quiesce state is very expensive. its making 
io_uring_register the most expensive syscall in my usecase (~15ms) 
compared to ~0.1ms now with RCU, which is why i started investigating 
this. And this patchset avoids ring quiesce for 3 of the opcodes, so it 
would generally be quite helpful if someone does registers and 
unregisters eventfd multiple times.

Thanks,
Usama

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 19:54             ` Usama Arif
@ 2022-02-03 21:47               ` Pavel Begunkov
  2022-02-03 22:16                 ` Jens Axboe
  2022-02-03 22:02               ` Pavel Begunkov
  1 sibling, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2022-02-03 21:47 UTC (permalink / raw)
  To: Usama Arif, Jens Axboe, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 19:54, Usama Arif wrote:
> On 03/02/2022 19:06, Jens Axboe wrote:
>> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>>> On 2/3/22 18:29, Jens Axboe wrote:
>>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>>
>>>> They are cheap, but they are still noticeable at high requests/sec
>>>> rates. So would be best to avoid them.
>>>>
>>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>>> notification if it races with registering an eventfd descriptor. But
>>>> that's not really a concern, as if you register with inflight IO
>>>> pending, then that always exists just depending on timing. The only
>>>> thing I care about here is that it's always _safe_. Hence something ala
>>>> what you did below is totally fine, as we're re-evaluating under rcu
>>>> protection.
>>>
>>> Indeed, the patch doesn't have any formal guarantees for propagation
>>> to already inflight requests, so this extra unsynchronised check
>>> doesn't change anything.
>>>
>>> I'm still more сurious why we need RCU and extra complexity when
>>> apparently there is no use case for that. If it's only about
>>> initial initialisation, then as I described there is a much
>>> simpler approach.
>>
>> Would be nice if we could get rid of the quiesce code in general, but I
>> haven't done a check to see what'd be missing after this...
>>
> 
> I had checked! I had posted below in in reply to v1 (https://lore.kernel.org/io-uring/02fb0bc3-fc38-b8f0-3067-edd2a525ef29@gmail.com/T/#m5ac7867ac61d86fe62c099be793ffe5a9a334976), but i think it got missed! Copy-pasting here for reference:

May have missed it then, apologies

> "
> I see that if we remove ring quiesce from the the above 3 opcodes, then
> only IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS is
> left for ring quiesce. I just had a quick look at those, and from what i
> see we might not need to enter ring quiesce in
> IORING_REGISTER_ENABLE_RINGS as the ring is already disabled at that point?
> And for IORING_REGISTER_RESTRICTIONS if we do a similar approach to
> IORING_REGISTER_EVENTFD, i.e. wrap ctx->restrictions inside an RCU
> protected data structure, use spin_lock to prevent multiple
> io_register_restrictions calls at the same time, and use read_rcu_lock
> in io_check_restriction, then we can remove ring quiesce from
> io_uring_register altogether?
> 
> My usecase only uses IORING_REGISTER_EVENTFD, but i think entering ring
> quiesce costs similar in other opcodes. If the above sounds reasonable,
> please let me know and i can send patches for removing ring quiesce for
> io_uring_register.
> "
> 
> Let me know if above makes sense, i can add patches on top of the current patchset, or we can do it after they get merged.
> 
> As for why, quiesce state is very expensive. its making io_uring_register the most expensive syscall in my usecase (~15ms) compared to ~0.1ms now with RCU, which is why i started investigating this. And this patchset avoids ring quiesce for 3 of the opcodes, so it would generally be quite helpful if someone does registers and unregisters eventfd multiple times.

I agree that 15ms for initial setup is silly and it has to be
reduced. However, I'm trying weight the extra complexity against
potential benefits of _also_ optimising [de,re]-registration

Considering that you only register it one time at the beginning,
we risk adding a yet another feature that nobody is going to ever
use. This doesn't give me a nice feeling, well, unless you do
have a use case.

To emphasise, I'm comparing 15->0.1 improvement for only initial
registration (which is simpler) vs 15->0.1 for both registration
and unregistration.

fwiw, it alters userpace visible behaviour in either case, shouldn't
be as important here but there is always a chance to break userspace

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 19:54             ` Usama Arif
  2022-02-03 21:47               ` Pavel Begunkov
@ 2022-02-03 22:02               ` Pavel Begunkov
  1 sibling, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2022-02-03 22:02 UTC (permalink / raw)
  To: Usama Arif, Jens Axboe, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 19:54, Usama Arif wrote:
> On 03/02/2022 19:06, Jens Axboe wrote:
>> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>>> On 2/3/22 18:29, Jens Axboe wrote:
>>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>>
>>>> They are cheap, but they are still noticeable at high requests/sec
>>>> rates. So would be best to avoid them.
>>>>
>>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>>> notification if it races with registering an eventfd descriptor. But
>>>> that's not really a concern, as if you register with inflight IO
>>>> pending, then that always exists just depending on timing. The only
>>>> thing I care about here is that it's always _safe_. Hence something ala
>>>> what you did below is totally fine, as we're re-evaluating under rcu
>>>> protection.
>>>
>>> Indeed, the patch doesn't have any formal guarantees for propagation
>>> to already inflight requests, so this extra unsynchronised check
>>> doesn't change anything.
>>>
>>> I'm still more сurious why we need RCU and extra complexity when
>>> apparently there is no use case for that. If it's only about
>>> initial initialisation, then as I described there is a much
>>> simpler approach.
>>
>> Would be nice if we could get rid of the quiesce code in general, but I
>> haven't done a check to see what'd be missing after this...
>>
> 
> I had checked! I had posted below in in reply to v1 (https://lore.kernel.org/io-uring/02fb0bc3-fc38-b8f0-3067-edd2a525ef29@gmail.com/T/#m5ac7867ac61d86fe62c099be793ffe5a9a334976), but i think it got missed! Copy-pasting here for reference:
> 
> "
> I see that if we remove ring quiesce from the the above 3 opcodes, then
> only IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS is
> left for ring quiesce. I just had a quick look at those, and from what i
> see we might not need to enter ring quiesce in
> IORING_REGISTER_ENABLE_RINGS as the ring is already disabled at that point?
> And for IORING_REGISTER_RESTRICTIONS if we do a similar approach to

IORING_REGISTER_RESTRICTIONS and IORING_REGISTER_ENABLE_RINGS are simpler,
we can just remove quiesce (i.e. put them into io_register_op_must_quiesce())
without any extra changes.

TL;DR;
That's because IORING_SETUP_R_DISABLED prevents submitting requests
and so there will be no requests until IORING_REGISTER_ENABLE_RINGS is
called. And IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS was called.


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 21:47               ` Pavel Begunkov
@ 2022-02-03 22:16                 ` Jens Axboe
  2022-02-03 23:21                   ` Pavel Begunkov
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2022-02-03 22:16 UTC (permalink / raw)
  To: Pavel Begunkov, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 2:47 PM, Pavel Begunkov wrote:
> On 2/3/22 19:54, Usama Arif wrote:
>> On 03/02/2022 19:06, Jens Axboe wrote:
>>> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>>>> On 2/3/22 18:29, Jens Axboe wrote:
>>>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>>>
>>>>> They are cheap, but they are still noticeable at high requests/sec
>>>>> rates. So would be best to avoid them.
>>>>>
>>>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>>>> notification if it races with registering an eventfd descriptor. But
>>>>> that's not really a concern, as if you register with inflight IO
>>>>> pending, then that always exists just depending on timing. The only
>>>>> thing I care about here is that it's always _safe_. Hence something ala
>>>>> what you did below is totally fine, as we're re-evaluating under rcu
>>>>> protection.
>>>>
>>>> Indeed, the patch doesn't have any formal guarantees for propagation
>>>> to already inflight requests, so this extra unsynchronised check
>>>> doesn't change anything.
>>>>
>>>> I'm still more сurious why we need RCU and extra complexity when
>>>> apparently there is no use case for that. If it's only about
>>>> initial initialisation, then as I described there is a much
>>>> simpler approach.
>>>
>>> Would be nice if we could get rid of the quiesce code in general, but I
>>> haven't done a check to see what'd be missing after this...
>>>
>>
>> I had checked! I had posted below in in reply to v1 (https://lore.kernel.org/io-uring/02fb0bc3-fc38-b8f0-3067-edd2a525ef29@gmail.com/T/#m5ac7867ac61d86fe62c099be793ffe5a9a334976), but i think it got missed! Copy-pasting here for reference:
> 
> May have missed it then, apologies
> 
>> "
>> I see that if we remove ring quiesce from the the above 3 opcodes, then
>> only IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS is
>> left for ring quiesce. I just had a quick look at those, and from what i
>> see we might not need to enter ring quiesce in
>> IORING_REGISTER_ENABLE_RINGS as the ring is already disabled at that point?
>> And for IORING_REGISTER_RESTRICTIONS if we do a similar approach to
>> IORING_REGISTER_EVENTFD, i.e. wrap ctx->restrictions inside an RCU
>> protected data structure, use spin_lock to prevent multiple
>> io_register_restrictions calls at the same time, and use read_rcu_lock
>> in io_check_restriction, then we can remove ring quiesce from
>> io_uring_register altogether?
>>
>> My usecase only uses IORING_REGISTER_EVENTFD, but i think entering ring
>> quiesce costs similar in other opcodes. If the above sounds reasonable,
>> please let me know and i can send patches for removing ring quiesce for
>> io_uring_register.
>> "
>>
>> Let me know if above makes sense, i can add patches on top of the current patchset, or we can do it after they get merged.
>>
>> As for why, quiesce state is very expensive. its making io_uring_register the most expensive syscall in my usecase (~15ms) compared to ~0.1ms now with RCU, which is why i started investigating this. And this patchset avoids ring quiesce for 3 of the opcodes, so it would generally be quite helpful if someone does registers and unregisters eventfd multiple times.
> 
> I agree that 15ms for initial setup is silly and it has to be
> reduced. However, I'm trying weight the extra complexity against
> potential benefits of _also_ optimising [de,re]-registration
> 
> Considering that you only register it one time at the beginning,
> we risk adding a yet another feature that nobody is going to ever
> use. This doesn't give me a nice feeling, well, unless you do
> have a use case.

It's not really a new feature, it's just making the existing one not
suck quite as much...

> To emphasise, I'm comparing 15->0.1 improvement for only initial
> registration (which is simpler) vs 15->0.1 for both registration
> and unregistration.

reg+unreg should be way faster too, if done properly with the assignment
tricks.

> fwiw, it alters userpace visible behaviour in either case, shouldn't
> be as important here but there is always a chance to break userspace

It doesn't alter userspace behavior, if the registration works like I
described with being able to assign a new one while the old one is being
torn down.

Or do you mean wrt inflight IO? I don't think the risk is very high
there, to be honest.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 19:43             ` Pavel Begunkov
@ 2022-02-03 22:18               ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2022-02-03 22:18 UTC (permalink / raw)
  To: Pavel Begunkov, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 12:43 PM, Pavel Begunkov wrote:
> On 2/3/22 19:06, Jens Axboe wrote:
>> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>>> On 2/3/22 18:29, Jens Axboe wrote:
>>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>>
>>>> They are cheap, but they are still noticeable at high requests/sec
>>>> rates. So would be best to avoid them.
>>>>
>>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>>> notification if it races with registering an eventfd descriptor. But
>>>> that's not really a concern, as if you register with inflight IO
>>>> pending, then that always exists just depending on timing. The only
>>>> thing I care about here is that it's always _safe_. Hence something ala
>>>> what you did below is totally fine, as we're re-evaluating under rcu
>>>> protection.
>>>
>>> Indeed, the patch doesn't have any formal guarantees for propagation
>>> to already inflight requests, so this extra unsynchronised check
>>> doesn't change anything.
>>>
>>> I'm still more сurious why we need RCU and extra complexity when
>>> apparently there is no use case for that. If it's only about
>>> initial initialisation, then as I described there is a much
>>> simpler approach.
>>
>> Would be nice if we could get rid of the quiesce code in general, but I
>> haven't done a check to see what'd be missing after this...
> 
> Ok, I do think full quiesce is worth keeping as don't think all
> registered parts need dynamic update. E.g. zc notification dynamic
> reregistation doesn't make sense and I'd rather rely on existing
> straightforward mechanisms than adding extra bits, even if it's
> rsrc_nodes. That's not mentioning unnecessary extra overhead.
> 
> btw, I wouldn't say this eventfd specific sync is much simpler than
> the whole full quiesce.

It's easier to understand though, as it follows the usual rules of
RCU which are used throughout the kernel.

On quiesce in general, my curiosity was around whether we'd ever
get to the point where all register handlers are marked as not
needing quisce, and it seems it isn't that far off.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] Re: [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd
  2022-02-03 22:16                 ` Jens Axboe
@ 2022-02-03 23:21                   ` Pavel Begunkov
  0 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2022-02-03 23:21 UTC (permalink / raw)
  To: Jens Axboe, Usama Arif, io-uring, linux-kernel; +Cc: fam.zheng

On 2/3/22 22:16, Jens Axboe wrote:
> On 2/3/22 2:47 PM, Pavel Begunkov wrote:
>> On 2/3/22 19:54, Usama Arif wrote:
>>> On 03/02/2022 19:06, Jens Axboe wrote:
>>>> On 2/3/22 12:00 PM, Pavel Begunkov wrote:
>>>>> On 2/3/22 18:29, Jens Axboe wrote:
>>>>>> On 2/3/22 11:26 AM, Usama Arif wrote:
>>>>>>> Hmm, maybe i didn't understand you and Pavel correctly. Are you
>>>>>>> suggesting to do the below diff over patch 3? I dont think that would be
>>>>>>> correct, as it is possible that just after checking if ctx->io_ev_fd is
>>>>>>> present unregister can be called by another thread and set ctx->io_ev_fd
>>>>>>> to NULL that would cause a NULL pointer exception later? In the current
>>>>>>> patch, the check of whether ev_fd exists happens as the first thing
>>>>>>> after rcu_read_lock and the rcu_read_lock are extremely cheap i believe.
>>>>>>
>>>>>> They are cheap, but they are still noticeable at high requests/sec
>>>>>> rates. So would be best to avoid them.
>>>>>>
>>>>>> And yes it's obviously racy, there's the potential to miss an eventfd
>>>>>> notification if it races with registering an eventfd descriptor. But
>>>>>> that's not really a concern, as if you register with inflight IO
>>>>>> pending, then that always exists just depending on timing. The only
>>>>>> thing I care about here is that it's always _safe_. Hence something ala
>>>>>> what you did below is totally fine, as we're re-evaluating under rcu
>>>>>> protection.
>>>>>
>>>>> Indeed, the patch doesn't have any formal guarantees for propagation
>>>>> to already inflight requests, so this extra unsynchronised check
>>>>> doesn't change anything.
>>>>>
>>>>> I'm still more сurious why we need RCU and extra complexity when
>>>>> apparently there is no use case for that. If it's only about
>>>>> initial initialisation, then as I described there is a much
>>>>> simpler approach.
>>>>
>>>> Would be nice if we could get rid of the quiesce code in general, but I
>>>> haven't done a check to see what'd be missing after this...
>>>>
>>>
>>> I had checked! I had posted below in in reply to v1 (https://lore.kernel.org/io-uring/02fb0bc3-fc38-b8f0-3067-edd2a525ef29@gmail.com/T/#m5ac7867ac61d86fe62c099be793ffe5a9a334976), but i think it got missed! Copy-pasting here for reference:
>>
>> May have missed it then, apologies
>>
>>> "
>>> I see that if we remove ring quiesce from the the above 3 opcodes, then
>>> only IORING_REGISTER_ENABLE_RINGS and IORING_REGISTER_RESTRICTIONS is
>>> left for ring quiesce. I just had a quick look at those, and from what i
>>> see we might not need to enter ring quiesce in
>>> IORING_REGISTER_ENABLE_RINGS as the ring is already disabled at that point?
>>> And for IORING_REGISTER_RESTRICTIONS if we do a similar approach to
>>> IORING_REGISTER_EVENTFD, i.e. wrap ctx->restrictions inside an RCU
>>> protected data structure, use spin_lock to prevent multiple
>>> io_register_restrictions calls at the same time, and use read_rcu_lock
>>> in io_check_restriction, then we can remove ring quiesce from
>>> io_uring_register altogether?
>>>
>>> My usecase only uses IORING_REGISTER_EVENTFD, but i think entering ring
>>> quiesce costs similar in other opcodes. If the above sounds reasonable,
>>> please let me know and i can send patches for removing ring quiesce for
>>> io_uring_register.
>>> "
>>>
>>> Let me know if above makes sense, i can add patches on top of the current patchset, or we can do it after they get merged.
>>>
>>> As for why, quiesce state is very expensive. its making io_uring_register the most expensive syscall in my usecase (~15ms) compared to ~0.1ms now with RCU, which is why i started investigating this. And this patchset avoids ring quiesce for 3 of the opcodes, so it would generally be quite helpful if someone does registers and unregisters eventfd multiple times.
>>
>> I agree that 15ms for initial setup is silly and it has to be
>> reduced. However, I'm trying weight the extra complexity against
>> potential benefits of _also_ optimising [de,re]-registration
>>
>> Considering that you only register it one time at the beginning,
>> we risk adding a yet another feature that nobody is going to ever
>> use. This doesn't give me a nice feeling, well, unless you do
>> have a use case.
> 
> It's not really a new feature, it's just making the existing one not
> suck quite as much...

Does it matter when nobody uses it? My point is that does not.


>> To emphasise, I'm comparing 15->0.1 improvement for only initial
>> registration (which is simpler) vs 15->0.1 for both registration
>> and unregistration.
> 
> reg+unreg should be way faster too, if done properly with the assignment
> tricks.
> 
>> fwiw, it alters userpace visible behaviour in either case, shouldn't
>> be as important here but there is always a chance to break userspace
> 
> It doesn't alter userspace behavior, if the registration works like I
> described with being able to assign a new one while the old one is being
> torn down.
> 
> Or do you mean wrt inflight IO? I don't think the risk is very high
> there, to be honest.

Right, if somebody tries such a trick it'll be pretty confusing to
get randomly firing eventfd, though it's rather a marginal case.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-02-03 23:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-03 17:41 [PATCH v3 0/3] io_uring: avoid ring quiesce in io_uring_register for eventfd opcodes Usama Arif
2022-02-03 17:41 ` [PATCH v3 1/3] io_uring: remove trace for eventfd Usama Arif
2022-02-03 17:41 ` [PATCH v3 2/3] io_uring: avoid ring quiesce while registering/unregistering eventfd Usama Arif
2022-02-03 17:56   ` Jens Axboe
2022-02-03 18:26     ` [External] " Usama Arif
2022-02-03 18:29       ` Jens Axboe
2022-02-03 19:00         ` Pavel Begunkov
2022-02-03 19:06           ` Jens Axboe
2022-02-03 19:43             ` Pavel Begunkov
2022-02-03 22:18               ` Jens Axboe
2022-02-03 19:54             ` Usama Arif
2022-02-03 21:47               ` Pavel Begunkov
2022-02-03 22:16                 ` Jens Axboe
2022-02-03 23:21                   ` Pavel Begunkov
2022-02-03 22:02               ` Pavel Begunkov
2022-02-03 17:41 ` [PATCH v3 3/3] io_uring: avoid ring quiesce for IORING_REGISTER_EVENTFD_ASYNC Usama Arif

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.