* [PATCH 5.14 00/12] for-next optimisations
@ 2021-06-14 22:37 Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 01/12] io_uring: keep SQ pointers in a single cacheline Pavel Begunkov
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

There are two main lines of work intertwined here. The first is pt.2 of
ctx field shuffling for better caching. There are a couple of things left
on that front.

The second is optimising the (presumably) rarely used offset-based
timeouts and draining. There is a downside (see 12/12), which will be
fixed later. The plan is to queue a task_work clearing drain_used (under
uring_lock) from io_queue_deferred() once all drained requests are gone.

nops(batch=32):
    15.9 MIOPS vs 17.3 MIOPS
nullblk (irqmode=2 completion_nsec=0 submit_queues=16), no merges, no stat
    1002 KIOPS vs 1050 KIOPS

Though the second test is very slow compared to what I've seen before,
so it might not be representative.
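
For reference, a rough liburing sketch of what a nops(batch=32) test
looks like; the queue size and iteration count are made up and this is
not the exact benchmark used for the numbers above.

#include <liburing.h>
#include <stdio.h>

#define BATCH	32
#define LOOPS	(1 << 18)

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned long done = 0;

	if (io_uring_queue_init(BATCH, &ring, 0) < 0)
		return 1;

	for (int i = 0; i < LOOPS; i++) {
		/* queue a full batch of NOPs, then submit with one syscall */
		for (int j = 0; j < BATCH; j++)
			io_uring_prep_nop(io_uring_get_sqe(&ring));
		io_uring_submit(&ring);

		/* reap all completions of the batch */
		for (int j = 0; j < BATCH; j++) {
			io_uring_wait_cqe(&ring, &cqe);
			io_uring_cqe_seen(&ring, cqe);
			done++;
		}
	}
	printf("completed %lu nops\n", done);
	io_uring_queue_exit(&ring);
	return 0;
}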

Pavel Begunkov (12):
  io_uring: keep SQ pointers in a single cacheline
  io_uring: move ctx->flags from SQ cacheline
  io_uring: shuffle more fields into SQ ctx section
  io_uring: refactor io_get_sqe()
  io_uring: don't cache number of dropped SQEs
  io_uring: optimise completion timeout flushing
  io_uring: small io_submit_sqe() optimisation
  io_uring: clean up check_overflow flag
  io_uring: wait heads renaming
  io_uring: move uring_lock location
  io_uring: refactor io_req_defer()
  io_uring: optimise non-drain path

 fs/io_uring.c | 226 +++++++++++++++++++++++++-------------------------
 1 file changed, 111 insertions(+), 115 deletions(-)

-- 
2.31.1



* [PATCH 01/12] io_uring: keep SQ pointers in a single cacheline
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 02/12] io_uring: move ctx->flags from SQ cacheline Pavel Begunkov
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

sq_array and sq_sqes are always used together, but they sit in different
cachelines; the boundary falls right before cq_overflow_list, which is
rather rarely touched. Move the fields together so fetching them loads
only one cacheline.
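
A userspace toy (the struct below is illustrative, not the kernel
layout) showing how such a move can be sanity-checked: two fields share
a 64-byte cacheline exactly when their offsets land in the same 64-byte
block.

#include <stdio.h>
#include <stddef.h>

#define CACHELINE	64

struct toy_ctx {
	unsigned int	*sq_array;
	void		*sq_sqes;	/* moved right next to sq_array */
	unsigned	cached_sq_head;
	unsigned	sq_entries;
};

int main(void)
{
	size_t a = offsetof(struct toy_ctx, sq_array);
	size_t b = offsetof(struct toy_ctx, sq_sqes);

	printf("sq_array @%zu, sq_sqes @%zu, same cacheline: %s\n",
	       a, b, a / CACHELINE == b / CACHELINE ? "yes" : "no");
	return 0;
}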

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d665c9419ad3..f3c827cd8ff8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -364,6 +364,7 @@ struct io_ring_ctx {
 		 * array.
 		 */
 		u32			*sq_array;
+		struct io_uring_sqe	*sq_sqes;
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
 		unsigned		sq_thread_idle;
@@ -373,8 +374,6 @@ struct io_ring_ctx {
 		struct list_head	defer_list;
 		struct list_head	timeout_list;
 		struct list_head	cq_overflow_list;
-
-		struct io_uring_sqe	*sq_sqes;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
-- 
2.31.1



* [PATCH 02/12] io_uring: move ctx->flags from SQ cacheline
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 01/12] io_uring: keep SQ pointers in a single cacheline Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 03/12] io_uring: shuffle more fields into SQ ctx section Pavel Begunkov
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

ctx->flags is heavily used by both the completion and submission sides,
so move it out of the ctx fields related to submission. Instead, place it
together with ctx->refs, because that field is already cacheline-aligned
and so pads lots of space, and both almost never change. Also, in most
cases they are accessed together, as refs are taken at submission time
and put back during completion.

Do the same with ctx->rings, whose pointer is never modified apart from
ring init/free.

Note: in percpu mode, struct percpu_ref doesn't modify the struct itself
but goes through an indirection via ref->percpu_count_ptr.
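
A userspace analogue (toy code, not the kernel percpu_ref
implementation) of why get/put leaves the shared struct clean: the hot
path only reads percpu_count_ptr and writes a per-CPU slot.

#include <stdio.h>

#define NR_CPUS	4

struct toy_percpu_ref {
	long *percpu_count_ptr;	/* read-only on the hot path */
	long atomic_count;	/* only used after switching to atomic mode */
};

static long slots[NR_CPUS];

static void toy_ref_get(struct toy_percpu_ref *ref, int cpu)
{
	ref->percpu_count_ptr[cpu]++;	/* dirties the per-cpu slot, not *ref */
}

static void toy_ref_put(struct toy_percpu_ref *ref, int cpu)
{
	ref->percpu_count_ptr[cpu]--;
}

int main(void)
{
	struct toy_percpu_ref ref = { .percpu_count_ptr = slots };

	toy_ref_get(&ref, 0);
	toy_ref_put(&ref, 0);
	printf("slot[0]=%ld, the struct itself stays untouched\n", slots[0]);
	return 0;
}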

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f3c827cd8ff8..a4460383bd25 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -341,17 +341,19 @@ struct io_submit_state {
 };
 
 struct io_ring_ctx {
+	/* const or read-mostly hot data */
 	struct {
 		struct percpu_ref	refs;
-	} ____cacheline_aligned_in_smp;
 
-	struct {
+		struct io_rings		*rings;
 		unsigned int		flags;
 		unsigned int		compat: 1;
 		unsigned int		drain_next: 1;
 		unsigned int		eventfd_async: 1;
 		unsigned int		restricted: 1;
+	} ____cacheline_aligned_in_smp;
 
+	struct {
 		/*
 		 * Ring buffer of indices into array of io_uring_sqe, which is
 		 * mmapped by the application using the IORING_OFF_SQES offset.
@@ -386,8 +388,6 @@ struct io_ring_ctx {
 	struct list_head	locked_free_list;
 	unsigned int		locked_free_nr;
 
-	struct io_rings	*rings;
-
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
-- 
2.31.1



* [PATCH 03/12] io_uring: shuffle more fields into SQ ctx section
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 01/12] io_uring: keep SQ pointers in a single cacheline Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 02/12] io_uring: move ctx->flags from SQ cacheline Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 04/12] io_uring: refactor io_get_sqe() Pavel Begunkov
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

Since locked_free_* was moved out of struct io_submit_state,
ctx->submit_state is accessed on the submission side only, so move it
into the submission section. The same goes for the rsrc table
pointers/nodes/etc.: they must be taken and checked during submission
because they are synchronised by uring_lock, so move them there as well.
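
A toy demonstration (made-up fields, 64-byte cachelines assumed) of how
aligned sections keep submission-only and completion-only data on
different cachelines.

#include <stdio.h>
#include <stddef.h>

struct toy_ring_ctx {
	/* submission data */
	struct {
		unsigned	cached_sq_head;
		unsigned	sq_entries;
		void		*submit_state;
	} sub __attribute__((aligned(64)));

	/* completion data */
	struct {
		unsigned	cached_cq_tail;
		unsigned	cq_entries;
	} cmpl __attribute__((aligned(64)));
};

int main(void)
{
	printf("sub @%zu, cmpl @%zu -- different 64B lines\n",
	       offsetof(struct toy_ring_ctx, sub),
	       offsetof(struct toy_ring_ctx, cmpl));
	return 0;
}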

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a4460383bd25..5cc0c4dd2709 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -353,6 +353,7 @@ struct io_ring_ctx {
 		unsigned int		restricted: 1;
 	} ____cacheline_aligned_in_smp;
 
+	/* submission data */
 	struct {
 		/*
 		 * Ring buffer of indices into array of io_uring_sqe, which is
@@ -369,13 +370,27 @@ struct io_ring_ctx {
 		struct io_uring_sqe	*sq_sqes;
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
-		unsigned		sq_thread_idle;
 		unsigned		cached_sq_dropped;
 		unsigned long		sq_check_overflow;
-
 		struct list_head	defer_list;
+
+		/*
+		 * Fixed resources fast path, should be accessed only under
+		 * uring_lock, and updated through io_uring_register(2)
+		 */
+		struct io_rsrc_node	*rsrc_node;
+		struct io_file_table	file_table;
+		unsigned		nr_user_files;
+		unsigned		nr_user_bufs;
+		struct io_mapped_ubuf	**user_bufs;
+
+		struct io_submit_state	submit_state;
 		struct list_head	timeout_list;
 		struct list_head	cq_overflow_list;
+		struct xarray		io_buffers;
+		struct xarray		personalities;
+		u32			pers_next;
+		unsigned		sq_thread_idle;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -383,7 +398,6 @@ struct io_ring_ctx {
 		wait_queue_head_t	wait;
 	} ____cacheline_aligned_in_smp;
 
-	struct io_submit_state		submit_state;
 	/* IRQ completion list, under ->completion_lock */
 	struct list_head	locked_free_list;
 	unsigned int		locked_free_nr;
@@ -394,21 +408,6 @@ struct io_ring_ctx {
 	struct wait_queue_head	sqo_sq_wait;
 	struct list_head	sqd_list;
 
-	/*
-	 * Fixed resources fast path, should be accessed only under uring_lock,
-	 * and updated through io_uring_register(2)
-	 */
-	struct io_rsrc_node	*rsrc_node;
-
-	struct io_file_table	file_table;
-	unsigned		nr_user_files;
-	unsigned		nr_user_bufs;
-	struct io_mapped_ubuf	**user_bufs;
-
-	struct xarray		io_buffers;
-	struct xarray		personalities;
-	u32			pers_next;
-
 	struct {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
-- 
2.31.1



* [PATCH 04/12] io_uring: refactor io_get_sqe()
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (2 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 03/12] io_uring: shuffle more fields into SQ ctx section Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 05/12] io_uring: don't cache number of dropped SQEs Pavel Begunkov
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

The line in io_get_sqe() evaluating @head packs in too many operations,
including READ_ONCE(), which makes it inconvenient for probing. Refactor
it, also improving readability.
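
For reference, a userspace stand-in (MY_READ_ONCE is a simplified
substitute for the kernel macro) showing the shape of the refactoring:
the index arithmetic becomes a plain statement and the annotated load is
the only thing left on its line.

#include <stdio.h>

#define MY_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

int main(void)
{
	unsigned sq_array[8] = { 3, 1, 4, 1, 5, 9, 2, 6 };
	unsigned cached_sq_head = 10, mask = 7;

	unsigned sq_idx = cached_sq_head++ & mask;	/* plain arithmetic */
	unsigned head = MY_READ_ONCE(sq_array[sq_idx]);	/* the one marked load */

	printf("idx=%u head=%u\n", sq_idx, head);
	return 0;
}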

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5cc0c4dd2709..3baacfe2c9b7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6685,8 +6685,8 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
  */
 static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
 {
-	u32 *sq_array = ctx->sq_array;
 	unsigned head, mask = ctx->sq_entries - 1;
+	unsigned sq_idx = ctx->cached_sq_head++ & mask;
 
 	/*
 	 * The cached sq head (or cq tail) serves two purposes:
@@ -6696,7 +6696,7 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
 	 * 2) allows the kernel side to track the head on its own, even
 	 *    though the application is the one updating it.
 	 */
-	head = READ_ONCE(sq_array[ctx->cached_sq_head++ & mask]);
+	head = READ_ONCE(ctx->sq_array[sq_idx]);
 	if (likely(head < ctx->sq_entries))
 		return &ctx->sq_sqes[head];
 
-- 
2.31.1



* [PATCH 05/12] io_uring: don't cache number of dropped SQEs
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (3 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 04/12] io_uring: refactor io_get_sqe() Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 06/12] io_uring: optimise completion timeout flushing Pavel Begunkov
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

Kill ->cached_sq_dropped and wire the DRAIN sequence number correction
via ->cq_extra, which is there exactly for that purpose. The user-visible
dropped counter is populated by incrementing it directly instead of
keeping a copy, similar to what was done not so long ago with
cq_overflow.
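
A tiny sketch (plain variables standing in for the shared ring fields)
contrasting the old keep-a-copy style with bumping the shared counter
directly; the two helpers are alternatives, not meant to be mixed.

#include <stdio.h>

static unsigned shared_sq_dropped;	/* stands in for rings->sq_dropped */

/* old style: private copy, published wholesale on every update */
static unsigned cached_sq_dropped;
static void drop_sqe_old(void)
{
	cached_sq_dropped++;
	shared_sq_dropped = cached_sq_dropped;
}

/* new style: no private copy to keep in sync */
static void drop_sqe_new(void)
{
	shared_sq_dropped = shared_sq_dropped + 1;
}

int main(void)
{
	drop_sqe_old();
	drop_sqe_new();
	printf("sq_dropped=%u\n", shared_sq_dropped);
	return 0;
}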

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3baacfe2c9b7..6dd14f4aa5f1 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -370,7 +370,6 @@ struct io_ring_ctx {
 		struct io_uring_sqe	*sq_sqes;
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
-		unsigned		cached_sq_dropped;
 		unsigned long		sq_check_overflow;
 		struct list_head	defer_list;
 
@@ -5994,13 +5993,11 @@ static u32 io_get_sequence(struct io_kiocb *req)
 {
 	struct io_kiocb *pos;
 	struct io_ring_ctx *ctx = req->ctx;
-	u32 total_submitted, nr_reqs = 0;
+	u32 nr_reqs = 0;
 
 	io_for_each_link(pos, req)
 		nr_reqs++;
-
-	total_submitted = ctx->cached_sq_head - ctx->cached_sq_dropped;
-	return total_submitted - nr_reqs;
+	return ctx->cached_sq_head - nr_reqs;
 }
 
 static int io_req_defer(struct io_kiocb *req)
@@ -6701,8 +6698,9 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
 		return &ctx->sq_sqes[head];
 
 	/* drop invalid entries */
-	ctx->cached_sq_dropped++;
-	WRITE_ONCE(ctx->rings->sq_dropped, ctx->cached_sq_dropped);
+	ctx->cq_extra--;
+	WRITE_ONCE(ctx->rings->sq_dropped,
+		   READ_ONCE(ctx->rings->sq_dropped) + 1);
 	return NULL;
 }
 
-- 
2.31.1



* [PATCH 06/12] io_uring: optimise completion timeout flushing
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (4 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 05/12] io_uring: don't cache number of dropped SQEs Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 07/12] io_uring: small io_submit_sqe() optimisation Pavel Begunkov
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

io_commit_cqring() might be very hot and we definitely don't want to
touch ->timeout_list there, because 1) it's shared with the submission
side, which might lead to cache bouncing, and 2) it may need to load an
extra cacheline, especially for IRQ completions.

The completion side is interested in it only when there are offset-mode
timeouts, which are not so popular. Replace the
list_empty(->timeout_list) hot-path check with a new one-way flag, which
is set when we prepare the first offset-mode timeout.

Note: the flag sits in the same cacheline as ->rings, which is used
briefly right after the flag check.
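
A toy model (not the kernel code) of the one-way flag: the hot commit
path tests a single bit, and only the rare offset-timeout case walks the
list.

#include <stdbool.h>
#include <stdio.h>

struct toy_ctx {
	bool off_timeout_used;		/* one-way: set once, never cleared */
	int nr_offset_timeouts;		/* stands in for ->timeout_list */
};

static void toy_timeout_prep(struct toy_ctx *ctx, unsigned off)
{
	if (off && !ctx->off_timeout_used)
		ctx->off_timeout_used = true;
	if (off)
		ctx->nr_offset_timeouts++;
}

static void toy_commit_cqring(struct toy_ctx *ctx)
{
	if (!ctx->off_timeout_used)	/* hot path: one flag test */
		return;
	printf("flushing %d offset-mode timeouts\n", ctx->nr_offset_timeouts);
}

int main(void)
{
	struct toy_ctx ctx = { 0 };

	toy_commit_cqring(&ctx);	/* cheap, nothing to look at */
	toy_timeout_prep(&ctx, 8);	/* first offset-mode timeout flips the flag */
	toy_commit_cqring(&ctx);	/* now takes the slower branch */
	return 0;
}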

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6dd14f4aa5f1..fbf3b2149a4c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -351,6 +351,7 @@ struct io_ring_ctx {
 		unsigned int		drain_next: 1;
 		unsigned int		eventfd_async: 1;
 		unsigned int		restricted: 1;
+		unsigned int		off_timeout_used: 1;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
@@ -1318,12 +1319,12 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
 {
 	u32 seq;
 
-	if (list_empty(&ctx->timeout_list))
+	if (likely(!ctx->off_timeout_used))
 		return;
 
 	seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
 
-	do {
+	while (!list_empty(&ctx->timeout_list)) {
 		u32 events_needed, events_got;
 		struct io_kiocb *req = list_first_entry(&ctx->timeout_list,
 						struct io_kiocb, timeout.list);
@@ -1345,8 +1346,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
 
 		list_del_init(&req->timeout.list);
 		io_kill_timeout(req, 0);
-	} while (!list_empty(&ctx->timeout_list));
-
+	}
 	ctx->cq_last_tm_flush = seq;
 }
 
@@ -5651,6 +5651,8 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return -EINVAL;
 
 	req->timeout.off = off;
+	if (unlikely(off && !req->ctx->off_timeout_used))
+		req->ctx->off_timeout_used = true;
 
 	if (!req->async_data && io_alloc_async_data(req))
 		return -ENOMEM;
-- 
2.31.1



* [PATCH 07/12] io_uring: small io_submit_sqe() optimisation
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (5 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 06/12] io_uring: optimise completion timeout flushing Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 08/12] io_uring: clean up check_overflow flag Pavel Begunkov
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

submit_state.link is used only to assemble a link and not for actual
submission, so clear it before io_queue_sqe() in io_submit_sqe(), while
it's still hot in the caches and before queueing can spoil it. It may
also help the compiler with spilling or other optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index fbf3b2149a4c..9c73991465c8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6616,8 +6616,8 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 		/* last request of a link, enqueue the link */
 		if (!(req->flags & (REQ_F_LINK | REQ_F_HARDLINK))) {
-			io_queue_sqe(head);
 			link->head = NULL;
+			io_queue_sqe(head);
 		}
 	} else {
 		if (unlikely(ctx->drain_next)) {
-- 
2.31.1



* [PATCH 08/12] io_uring: clean up check_overflow flag
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (6 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 07/12] io_uring: small io_submit_sqe() optimisation Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 09/12] io_uring: wait heads renaming Pavel Begunkov
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

There are no users of ->sq_check_overflow; only ->cq_check_overflow is
used. Combine them and move the result out of the completion-related part
of struct io_ring_ctx.

A less obvious benefit is that all completion-side fields now fit into a
single cacheline. They previously took two cachelines with 56B of
padding, and io_cqring_ev_posted*() were still touching both of them.
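
A compile-time sketch (field names and types are simplified stand-ins,
not the kernel struct, so the exact sizes differ) of the
fits-in-one-cacheline claim, using C11 static_assert.

#include <assert.h>
#include <stdio.h>

struct toy_cq_section {
	unsigned	cached_cq_tail;
	unsigned	cq_entries;
	int		cq_timeouts;		/* atomic_t in the kernel */
	unsigned	cq_last_tm_flush;
	unsigned	cq_extra;
	void		*cq_wait;		/* wait_queue_head_t in the kernel */
	void		*cq_fasync;
	void		*cq_ev_fd;
};

static_assert(sizeof(struct toy_cq_section) <= 64,
	      "completion-side fields should fit in a single cacheline");

int main(void)
{
	printf("toy cq section: %zu bytes\n", sizeof(struct toy_cq_section));
	return 0;
}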

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9c73991465c8..65d51e2d5c15 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -371,7 +371,6 @@ struct io_ring_ctx {
 		struct io_uring_sqe	*sq_sqes;
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
-		unsigned long		sq_check_overflow;
 		struct list_head	defer_list;
 
 		/*
@@ -408,13 +407,14 @@ struct io_ring_ctx {
 	struct wait_queue_head	sqo_sq_wait;
 	struct list_head	sqd_list;
 
+	unsigned long		check_cq_overflow;
+
 	struct {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
 		atomic_t		cq_timeouts;
 		unsigned		cq_last_tm_flush;
 		unsigned		cq_extra;
-		unsigned long		cq_check_overflow;
 		struct wait_queue_head	cq_wait;
 		struct fasync_struct	*cq_fasync;
 		struct eventfd_ctx	*cq_ev_fd;
@@ -1464,8 +1464,7 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 
 	all_flushed = list_empty(&ctx->cq_overflow_list);
 	if (all_flushed) {
-		clear_bit(0, &ctx->sq_check_overflow);
-		clear_bit(0, &ctx->cq_check_overflow);
+		clear_bit(0, &ctx->check_cq_overflow);
 		ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW;
 	}
 
@@ -1481,7 +1480,7 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
 {
 	bool ret = true;
 
-	if (test_bit(0, &ctx->cq_check_overflow)) {
+	if (test_bit(0, &ctx->check_cq_overflow)) {
 		/* iopoll syncs against uring_lock, not completion_lock */
 		if (ctx->flags & IORING_SETUP_IOPOLL)
 			mutex_lock(&ctx->uring_lock);
@@ -1544,8 +1543,7 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
 		return false;
 	}
 	if (list_empty(&ctx->cq_overflow_list)) {
-		set_bit(0, &ctx->sq_check_overflow);
-		set_bit(0, &ctx->cq_check_overflow);
+		set_bit(0, &ctx->check_cq_overflow);
 		ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW;
 	}
 	ocqe->cqe.user_data = user_data;
@@ -2391,7 +2389,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
 	 * If we do, we can potentially be spinning for commands that
 	 * already triggered a CQE (eg in error).
 	 */
-	if (test_bit(0, &ctx->cq_check_overflow))
+	if (test_bit(0, &ctx->check_cq_overflow))
 		__io_cqring_overflow_flush(ctx, false);
 	if (io_cqring_events(ctx))
 		goto out;
@@ -6965,7 +6963,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 	 * Cannot safely flush overflowed CQEs from here, ensure we wake up
 	 * the task, and the next invocation will do it.
 	 */
-	if (io_should_wake(iowq) || test_bit(0, &iowq->ctx->cq_check_overflow))
+	if (io_should_wake(iowq) || test_bit(0, &iowq->ctx->check_cq_overflow))
 		return autoremove_wake_function(curr, mode, wake_flags, key);
 	return -1;
 }
@@ -6993,7 +6991,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 	if (ret || io_should_wake(iowq))
 		return ret;
 	/* let the caller flush overflows, retry */
-	if (test_bit(0, &ctx->cq_check_overflow))
+	if (test_bit(0, &ctx->check_cq_overflow))
 		return 1;
 
 	*timeout = schedule_timeout(*timeout);
@@ -8702,7 +8700,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	 * Users may get EPOLLIN meanwhile seeing nothing in cqring, this
 	 * pushs them to do the flush.
 	 */
-	if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
+	if (io_cqring_events(ctx) || test_bit(0, &ctx->check_cq_overflow))
 		mask |= EPOLLIN | EPOLLRDNORM;
 
 	return mask;
-- 
2.31.1



* [PATCH 09/12] io_uring: wait heads renaming
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (7 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 08/12] io_uring: clean up check_overflow flag Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 10/12] io_uring: move uring_lock location Pavel Begunkov
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

We use several wait_queue_head's for different purposes, but the naming
is confusing. First rename ctx->cq_wait to ctx->poll_wait, because that
one is used for polling an io_uring instance. Then rename ctx->wait to
ctx->cq_wait, which is responsible for CQE waiting.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 65d51e2d5c15..e9bf26fbf65d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -394,7 +394,7 @@ struct io_ring_ctx {
 
 	struct {
 		struct mutex		uring_lock;
-		wait_queue_head_t	wait;
+		wait_queue_head_t	cq_wait;
 	} ____cacheline_aligned_in_smp;
 
 	/* IRQ completion list, under ->completion_lock */
@@ -415,7 +415,7 @@ struct io_ring_ctx {
 		atomic_t		cq_timeouts;
 		unsigned		cq_last_tm_flush;
 		unsigned		cq_extra;
-		struct wait_queue_head	cq_wait;
+		struct wait_queue_head	poll_wait;
 		struct fasync_struct	*cq_fasync;
 		struct eventfd_ctx	*cq_ev_fd;
 	} ____cacheline_aligned_in_smp;
@@ -1178,13 +1178,13 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	ctx->flags = p->flags;
 	init_waitqueue_head(&ctx->sqo_sq_wait);
 	INIT_LIST_HEAD(&ctx->sqd_list);
-	init_waitqueue_head(&ctx->cq_wait);
+	init_waitqueue_head(&ctx->poll_wait);
 	INIT_LIST_HEAD(&ctx->cq_overflow_list);
 	init_completion(&ctx->ref_comp);
 	xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1);
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
-	init_waitqueue_head(&ctx->wait);
+	init_waitqueue_head(&ctx->cq_wait);
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->iopoll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
@@ -1404,14 +1404,14 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
 	/* see waitqueue_active() comment */
 	smp_mb();
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->cq_wait))
+		wake_up(&ctx->cq_wait);
 	if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait))
 		wake_up(&ctx->sq_data->wait);
 	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
-	if (waitqueue_active(&ctx->cq_wait)) {
-		wake_up_interruptible(&ctx->cq_wait);
+	if (waitqueue_active(&ctx->poll_wait)) {
+		wake_up_interruptible(&ctx->poll_wait);
 		kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
 	}
 }
@@ -1422,13 +1422,13 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
 	smp_mb();
 
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
-		if (waitqueue_active(&ctx->wait))
-			wake_up(&ctx->wait);
+		if (waitqueue_active(&ctx->cq_wait))
+			wake_up(&ctx->cq_wait);
 	}
 	if (io_should_trigger_evfd(ctx))
 		eventfd_signal(ctx->cq_ev_fd, 1);
-	if (waitqueue_active(&ctx->cq_wait)) {
-		wake_up_interruptible(&ctx->cq_wait);
+	if (waitqueue_active(&ctx->poll_wait)) {
+		wake_up_interruptible(&ctx->poll_wait);
 		kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
 	}
 }
@@ -7056,10 +7056,10 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			ret = -EBUSY;
 			break;
 		}
-		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
+		prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
 		ret = io_cqring_wait_schedule(ctx, &iowq, &timeout);
-		finish_wait(&ctx->wait, &iowq.wq);
+		finish_wait(&ctx->cq_wait, &iowq.wq);
 		cond_resched();
 	} while (ret > 0);
 
@@ -8678,7 +8678,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	struct io_ring_ctx *ctx = file->private_data;
 	__poll_t mask = 0;
 
-	poll_wait(file, &ctx->cq_wait, wait);
+	poll_wait(file, &ctx->poll_wait, wait);
 	/*
 	 * synchronizes with barrier from wq_has_sleeper call in
 	 * io_commit_cqring
-- 
2.31.1



* [PATCH 10/12] io_uring: move uring_lock location
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (8 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 09/12] io_uring: wait heads renaming Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 11/12] io_uring: refactor io_req_defer() Pavel Begunkov
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

->uring_lock is used predominantly for submission, even though it
protects many other things like iopoll, registration, selected bufs, and
more. Yet it's placed together with ->cq_wait, which is poked by the
completion and CQ-waiting sides. Move them apart: ->uring_lock goes to
the submission data, and cq_wait to the completion-related chunk. The
latter requires some reshuffling so that everything needed by
io_cqring_ev_posted*() is in one cacheline.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9bf26fbf65d..1b6cfc6b79c5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -356,6 +356,8 @@ struct io_ring_ctx {
 
 	/* submission data */
 	struct {
+		struct mutex		uring_lock;
+
 		/*
 		 * Ring buffer of indices into array of io_uring_sqe, which is
 		 * mmapped by the application using the IORING_OFF_SQES offset.
@@ -392,11 +394,6 @@ struct io_ring_ctx {
 		unsigned		sq_thread_idle;
 	} ____cacheline_aligned_in_smp;
 
-	struct {
-		struct mutex		uring_lock;
-		wait_queue_head_t	cq_wait;
-	} ____cacheline_aligned_in_smp;
-
 	/* IRQ completion list, under ->completion_lock */
 	struct list_head	locked_free_list;
 	unsigned int		locked_free_nr;
@@ -412,12 +409,13 @@ struct io_ring_ctx {
 	struct {
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
-		atomic_t		cq_timeouts;
-		unsigned		cq_last_tm_flush;
-		unsigned		cq_extra;
+		struct eventfd_ctx	*cq_ev_fd;
 		struct wait_queue_head	poll_wait;
+		struct wait_queue_head	cq_wait;
+		unsigned		cq_extra;
+		atomic_t		cq_timeouts;
 		struct fasync_struct	*cq_fasync;
-		struct eventfd_ctx	*cq_ev_fd;
+		unsigned		cq_last_tm_flush;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
-- 
2.31.1



* [PATCH 11/12] io_uring: refactor io_req_defer()
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (9 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 10/12] io_uring: move uring_lock location Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-14 22:37 ` [PATCH 12/12] io_uring: optimise non-drain path Pavel Begunkov
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

Rename io_req_defer() to io_drain_req() and refactor it, decoupling it
from io_queue_sqe()'s error handling and preparing for upcoming
optimisations. Also, prioritise the non-IOSQE_ASYNC path.
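
A sketch (toy types, not the io_uring structs) of the interface shape
this moves towards: the drain helper reports whether it consumed the
request and deals with its own failures, instead of leaking
-EIOCBQUEUED-style codes to the caller.

#include <stdbool.h>
#include <stdio.h>

struct toy_req {
	bool needs_drain;
	bool force_async;
};

static bool toy_drain_req(struct toy_req *req)
{
	if (!req->needs_drain)
		return false;		/* caller keeps going with the request */
	puts("request deferred");	/* failures are completed in here too */
	return true;			/* caller must not touch the request again */
}

static void toy_queue_sqe(struct toy_req *req)
{
	if (toy_drain_req(req))
		return;
	if (!req->force_async)
		puts("issue inline");	/* the prioritised non-ASYNC path */
	else
		puts("punt to async worker");
}

int main(void)
{
	struct toy_req drained = { .needs_drain = true };
	struct toy_req plain = { 0 };

	toy_queue_sqe(&drained);
	toy_queue_sqe(&plain);
	return 0;
}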

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 39 +++++++++++++++++++--------------------
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1b6cfc6b79c5..29b705201ca3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5998,7 +5998,7 @@ static u32 io_get_sequence(struct io_kiocb *req)
 	return ctx->cached_sq_head - nr_reqs;
 }
 
-static int io_req_defer(struct io_kiocb *req)
+static bool io_drain_req(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_defer_entry *de;
@@ -6008,27 +6008,29 @@ static int io_req_defer(struct io_kiocb *req)
 	/* Still need defer if there is pending req in defer list. */
 	if (likely(list_empty_careful(&ctx->defer_list) &&
 		!(req->flags & REQ_F_IO_DRAIN)))
-		return 0;
+		return false;
 
 	seq = io_get_sequence(req);
 	/* Still a chance to pass the sequence check */
 	if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list))
-		return 0;
+		return false;
 
 	ret = io_req_prep_async(req);
 	if (ret)
 		return ret;
 	io_prep_async_link(req);
 	de = kmalloc(sizeof(*de), GFP_KERNEL);
-	if (!de)
-		return -ENOMEM;
+	if (!de) {
+		io_req_complete_failed(req, ret);
+		return true;
+	}
 
 	spin_lock_irq(&ctx->completion_lock);
 	if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) {
 		spin_unlock_irq(&ctx->completion_lock);
 		kfree(de);
 		io_queue_async_work(req);
-		return -EIOCBQUEUED;
+		return true;
 	}
 
 	trace_io_uring_defer(ctx, req, req->user_data);
@@ -6036,7 +6038,7 @@ static int io_req_defer(struct io_kiocb *req)
 	de->seq = seq;
 	list_add_tail(&de->list, &ctx->defer_list);
 	spin_unlock_irq(&ctx->completion_lock);
-	return -EIOCBQUEUED;
+	return true;
 }
 
 static void io_clean_op(struct io_kiocb *req)
@@ -6447,21 +6449,18 @@ static void __io_queue_sqe(struct io_kiocb *req)
 
 static void io_queue_sqe(struct io_kiocb *req)
 {
-	int ret;
+	if (io_drain_req(req))
+		return;
 
-	ret = io_req_defer(req);
-	if (ret) {
-		if (ret != -EIOCBQUEUED) {
-fail_req:
-			io_req_complete_failed(req, ret);
-		}
-	} else if (req->flags & REQ_F_FORCE_ASYNC) {
-		ret = io_req_prep_async(req);
-		if (unlikely(ret))
-			goto fail_req;
-		io_queue_async_work(req);
-	} else {
+	if (likely(!(req->flags & REQ_F_FORCE_ASYNC))) {
 		__io_queue_sqe(req);
+	} else {
+		int ret = io_req_prep_async(req);
+
+		if (unlikely(ret))
+			io_req_complete_failed(req, ret);
+		else
+			io_queue_async_work(req);
 	}
 }
 
-- 
2.31.1



* [PATCH 12/12] io_uring: optimise non-drain path
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (10 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 11/12] io_uring: refactor io_req_defer() Pavel Begunkov
@ 2021-06-14 22:37 ` Pavel Begunkov
  2021-06-15 12:27 ` [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
  2021-06-15 21:38 ` Jens Axboe
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-14 22:37 UTC (permalink / raw)
  To: Jens Axboe, io-uring

Replace drain checks with a one-way flag set upon seeing the first
IOSQE_IO_DRAIN request. It cuts cycles well in several places:

1) It's much faster than the previous fast check with two conditions in
io_drain_req(), including the fairly complex list_empty_careful().

2) io_queue_sqe() can now be marked inline, which is a huge win.

3) It replaces the timeout and drain checks in io_commit_cqring() with a
single flags test. It's also good not to touch ->defer_list there without
a reason, limiting cache bouncing.

It adds a small amount of overhead to the drain path, but it's
negligible. The main nuisance is that once any DRAIN request is seen
during an io_uring instance's lifetime, it will _always_ go through the
slower path, so applications without draining and offset-mode timeouts
are preferable. The overhead in that case is not big, but it's worth
bearing in mind.
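
A toy version (names illustrative, not the kernel code) of the combined
gate: one unlikely branch covers both rare slow paths, so the common case
tests nothing else before publishing the CQ tail.

#include <stdbool.h>
#include <stdio.h>

struct toy_ctx {
	bool off_timeout_used;	/* one-way flags, set on first use */
	bool drain_used;
};

static void toy_flush_timeouts(void) { puts("flush offset-mode timeouts"); }
static void toy_queue_deferred(void) { puts("queue deferred requests"); }

static void toy_commit_cqring(struct toy_ctx *ctx)
{
	if (__builtin_expect(ctx->off_timeout_used || ctx->drain_used, 0)) {
		if (ctx->off_timeout_used)
			toy_flush_timeouts();
		if (ctx->drain_used)
			toy_queue_deferred();
	}
	/* the real code then publishes the CQ tail with a release store */
}

int main(void)
{
	struct toy_ctx ctx = { .drain_used = true };

	toy_commit_cqring(&ctx);
	return 0;
}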

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 57 +++++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 29b705201ca3..5828ffdbea82 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -352,6 +352,7 @@ struct io_ring_ctx {
 		unsigned int		eventfd_async: 1;
 		unsigned int		restricted: 1;
 		unsigned int		off_timeout_used: 1;
+		unsigned int		drain_used: 1;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
@@ -1299,9 +1300,9 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
 	}
 }
 
-static void __io_queue_deferred(struct io_ring_ctx *ctx)
+static void io_queue_deferred(struct io_ring_ctx *ctx)
 {
-	do {
+	while (!list_empty(&ctx->defer_list)) {
 		struct io_defer_entry *de = list_first_entry(&ctx->defer_list,
 						struct io_defer_entry, list);
 
@@ -1310,17 +1311,12 @@ static void __io_queue_deferred(struct io_ring_ctx *ctx)
 		list_del_init(&de->list);
 		io_req_task_queue(de->req);
 		kfree(de);
-	} while (!list_empty(&ctx->defer_list));
+	}
 }
 
 static void io_flush_timeouts(struct io_ring_ctx *ctx)
 {
-	u32 seq;
-
-	if (likely(!ctx->off_timeout_used))
-		return;
-
-	seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
+	u32 seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
 
 	while (!list_empty(&ctx->timeout_list)) {
 		u32 events_needed, events_got;
@@ -1350,13 +1346,14 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
 
 static void io_commit_cqring(struct io_ring_ctx *ctx)
 {
-	io_flush_timeouts(ctx);
-
+	if (unlikely(ctx->off_timeout_used || ctx->drain_used)) {
+		if (ctx->off_timeout_used)
+			io_flush_timeouts(ctx);
+		if (ctx->drain_used)
+			io_queue_deferred(ctx);
+	}
 	/* order cqe stores with ring update */
 	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
-
-	if (unlikely(!list_empty(&ctx->defer_list)))
-		__io_queue_deferred(ctx);
 }
 
 static inline bool io_sqring_full(struct io_ring_ctx *ctx)
@@ -6447,9 +6444,9 @@ static void __io_queue_sqe(struct io_kiocb *req)
 		io_queue_linked_timeout(linked_timeout);
 }
 
-static void io_queue_sqe(struct io_kiocb *req)
+static inline void io_queue_sqe(struct io_kiocb *req)
 {
-	if (io_drain_req(req))
+	if (unlikely(req->ctx->drain_used) && io_drain_req(req))
 		return;
 
 	if (likely(!(req->flags & REQ_F_FORCE_ASYNC))) {
@@ -6573,6 +6570,23 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		io_req_complete_failed(req, ret);
 		return ret;
 	}
+
+	if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
+		ctx->drain_used = true;
+
+		/*
+		 * Taking sequential execution of a link, draining both sides
+		 * of the link also fullfils IOSQE_IO_DRAIN semantics for all
+		 * requests in the link. So, it drains the head and the
+		 * next after the link request. The last one is done via
+		 * drain_next flag to persist the effect across calls.
+		 */
+		if (link->head) {
+			link->head->flags |= REQ_F_IO_DRAIN;
+			ctx->drain_next = 1;
+		}
+	}
+
 	ret = io_req_prep(req, sqe);
 	if (unlikely(ret))
 		goto fail_req;
@@ -6591,17 +6605,6 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	if (link->head) {
 		struct io_kiocb *head = link->head;
 
-		/*
-		 * Taking sequential execution of a link, draining both sides
-		 * of the link also fullfils IOSQE_IO_DRAIN semantics for all
-		 * requests in the link. So, it drains the head and the
-		 * next after the link request. The last one is done via
-		 * drain_next flag to persist the effect across calls.
-		 */
-		if (req->flags & REQ_F_IO_DRAIN) {
-			head->flags |= REQ_F_IO_DRAIN;
-			ctx->drain_next = 1;
-		}
 		ret = io_req_prep_async(req);
 		if (unlikely(ret))
 			goto fail_req;
-- 
2.31.1



* Re: [PATCH 5.14 00/12] for-next optimisations
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (11 preceding siblings ...)
  2021-06-14 22:37 ` [PATCH 12/12] io_uring: optimise non-drain path Pavel Begunkov
@ 2021-06-15 12:27 ` Pavel Begunkov
  2021-06-15 21:38 ` Jens Axboe
  13 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2021-06-15 12:27 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 6/14/21 11:37 PM, Pavel Begunkov wrote:
> There are two main lines of work intertwined here. The first is pt.2 of
> ctx field shuffling for better caching. There are a couple of things left
> on that front.
> 
> The second is optimising the (presumably) rarely used offset-based
> timeouts and draining. There is a downside (see 12/12), which will be
> fixed later. The plan is to queue a task_work clearing drain_used (under
> uring_lock) from io_queue_deferred() once all drained requests are gone.

There is a much better place to disable it again: in io_drain_req().
The series is good to go; I'll patch on top later.

> nops(batch=32):
>     15.9 MIOPS vs 17.3 MIOPS
> nullblk (irqmode=2 completion_nsec=0 submit_queues=16), no merges, no stat
>     1002 KIOPS vs 1050 KIOPS
> 
> Though the second test is very slow compared to what I've seen before,
> so it might not be representative.
> 
> Pavel Begunkov (12):
>   io_uring: keep SQ pointers in a single cacheline
>   io_uring: move ctx->flags from SQ cacheline
>   io_uring: shuffle more fields into SQ ctx section
>   io_uring: refactor io_get_sqe()
>   io_uring: don't cache number of dropped SQEs
>   io_uring: optimise completion timeout flushing
>   io_uring: small io_submit_sqe() optimisation
>   io_uring: clean up check_overflow flag
>   io_uring: wait heads renaming
>   io_uring: move uring_lock location
>   io_uring: refactor io_req_defer()
>   io_uring: optimise non-drain path
> 
>  fs/io_uring.c | 226 +++++++++++++++++++++++++-------------------------
>  1 file changed, 111 insertions(+), 115 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [PATCH 5.14 00/12] for-next optimisations
  2021-06-14 22:37 [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
                   ` (12 preceding siblings ...)
  2021-06-15 12:27 ` [PATCH 5.14 00/12] for-next optimisations Pavel Begunkov
@ 2021-06-15 21:38 ` Jens Axboe
  13 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2021-06-15 21:38 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 6/14/21 4:37 PM, Pavel Begunkov wrote:
> There are two main lines of work intertwined here. The first is pt.2 of
> ctx field shuffling for better caching. There are a couple of things left
> on that front.
> 
> The second is optimising the (presumably) rarely used offset-based
> timeouts and draining. There is a downside (see 12/12), which will be
> fixed later. The plan is to queue a task_work clearing drain_used (under
> uring_lock) from io_queue_deferred() once all drained requests are gone.
> 
> nops(batch=32):
>     15.9 MIOPS vs 17.3 MIOPS
> nullblk (irqmode=2 completion_nsec=0 submit_queues=16), no merges, no stat
>     1002 KIOPS vs 1050 KIOPS
> 
> Though the second test is very slow compared to what I've seen before,
> so it might not be representative.

Applied, thanks. I'll run this through my testing, too.

-- 
Jens Axboe


