[PATCH v5 0/6] task work optimization

io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/6] task work optimization
@ 2021-11-24 12:21 Hao Xu
  2021-11-24 12:21 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

v4->v5
- change the implementation of merge_wq_list

Hao Xu (6):
  io-wq: add helper to merge two wq_lists
  io_uring: add a priority tw list for irq completion work
  io_uring: add helper for task work execution code
  io_uring: split io_req_complete_post() and add a helper
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  io_uring: batch completion in prior_task_list

 fs/io-wq.h    |  22 +++++++
 fs/io_uring.c | 158 +++++++++++++++++++++++++++++++++-----------------
 2 files changed, 128 insertions(+), 52 deletions(-)

-- 
2.24.4


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/6] io-wq: add helper to merge two wq_lists
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
@ 2021-11-24 12:21 ` Hao Xu
  2021-11-24 12:21 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

add a helper to merge two wq_lists, it will be useful in the next
patches.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io-wq.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 41bf37674a49..3709b7c5ec98 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -52,6 +52,28 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 		list->last = node;
 }
 
+/**
+ * wq_list_merge - merge the second list to the first one.
+ * @list0: the first list
+ * @list1: the second list
+ * Return the first node after mergence.
+ */
+static inline struct io_wq_work_node *wq_list_merge(struct io_wq_work_list *list0,
+						    struct io_wq_work_list *list1)
+{
+	struct io_wq_work_node *ret;
+
+	if (!list0->first) {
+		ret = list1->first;
+	} else {
+		ret = list0->first;
+		list0->last->next = list1->first;
+	}
+	INIT_WQ_LIST(list0);
+	INIT_WQ_LIST(list1);
+	return ret;
+}
+
 static inline void wq_list_add_tail(struct io_wq_work_node *node,
 				    struct io_wq_work_list *list)
 {
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
  2021-11-24 12:21 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-11-24 12:21 ` Hao Xu
  2021-11-24 12:21 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, some are just to complete a req
and generate a cqe. Let's put the work to a new tw list which has a
higher priority, so that it can be handled quickly and thus to reduce
avg req latency and users can issue next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization is only for those that submission and
reaping process are in different threads. Otherwise anyhow we have to
submit new sqes after returning to userspace, then the order of TWs
doesn't matter.

Tested this patch(and the following ones) by manually replace
__io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
'heavy' task works. Then test with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth.
The peak IOPS for this patch is 710K, while the old one is 665K.
For avg latency, difference shows when iodepth grow:
depth and avg latency(usec):
	depth      new          old
	 1        7.05         7.10
	 2        8.47         8.60
	 4        10.42        10.42
	 8        13.78        13.22
	 16       27.41        24.33
	 32       49.40        53.08
	 64       102.53       103.36
	 128      196.98       205.61
	 256      372.99       414.88
         512      747.23       791.30
         1024     1472.59      1538.72
         2048     3153.49      3329.01
         4096     6387.86      6682.54
         8192     12150.25     12774.14
         16384    23085.58     26044.71

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c887e4e19e9e..aca68c4a7964 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -467,6 +467,7 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
+	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
 };
@@ -2148,12 +2149,12 @@ static void tctx_task_work(struct callback_head *cb)
 	while (1) {
 		struct io_wq_work_node *node;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
-		INIT_WQ_LIST(&tctx->task_list);
+		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2182,7 +2183,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
@@ -2194,7 +2195,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (priority)
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2219,8 +2223,7 @@ static void io_req_task_work_add(struct io_kiocb *req)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->task_running = false;
-	node = tctx->task_list.first;
-	INIT_WQ_LIST(&tctx->task_list);
+	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2257,19 +2260,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2374,7 +2377,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2677,7 +2680,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -5248,7 +5251,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5914,7 +5917,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6880,7 +6883,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8584,6 +8587,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/6] io_uring: add helper for task work execution code
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
  2021-11-24 12:21 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
  2021-11-24 12:21 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
@ 2021-11-24 12:21 ` Hao Xu
  2021-11-24 12:22 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper for task work execution code. We will use it later.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index aca68c4a7964..4f7ec0a4f080 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2139,6 +2139,25 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+		}
+		req->io_task_work.func(req, locked);
+		node = next;
+	} while (node);
+}
+
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
@@ -2161,22 +2180,7 @@ static void tctx_task_work(struct callback_head *cb)
 		if (!node)
 			break;
 
-		do {
-			struct io_wq_work_node *next = node->next;
-			struct io_kiocb *req = container_of(node, struct io_kiocb,
-							    io_task_work.node);
-
-			if (req->ctx != ctx) {
-				ctx_flush_and_put(ctx, &locked);
-				ctx = req->ctx;
-				/* if not contended, grab and improve batching */
-				locked = mutex_trylock(&ctx->uring_lock);
-				percpu_ref_get(&ctx->refs);
-			}
-			req->io_task_work.func(req, &locked);
-			node = next;
-		} while (node);
-
+		handle_tw_list(node, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
                   ` (2 preceding siblings ...)
  2021-11-24 12:21 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
@ 2021-11-24 12:22 ` Hao Xu
  2021-11-24 12:22 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Split io_req_complete_post(), this is a prep for the next patch.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4f7ec0a4f080..a6eb75f95c80 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1818,12 +1818,11 @@ static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
 	return __io_cqring_fill_event(ctx, user_data, res, cflags);
 }
 
-static void io_req_complete_post(struct io_kiocb *req, s32 res,
+static void __io_req_complete_post(struct io_kiocb *req, s32 res,
 				 u32 cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	spin_lock(&ctx->completion_lock);
 	__io_cqring_fill_event(ctx, req->user_data, res, cflags);
 	/*
 	 * If we're the last reference to this request, add to our locked
@@ -1844,6 +1843,15 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
 		wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
 		ctx->locked_free_nr++;
 	}
+}
+
+static void io_req_complete_post(struct io_kiocb *req, long res,
+				 unsigned int cflags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock(&ctx->completion_lock);
+	__io_req_complete_post(req, res, cflags);
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
                   ` (3 preceding siblings ...)
  2021-11-24 12:22 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
@ 2021-11-24 12:22 ` Hao Xu
  2021-11-24 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
  2021-11-24 21:41 ` [PATCH v5 0/6] task work optimization Pavel Begunkov
  6 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Move them up to avoid explicit declaration. We will use them in later
patches.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a6eb75f95c80..e9d12fb0501a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2147,6 +2147,24 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
+{
+	unsigned int cflags;
+
+	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	cflags |= IORING_CQE_F_BUFFER;
+	req->flags &= ~REQ_F_BUFFER_SELECTED;
+	kfree(kbuf);
+	return cflags;
+}
+
+static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
+{
+	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
+		return 0;
+	return io_put_kbuf(req, req->kbuf);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2408,24 +2426,6 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 	return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head;
 }
 
-static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
-{
-	unsigned int cflags;
-
-	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
-	cflags |= IORING_CQE_F_BUFFER;
-	req->flags &= ~REQ_F_BUFFER_SELECTED;
-	kfree(kbuf);
-	return cflags;
-}
-
-static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
-{
-	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
-		return 0;
-	return io_put_kbuf(req, req->kbuf);
-}
-
 static inline bool io_run_task_work(void)
 {
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 6/6] io_uring: batch completion in prior_task_list
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
                   ` (4 preceding siblings ...)
  2021-11-24 12:22 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
@ 2021-11-24 12:22 ` Hao Xu
  2021-11-24 21:41 ` [PATCH v5 0/6] task work optimization Pavel Begunkov
  6 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-24 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

In previous patches, we have already gathered some tw with
io_req_task_complete() as callback in prior_task_list, let's complete
them in batch regardless uring lock. For instance, we are doing simple
direct read, most task work will be io_req_task_complete(), with this
patch we don't need to hold uring lock there for long time.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9d12fb0501a..8bfef48041d8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2165,6 +2165,37 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, req->kbuf);
 }
 
+static void handle_prior_tw_list(struct io_wq_work_node *node)
+{
+	struct io_ring_ctx *ctx = NULL;
+
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+		if (req->ctx != ctx) {
+			if (ctx) {
+				io_commit_cqring(ctx);
+				spin_unlock(&ctx->completion_lock);
+				io_cqring_ev_posted(ctx);
+				percpu_ref_put(&ctx->refs);
+			}
+			ctx = req->ctx;
+			percpu_ref_get(&ctx->refs);
+			spin_lock(&ctx->completion_lock);
+		}
+		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	if (ctx) {
+		io_commit_cqring(ctx);
+		spin_unlock(&ctx->completion_lock);
+		io_cqring_ev_posted(ctx);
+		percpu_ref_put(&ctx->refs);
+	}
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2192,21 +2223,28 @@ static void tctx_task_work(struct callback_head *cb)
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
+		struct io_wq_work_node *node1, *node2;
 
-		if (!tctx->prior_task_list.first &&
-		    !tctx->task_list.first && locked)
+		if (!tctx->task_list.first &&
+		    !tctx->prior_task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		if (!node)
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
+		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
+		if (!node2 && !node1)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node2 && !node1)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (node1)
+			handle_prior_tw_list(node1);
+
+		if (node2)
+			handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/6] task work optimization
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
                   ` (5 preceding siblings ...)
  2021-11-24 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
@ 2021-11-24 21:41 ` Pavel Begunkov
  2021-11-25 11:37   ` Hao Xu
  6 siblings, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2021-11-24 21:41 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 11/24/21 12:21, Hao Xu wrote:
> v4->v5
> - change the implementation of merge_wq_list

They only concern I had was about 6/6 not using inline completion
infra, when it's faster to grab ->uring_lock. i.e.
io_submit_flush_completions(), which should be faster when batching
is good.

Looking again through the code, the only user is SQPOLL

io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));

And with SQPOLL the lock is mostly grabbed by the SQPOLL task only,
IOW for pure block rw there shouldn't be any contention.
Doesn't make much sense, what am I missing?
How many requests are completed on average per tctx_task_work()?


It doesn't apply to for-5.17/io_uring, here is a rebase:
https://github.com/isilence/linux.git haoxu_tw_opt
link: https://github.com/isilence/linux/tree/haoxu_tw_opt

With that first 5 patches look good, so for them:
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>

but I still don't understand how 6/6 is better. Can it be because of
indirect branching? E.g. would something like this give the result?

- req->io_task_work.func(req, locked);
+ INDIRECT_CALL_1(req->io_task_work.func, io_req_task_complete, req, locked);


> Hao Xu (6):
>    io-wq: add helper to merge two wq_lists
>    io_uring: add a priority tw list for irq completion work
>    io_uring: add helper for task work execution code
>    io_uring: split io_req_complete_post() and add a helper
>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>    io_uring: batch completion in prior_task_list
> 
>   fs/io-wq.h    |  22 +++++++
>   fs/io_uring.c | 158 +++++++++++++++++++++++++++++++++-----------------
>   2 files changed, 128 insertions(+), 52 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/6] task work optimization
  2021-11-24 21:41 ` [PATCH v5 0/6] task work optimization Pavel Begunkov
@ 2021-11-25 11:37   ` Hao Xu
  2021-11-25 15:27     ` Pavel Begunkov
  0 siblings, 1 reply; 11+ messages in thread
From: Hao Xu @ 2021-11-25 11:37 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/11/25 上午5:41, Pavel Begunkov 写道:
> On 11/24/21 12:21, Hao Xu wrote:
>> v4->v5
>> - change the implementation of merge_wq_list
> 
> They only concern I had was about 6/6 not using inline completion
> infra, when it's faster to grab ->uring_lock. i.e.
> io_submit_flush_completions(), which should be faster when batching
> is good.
> 
> Looking again through the code, the only user is SQPOLL
> 
> io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
> 
> And with SQPOLL the lock is mostly grabbed by the SQPOLL task only,
> IOW for pure block rw there shouldn't be any contention.
There still could be other type of task work, like async buffered reads.
I considered generic situation where different kinds of task works mixed
in the task list, then the inline completion infra always handle the
completions at the end, while in this new batching, we first handle the
completions and commit_cqring then do other task works.
Btw, I'm not sure the inline completion infra is faster than this
batching in pure rw completion(where all the task works are completion)
case, from the code, seems they are similar. Any hints about this?

Regards,
Hao
> Doesn't make much sense, what am I missing?
> How many requests are completed on average per tctx_task_work()?
> 
> 
> It doesn't apply to for-5.17/io_uring, here is a rebase:
> https://github.com/isilence/linux.git haoxu_tw_opt
> link: https://github.com/isilence/linux/tree/haoxu_tw_opt
> 
> With that first 5 patches look good, so for them:
> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> but I still don't understand how 6/6 is better. Can it be because of
> indirect branching? E.g. would something like this give the result?
> 
> - req->io_task_work.func(req, locked);
> + INDIRECT_CALL_1(req->io_task_work.func, io_req_task_complete, req, 
> locked);
> 
> 
>> Hao Xu (6):
>>    io-wq: add helper to merge two wq_lists
>>    io_uring: add a priority tw list for irq completion work
>>    io_uring: add helper for task work execution code
>>    io_uring: split io_req_complete_post() and add a helper
>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>    io_uring: batch completion in prior_task_list
>>
>>   fs/io-wq.h    |  22 +++++++
>>   fs/io_uring.c | 158 +++++++++++++++++++++++++++++++++-----------------
>>   2 files changed, 128 insertions(+), 52 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/6] task work optimization
  2021-11-25 11:37   ` Hao Xu
@ 2021-11-25 15:27     ` Pavel Begunkov
  2021-11-26  3:58       ` Hao Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2021-11-25 15:27 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 11/25/21 11:37, Hao Xu wrote:
> 在 2021/11/25 上午5:41, Pavel Begunkov 写道:
>> On 11/24/21 12:21, Hao Xu wrote:
>>> v4->v5
>>> - change the implementation of merge_wq_list
>>
>> They only concern I had was about 6/6 not using inline completion
>> infra, when it's faster to grab ->uring_lock. i.e.
>> io_submit_flush_completions(), which should be faster when batching
>> is good.
>>
>> Looking again through the code, the only user is SQPOLL
>>
>> io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
>>
>> And with SQPOLL the lock is mostly grabbed by the SQPOLL task only,
>> IOW for pure block rw there shouldn't be any contention.
> There still could be other type of task work, like async buffered reads.
> I considered generic situation where different kinds of task works mixed
> in the task list, then the inline completion infra always handle the
> completions at the end, while in this new batching, we first handle the
> completions and commit_cqring then do other task works.

I was talking about 6/6 in particular. The reordering (done by first
2 or 3 patches) sound plausible, but if compare say 1-5 vs same but
+ patch 6/6

> Btw, I'm not sure the inline completion infra is faster than this
> batching in pure rw completion(where all the task works are completion)
> case, from the code, seems they are similar. Any hints about this?

Was looking through, and apparently I placed task_put optimisation
into io_req_complete_post() as well, see io_put_task().

pros of io_submit_flush_completions:
1) batched rsrc refs put
2) a bit better on assembly
3) shorter spin section (separate loop)
4) enqueueing right into ctx->submit_state.free_list, so no
    1 io_flush_cached_reqs() per IO_COMPL_BATCH=32

pros of io_req_complete_post() path:
1) no uring_lock locking (not contended)
2) de-virtualisation
3) no extra (yet another) list traversal and io_req_complete_state()

So, with put_task optimised, indeed not so clear which would win.
Did you use fixed rsrc for testing? (files or buffers)

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v5 0/6] task work optimization
  2021-11-25 15:27     ` Pavel Begunkov
@ 2021-11-26  3:58       ` Hao Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-11-26  3:58 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/11/25 下午11:27, Pavel Begunkov 写道:
> On 11/25/21 11:37, Hao Xu wrote:
>> 在 2021/11/25 上午5:41, Pavel Begunkov 写道:
>>> On 11/24/21 12:21, Hao Xu wrote:
>>>> v4->v5
>>>> - change the implementation of merge_wq_list
>>>
>>> They only concern I had was about 6/6 not using inline completion
>>> infra, when it's faster to grab ->uring_lock. i.e.
>>> io_submit_flush_completions(), which should be faster when batching
>>> is good.
>>>
>>> Looking again through the code, the only user is SQPOLL
>>>
>>> io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
>>>
>>> And with SQPOLL the lock is mostly grabbed by the SQPOLL task only,
>>> IOW for pure block rw there shouldn't be any contention.
>> There still could be other type of task work, like async buffered reads.
>> I considered generic situation where different kinds of task works mixed
>> in the task list, then the inline completion infra always handle the
>> completions at the end, while in this new batching, we first handle the
>> completions and commit_cqring then do other task works.
> 
> I was talking about 6/6 in particular. The reordering (done by first
> 2 or 3 patches) sound plausible, but if compare say 1-5 vs same but
> + patch 6/6
Ah, sorry.. misremember the content of 6/6 and the previous ones.
> 
>> Btw, I'm not sure the inline completion infra is faster than this
>> batching in pure rw completion(where all the task works are completion)
>> case, from the code, seems they are similar. Any hints about this?
> 
> Was looking through, and apparently I placed task_put optimisation
> into io_req_complete_post() as well, see io_put_task().
> 
> pros of io_submit_flush_completions:
> 1) batched rsrc refs put
> 2) a bit better on assembly
> 3) shorter spin section (separate loop)
> 4) enqueueing right into ctx->submit_state.free_list, so no
>     1 io_flush_cached_reqs() per IO_COMPL_BATCH=32
> 
> pros of io_req_complete_post() path:
> 1) no uring_lock locking (not contended)
> 2) de-virtualisation
> 3) no extra (yet another) list traversal and io_req_complete_state()
> 
> So, with put_task optimised, indeed not so clear which would win > Did you use fixed rsrc for testing? (files or buffers)
No, I didn't. Let's first play it safe as you said:
if (locked) flush_completions
else new stuff
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-11-26  4:00 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
2021-11-24 12:21 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
2021-11-24 12:21 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
2021-11-24 12:21 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
2021-11-24 12:22 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
2021-11-24 12:22 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
2021-11-24 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
2021-11-24 21:41 ` [PATCH v5 0/6] task work optimization Pavel Begunkov
2021-11-25 11:37   ` Hao Xu
2021-11-25 15:27     ` Pavel Begunkov
2021-11-26  3:58       ` Hao Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).