All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH for-5.16 v3 0/8] task work optimization
@ 2021-10-27 14:02 Hao Xu
  2021-10-27 14:02 ` [PATCH 1/8] io-wq: add helper to merge two wq_lists Hao Xu
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Tested this patchset by manually replace __io_queue_sqe() in
io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
Then test with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

2/8 set unlimited priority_task_list, 8/8 set a limitation of
1/3 * (len_prority_list + len_normal_list), data below:
   depth     no 8/8   include 8/8      before this patchset
    1        7.05         7.82              7.10
    2        8.47         8.48              8.60
    4        10.42        9.99              10.42
    8        13.78        13.13             13.22
    16       27.41        27.92             24.33
    32       49.40        46.16             53.08
    64       102.53       105.68            103.36
    128      196.98       202.76            205.61
    256      372.99       375.61            414.88
    512      747.23       763.95            791.30
    1024     1472.59      1527.46           1538.72
    2048     3153.49      3129.22           3329.01
    4096     6387.86      5899.74           6682.54
    8192     12150.25     12433.59          12774.14
    16384    23085.58     24342.84          26044.71

It seems 2/8 is better, haven't tried other choices other than 1/3,
still put 8/8 here for people's further thoughts.

Hao Xu (8):
  io-wq: add helper to merge two wq_lists
  io_uring: add a priority tw list for irq completion work
  io_uring: add helper for task work execution code
  io_uring: split io_req_complete_post() and add a helper
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  io_uring: add nr_ctx to record the number of ctx in a task
  io_uring: batch completion in prior_task_list
  io_uring: add limited number of TWs to priority task list

 fs/io-wq.h    |  21 +++++++
 fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 138 insertions(+), 51 deletions(-)

-- 
2.24.4


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/8] io-wq: add helper to merge two wq_lists
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 14:02 ` [PATCH 2/8] io_uring: add a priority tw list for irq completion work Hao Xu
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

add a helper to merge two wq_lists, it will be useful in the next
patches.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io-wq.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 41bf37674a49..a7b0b505db9d 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -52,6 +52,27 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 		list->last = node;
 }
 
+/**
+ * wq_list_merge - merge the second list to the first one.
+ * @list0: the first list
+ * @list1: the second list
+ * Return list0 if list1 is NULL, and vice versa.
+ * Otherwise after merge, list0 contains the merged list.
+ */
+static inline struct io_wq_work_list *wq_list_merge(struct io_wq_work_list *list0,
+						    struct io_wq_work_list *list1)
+{
+	if (!list1 || !list1->first)
+		return list0;
+
+	if (!list0 || !list0->first)
+		return list1;
+
+	list0->last->next = list1->first;
+	list0->last = list1->last;
+	return list0;
+}
+
 static inline void wq_list_add_tail(struct io_wq_work_node *node,
 				    struct io_wq_work_list *list)
 {
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/8] io_uring: add a priority tw list for irq completion work
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
  2021-10-27 14:02 ` [PATCH 1/8] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 14:02 ` [PATCH 3/8] io_uring: add helper for task work execution code Hao Xu
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, some are just to complete a req
and generate a cqe. Let's put the work to a new tw list which has a
higher priority, so that it can be handled quickly and thus to reduce
avg req latency. an explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Tested this patch(and the following ones) by manually replace
__io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
'heavy' task works. Then test with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth.
The peak IOPS for this patch is 710K, while the old one is 665K.
For avg latency, difference shows when iodepth grow:
depth and avg latency(usec):
	depth      new          old
	 1        7.05         7.10
	 2        8.47         8.60
	 4        10.42        10.42
	 8        13.78        13.22
	 16       27.41        24.33
	 32       49.40        53.08
	 64       102.53       103.36
	 128      196.98       205.61
	 256      372.99       414.88
         512      747.23       791.30
         1024     1472.59      1538.72
         2048     3153.49      3329.01
         4096     6387.86      6682.54
         8192     12150.25     12774.14
         16384    23085.58     26044.71

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 88c5ee4dc242..9fdac3b9a560 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -467,6 +467,7 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
+	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
 };
@@ -2147,13 +2148,17 @@ static void tctx_task_work(struct callback_head *cb)
 
 	while (1) {
 		struct io_wq_work_node *node;
+		struct io_wq_work_list *merged_list;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
+		merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+		node = merged_list->first;
 		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2182,19 +2187,23 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
 	enum task_work_notify_mode notify;
 	struct io_wq_work_node *node;
+	struct io_wq_work_list *merged_list;
 	unsigned long flags;
 	bool running;
 
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (priority)
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2219,8 +2228,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->task_running = false;
-	node = tctx->task_list.first;
+	merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+	node = merged_list->first;
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2257,19 +2268,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, true);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2374,7 +2385,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2677,7 +2688,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, true);
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -5248,7 +5259,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5916,7 +5927,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6859,7 +6870,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8563,6 +8574,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/8] io_uring: add helper for task work execution code
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
  2021-10-27 14:02 ` [PATCH 1/8] io-wq: add helper to merge two wq_lists Hao Xu
  2021-10-27 14:02 ` [PATCH 2/8] io_uring: add a priority tw list for irq completion work Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 14:02 ` [PATCH 4/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper for task work execution code. We will use it later.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9fdac3b9a560..98abf94b2015 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2139,6 +2139,25 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+		}
+		req->io_task_work.func(req, locked);
+		node = next;
+	} while (node);
+}
+
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
@@ -2165,22 +2184,7 @@ static void tctx_task_work(struct callback_head *cb)
 		if (!node)
 			break;
 
-		do {
-			struct io_wq_work_node *next = node->next;
-			struct io_kiocb *req = container_of(node, struct io_kiocb,
-							    io_task_work.node);
-
-			if (req->ctx != ctx) {
-				ctx_flush_and_put(ctx, &locked);
-				ctx = req->ctx;
-				/* if not contended, grab and improve batching */
-				locked = mutex_trylock(&ctx->uring_lock);
-				percpu_ref_get(&ctx->refs);
-			}
-			req->io_task_work.func(req, &locked);
-			node = next;
-		} while (node);
-
+		handle_tw_list(node, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/8] io_uring: split io_req_complete_post() and add a helper
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (2 preceding siblings ...)
  2021-10-27 14:02 ` [PATCH 3/8] io_uring: add helper for task work execution code Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 14:02 ` [PATCH 5/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Split io_req_complete_post(), this is a prep for the next patch.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 98abf94b2015..3c621798dd2f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1818,12 +1818,11 @@ static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
 	return __io_cqring_fill_event(ctx, user_data, res, cflags);
 }
 
-static void io_req_complete_post(struct io_kiocb *req, s32 res,
+static void __io_req_complete_post(struct io_kiocb *req, s32 res,
 				 u32 cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	spin_lock(&ctx->completion_lock);
 	__io_cqring_fill_event(ctx, req->user_data, res, cflags);
 	/*
 	 * If we're the last reference to this request, add to our locked
@@ -1844,6 +1843,15 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
 		wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
 		ctx->locked_free_nr++;
 	}
+}
+
+static void io_req_complete_post(struct io_kiocb *req, long res,
+				 unsigned int cflags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock(&ctx->completion_lock);
+	__io_req_complete_post(req, res, cflags);
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (3 preceding siblings ...)
  2021-10-27 14:02 ` [PATCH 4/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 14:02 ` [PATCH 6/8] io_uring: add nr_ctx to record the number of ctx in a task Hao Xu
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Move them up to avoid explicit declaration. We will use them in later
patches.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3c621798dd2f..db5d9189df3a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2147,6 +2147,24 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
+{
+	unsigned int cflags;
+
+	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	cflags |= IORING_CQE_F_BUFFER;
+	req->flags &= ~REQ_F_BUFFER_SELECTED;
+	kfree(kbuf);
+	return cflags;
+}
+
+static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
+{
+	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
+		return 0;
+	return io_put_kbuf(req, req->kbuf);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2416,24 +2434,6 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 	return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head;
 }
 
-static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
-{
-	unsigned int cflags;
-
-	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
-	cflags |= IORING_CQE_F_BUFFER;
-	req->flags &= ~REQ_F_BUFFER_SELECTED;
-	kfree(kbuf);
-	return cflags;
-}
-
-static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
-{
-	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
-		return 0;
-	return io_put_kbuf(req, req->kbuf);
-}
-
 static inline bool io_run_task_work(void)
 {
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 6/8] io_uring: add nr_ctx to record the number of ctx in a task
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (4 preceding siblings ...)
  2021-10-27 14:02 ` [PATCH 5/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-28  6:27   ` Hao Xu
  2021-10-27 14:02 ` [PATCH 7/8] io_uring: batch completion in prior_task_list Hao Xu
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

This is used in the next patch for task_work batch optimization.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index db5d9189df3a..7c6d90d693b8 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -470,6 +470,7 @@ struct io_uring_task {
 	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
+	unsigned int		nr_ctx;
 };
 
 /*
@@ -9655,6 +9656,9 @@ static int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 		mutex_lock(&ctx->uring_lock);
 		list_add(&node->ctx_node, &ctx->tctx_list);
 		mutex_unlock(&ctx->uring_lock);
+		spin_lock_irq(&tctx->task_lock);
+		tctx->nr_ctx++;
+		spin_unlock_irq(&tctx->task_lock);
 	}
 	tctx->last = ctx;
 	return 0;
@@ -9692,6 +9696,9 @@ static __cold void io_uring_del_tctx_node(unsigned long index)
 	mutex_lock(&node->ctx->uring_lock);
 	list_del(&node->ctx_node);
 	mutex_unlock(&node->ctx->uring_lock);
+	spin_lock_irq(&tctx->task_lock);
+	tctx->nr_ctx--;
+	spin_unlock_irq(&tctx->task_lock);
 
 	if (tctx->last == node->ctx)
 		tctx->last = NULL;
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 7/8] io_uring: batch completion in prior_task_list
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (5 preceding siblings ...)
  2021-10-27 14:02 ` [PATCH 6/8] io_uring: add nr_ctx to record the number of ctx in a task Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 14:02 ` [PATCH 8/8] io_uring: add limited number of TWs to priority task list Hao Xu
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

In previous patches, we have already gathered some tw with
io_req_task_complete() as callback in prior_task_list, let's complete
them in batch. This is better than before in cases where !locked.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 43 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7c6d90d693b8..bf1b730df158 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2166,6 +2166,26 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, req->kbuf);
 }
 
+static void handle_prior_tw_list(struct io_wq_work_node *node)
+{
+	struct io_kiocb *req = container_of(node, struct io_kiocb, io_task_work.node);
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock(&ctx->completion_lock);
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2193,25 +2213,34 @@ static void tctx_task_work(struct callback_head *cb)
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
-		struct io_wq_work_list *merged_list;
+		unsigned int nr_ctx;
+		struct io_wq_work_node *node1, *node2;
 
 		if (!tctx->prior_task_list.first &&
 		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		node = merged_list->first;
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
 		INIT_WQ_LIST(&tctx->task_list);
 		INIT_WQ_LIST(&tctx->prior_task_list);
-		if (!node)
+		nr_ctx = tctx->nr_ctx;
+		if (!node1 && !node2)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node1 && !node2)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (node1) {
+			if (nr_ctx == 1)
+				handle_prior_tw_list(node1);
+			else
+				handle_tw_list(node1, &ctx, &locked);
+		}
+
+		if (node2)
+			handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 8/8] io_uring: add limited number of TWs to priority task list
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (6 preceding siblings ...)
  2021-10-27 14:02 ` [PATCH 7/8] io_uring: batch completion in prior_task_list Hao Xu
@ 2021-10-27 14:02 ` Hao Xu
  2021-10-27 16:39 ` [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
  2021-10-27 18:15 ` Pavel Begunkov
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 14:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

One thing to watch out is sometimes irq completion TWs comes
overwhelmingly, which makes the new tw list grows fast, and TWs in
the old list are starved. So we have to limit the length of the new
tw list. A practical value is 1/3:
    len of new tw list < 1/3 * (len of new + old tw list)

In this way, the new tw list has a limited length and normal task get
there chance to run.
Say MAX_PRIORITY_TW_RATIO is k, the number of TWs in priority list is
x, in non-priority list in is y. Then a TW can be inserted to the
priority list in the condition:
            x <= 1/k * (x + y)
          =>k * x <= x + y
          =>(1 - k) * x + y >= 0

So we just need a variable z = (1 - k) * x + y. Everytime a new TW
comes,
    if z >= 0, we add it to prio list, and z += (1 - k)
    if z < 0, we add it to non-prio list, and z++

So we just one extra operation, and we can simplify the check to:
       if (priority && k >= 0) add to prio list;

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bf1b730df158..0099decac71d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -471,6 +471,7 @@ struct io_uring_task {
 	struct callback_head	task_work;
 	bool			task_running;
 	unsigned int		nr_ctx;
+	int			factor;
 };
 
 /*
@@ -2225,6 +2226,7 @@ static void tctx_task_work(struct callback_head *cb)
 		node2 = tctx->task_list.first;
 		INIT_WQ_LIST(&tctx->task_list);
 		INIT_WQ_LIST(&tctx->prior_task_list);
+		tctx->factor = 0;
 		nr_ctx = tctx->nr_ctx;
 		if (!node1 && !node2)
 			tctx->task_running = false;
@@ -2247,6 +2249,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
+#define MAX_PRIORITY_TW_RATIO 3
 static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
@@ -2260,10 +2263,13 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	if (priority)
+	if (priority && tctx->factor >= 0) {
 		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
-	else
+		tctx->factor += (1 - MAX_PRIORITY_TW_RATIO);
+	} else {
 		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+		tctx->factor++;
+	}
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH for-5.16 v3 0/8] task work optimization
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (7 preceding siblings ...)
  2021-10-27 14:02 ` [PATCH 8/8] io_uring: add limited number of TWs to priority task list Hao Xu
@ 2021-10-27 16:39 ` Hao Xu
  2021-10-27 18:15 ` Pavel Begunkov
  9 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-27 16:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

在 2021/10/27 下午10:02, Hao Xu 写道:
> Tested this patchset by manually replace __io_queue_sqe() in
> io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
> Then test with fio:
> 
> ioengine=io_uring
> sqpoll=1
> thread=1
> bs=4k
> direct=1
> rw=randread
> time_based=1
> runtime=600
> randrepeat=0
> group_reporting=1
> filename=/dev/nvme0n1
> 
> 2/8 set unlimited priority_task_list, 8/8 set a limitation of
> 1/3 * (len_prority_list + len_normal_list), data below:
>     depth     no 8/8   include 8/8      before this patchset
>      1        7.05         7.82              7.10
>      2        8.47         8.48              8.60
>      4        10.42        9.99              10.42
>      8        13.78        13.13             13.22
>      16       27.41        27.92             24.33
>      32       49.40        46.16             53.08
>      64       102.53       105.68            103.36
>      128      196.98       202.76            205.61
>      256      372.99       375.61            414.88
>      512      747.23       763.95            791.30
>      1024     1472.59      1527.46           1538.72
>      2048     3153.49      3129.22           3329.01
>      4096     6387.86      5899.74           6682.54
>      8192     12150.25     12433.59          12774.14
>      16384    23085.58     24342.84          26044.71
> 
The data in the previous v2 version seems not right, I may miss
something at that time.
> It seems 2/8 is better, haven't tried other choices other than 1/3,
> still put 8/8 here for people's further thoughts.
> 
> Hao Xu (8):
>    io-wq: add helper to merge two wq_lists
>    io_uring: add a priority tw list for irq completion work
>    io_uring: add helper for task work execution code
>    io_uring: split io_req_complete_post() and add a helper
>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>    io_uring: add nr_ctx to record the number of ctx in a task
>    io_uring: batch completion in prior_task_list
>    io_uring: add limited number of TWs to priority task list
> 
>   fs/io-wq.h    |  21 +++++++
>   fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
>   2 files changed, 138 insertions(+), 51 deletions(-)
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-5.16 v3 0/8] task work optimization
  2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
                   ` (8 preceding siblings ...)
  2021-10-27 16:39 ` [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
@ 2021-10-27 18:15 ` Pavel Begunkov
  2021-10-28  6:07   ` Hao Xu
  9 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2021-10-27 18:15 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 10/27/21 15:02, Hao Xu wrote:
> Tested this patchset by manually replace __io_queue_sqe() in
> io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
> Then test with fio:

If submissions and completions are done by the same task it doesn't
really matter in which order they're executed because the task won't
get back to userspace execution to see CQEs until tw returns.
Furthermore, it even might be worse because the earlier you submit
the better with everything else equal.

IIRC, that's how it's with fio, right? If so, you may get better
numbers with a test that does submissions and completions in
different threads.

Also interesting to find an explanation for you numbers assuming
they're stable. 7/8 batching? How often it does it go this path?
If only one task submits requests it should already be covered
with existing batching.


> ioengine=io_uring
> sqpoll=1
> thread=1
> bs=4k
> direct=1
> rw=randread
> time_based=1
> runtime=600
> randrepeat=0
> group_reporting=1
> filename=/dev/nvme0n1
> 
> 2/8 set unlimited priority_task_list, 8/8 set a limitation of
> 1/3 * (len_prority_list + len_normal_list), data below:
>     depth     no 8/8   include 8/8      before this patchset
>      1        7.05         7.82              7.10
>      2        8.47         8.48              8.60
>      4        10.42        9.99              10.42
>      8        13.78        13.13             13.22
>      16       27.41        27.92             24.33
>      32       49.40        46.16             53.08
>      64       102.53       105.68            103.36
>      128      196.98       202.76            205.61
>      256      372.99       375.61            414.88
>      512      747.23       763.95            791.30
>      1024     1472.59      1527.46           1538.72
>      2048     3153.49      3129.22           3329.01
>      4096     6387.86      5899.74           6682.54
>      8192     12150.25     12433.59          12774.14
>      16384    23085.58     24342.84          26044.71
> 
> It seems 2/8 is better, haven't tried other choices other than 1/3,
> still put 8/8 here for people's further thoughts.
> 
> Hao Xu (8):
>    io-wq: add helper to merge two wq_lists
>    io_uring: add a priority tw list for irq completion work
>    io_uring: add helper for task work execution code
>    io_uring: split io_req_complete_post() and add a helper
>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>    io_uring: add nr_ctx to record the number of ctx in a task
>    io_uring: batch completion in prior_task_list
>    io_uring: add limited number of TWs to priority task list
> 
>   fs/io-wq.h    |  21 +++++++
>   fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
>   2 files changed, 138 insertions(+), 51 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-5.16 v3 0/8] task work optimization
  2021-10-27 18:15 ` Pavel Begunkov
@ 2021-10-28  6:07   ` Hao Xu
  2021-10-28 11:46     ` Hao Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Hao Xu @ 2021-10-28  6:07 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/10/28 上午2:15, Pavel Begunkov 写道:
> On 10/27/21 15:02, Hao Xu wrote:
>> Tested this patchset by manually replace __io_queue_sqe() in
>> io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
>> Then test with fio:
> 
> If submissions and completions are done by the same task it doesn't
> really matter in which order they're executed because the task won't
> get back to userspace execution to see CQEs until tw returns.
It may matter, it depends on the time cost of submittion
and the DMA IO time. Pick up sqpoll mode as example,
we submit 10 reqs:
t1          io_submit_sqes
             -->io_req_task_queue
t2          io_task_work_run
we actually do the submittion in t2,  but if the workload
is big engough, the 'irq completion TW' will be inserted
to the TW list after t2 is fully done, then those
'irq completion TW' will be delayed to the next round.
With this patchset, we can handle them first.
> Furthermore, it even might be worse because the earlier you submit
> the better with everything else equal.
> 
> IIRC, that's how it's with fio, right? If so, you may get better
> numbers with a test that does submissions and completions in
> different threads.
Because of the completion cache, I doubt if it works.
For single ctx, it seems we always update the cqring
pointer after all the TWs in the list are done.
> 
> Also interesting to find an explanation for you numbers assuming
The reason may be what I said above, but I don't have a
strict proof now.
> they're stable. 7/8 batching? How often it does it go this path?
> If only one task submits requests it should already be covered
> with existing batching.
the problem of the existing batch is(given there is only
one ctx):
1. we flush it after all the TWs done
2. we batch them if we have uring lock.
the new batch is:
1. don't care about uring lock
2. we can flush the completions in the priority list
    in advance.(which means userland can see it earlier.)
> 
> 
>> ioengine=io_uring
>> sqpoll=1
>> thread=1
>> bs=4k
>> direct=1
>> rw=randread
>> time_based=1
>> runtime=600
>> randrepeat=0
>> group_reporting=1
>> filename=/dev/nvme0n1
>>
>> 2/8 set unlimited priority_task_list, 8/8 set a limitation of
>> 1/3 * (len_prority_list + len_normal_list), data below:
>>     depth     no 8/8   include 8/8      before this patchset
>>      1        7.05         7.82              7.10
>>      2        8.47         8.48              8.60
>>      4        10.42        9.99              10.42
>>      8        13.78        13.13             13.22
>>      16       27.41        27.92             24.33
>>      32       49.40        46.16             53.08
>>      64       102.53       105.68            103.36
>>      128      196.98       202.76            205.61
>>      256      372.99       375.61            414.88
>>      512      747.23       763.95            791.30
>>      1024     1472.59      1527.46           1538.72
>>      2048     3153.49      3129.22           3329.01
>>      4096     6387.86      5899.74           6682.54
>>      8192     12150.25     12433.59          12774.14
>>      16384    23085.58     24342.84          26044.71
>>
>> It seems 2/8 is better, haven't tried other choices other than 1/3,
>> still put 8/8 here for people's further thoughts.
>>
>> Hao Xu (8):
>>    io-wq: add helper to merge two wq_lists
>>    io_uring: add a priority tw list for irq completion work
>>    io_uring: add helper for task work execution code
>>    io_uring: split io_req_complete_post() and add a helper
>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>    io_uring: add nr_ctx to record the number of ctx in a task
>>    io_uring: batch completion in prior_task_list
>>    io_uring: add limited number of TWs to priority task list
>>
>>   fs/io-wq.h    |  21 +++++++
>>   fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
>>   2 files changed, 138 insertions(+), 51 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 6/8] io_uring: add nr_ctx to record the number of ctx in a task
  2021-10-27 14:02 ` [PATCH 6/8] io_uring: add nr_ctx to record the number of ctx in a task Hao Xu
@ 2021-10-28  6:27   ` Hao Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-28  6:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

在 2021/10/27 下午10:02, Hao Xu 写道:
> This is used in the next patch for task_work batch optimization.
> 
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
haven't done the sqpoll part. nr_ctx works for
non-sqpoll mode for now.
> ---
>   fs/io_uring.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index db5d9189df3a..7c6d90d693b8 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -470,6 +470,7 @@ struct io_uring_task {
>   	struct io_wq_work_list	prior_task_list;
>   	struct callback_head	task_work;
>   	bool			task_running;
> +	unsigned int		nr_ctx;
>   };
>   
>   /*
> @@ -9655,6 +9656,9 @@ static int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
>   		mutex_lock(&ctx->uring_lock);
>   		list_add(&node->ctx_node, &ctx->tctx_list);
>   		mutex_unlock(&ctx->uring_lock);
> +		spin_lock_irq(&tctx->task_lock);
> +		tctx->nr_ctx++;
> +		spin_unlock_irq(&tctx->task_lock);
>   	}
>   	tctx->last = ctx;
>   	return 0;
> @@ -9692,6 +9696,9 @@ static __cold void io_uring_del_tctx_node(unsigned long index)
>   	mutex_lock(&node->ctx->uring_lock);
>   	list_del(&node->ctx_node);
>   	mutex_unlock(&node->ctx->uring_lock);
> +	spin_lock_irq(&tctx->task_lock);
> +	tctx->nr_ctx--;
> +	spin_unlock_irq(&tctx->task_lock);
>   
>   	if (tctx->last == node->ctx)
>   		tctx->last = NULL;
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-5.16 v3 0/8] task work optimization
  2021-10-28  6:07   ` Hao Xu
@ 2021-10-28 11:46     ` Hao Xu
  2021-10-28 19:08       ` Pavel Begunkov
  0 siblings, 1 reply; 16+ messages in thread
From: Hao Xu @ 2021-10-28 11:46 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/10/28 下午2:07, Hao Xu 写道:
> 在 2021/10/28 上午2:15, Pavel Begunkov 写道:
>> On 10/27/21 15:02, Hao Xu wrote:
>>> Tested this patchset by manually replace __io_queue_sqe() in
>>> io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
>>> Then test with fio:
>>
>> If submissions and completions are done by the same task it doesn't
>> really matter in which order they're executed because the task won't
>> get back to userspace execution to see CQEs until tw returns.
> It may matter, it depends on the time cost of submittion
> and the DMA IO time. Pick up sqpoll mode as example,
> we submit 10 reqs:
> t1          io_submit_sqes
>              -->io_req_task_queue
> t2          io_task_work_run
> we actually do the submittion in t2,  but if the workload
> is big engough, the 'irq completion TW' will be inserted
> to the TW list after t2 is fully done, then those
> 'irq completion TW' will be delayed to the next round.
> With this patchset, we can handle them first.
>> Furthermore, it even might be worse because the earlier you submit
>> the better with everything else equal.
>>
>> IIRC, that's how it's with fio, right? If so, you may get better
>> numbers with a test that does submissions and completions in
>> different threads.
> Because of the completion cache, I doubt if it works.
> For single ctx, it seems we always update the cqring
> pointer after all the TWs in the list are done.
I suddenly realized sqpoll mode does submissions and completions
in different threads, and in this situation this patchset always
first commit_cqring() after handling TWs in priority list.
So this is the right test, do I miss something?
>>
>> Also interesting to find an explanation for you numbers assuming
> The reason may be what I said above, but I don't have a
> strict proof now.
>> they're stable. 7/8 batching? How often it does it go this path?
>> If only one task submits requests it should already be covered
>> with existing batching.
> the problem of the existing batch is(given there is only
> one ctx):
> 1. we flush it after all the TWs done
> 2. we batch them if we have uring lock.
> the new batch is:
> 1. don't care about uring lock
> 2. we can flush the completions in the priority list
>     in advance.(which means userland can see it earlier.)
>>
>>
>>> ioengine=io_uring
>>> sqpoll=1
>>> thread=1
>>> bs=4k
>>> direct=1
>>> rw=randread
>>> time_based=1
>>> runtime=600
>>> randrepeat=0
>>> group_reporting=1
>>> filename=/dev/nvme0n1
>>>
>>> 2/8 set unlimited priority_task_list, 8/8 set a limitation of
>>> 1/3 * (len_prority_list + len_normal_list), data below:
>>>     depth     no 8/8   include 8/8      before this patchset
>>>      1        7.05         7.82              7.10
>>>      2        8.47         8.48              8.60
>>>      4        10.42        9.99              10.42
>>>      8        13.78        13.13             13.22
>>>      16       27.41        27.92             24.33
>>>      32       49.40        46.16             53.08
>>>      64       102.53       105.68            103.36
>>>      128      196.98       202.76            205.61
>>>      256      372.99       375.61            414.88
>>>      512      747.23       763.95            791.30
>>>      1024     1472.59      1527.46           1538.72
>>>      2048     3153.49      3129.22           3329.01
>>>      4096     6387.86      5899.74           6682.54
>>>      8192     12150.25     12433.59          12774.14
>>>      16384    23085.58     24342.84          26044.71
>>>
>>> It seems 2/8 is better, haven't tried other choices other than 1/3,
>>> still put 8/8 here for people's further thoughts.
>>>
>>> Hao Xu (8):
>>>    io-wq: add helper to merge two wq_lists
>>>    io_uring: add a priority tw list for irq completion work
>>>    io_uring: add helper for task work execution code
>>>    io_uring: split io_req_complete_post() and add a helper
>>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>>    io_uring: add nr_ctx to record the number of ctx in a task
>>>    io_uring: batch completion in prior_task_list
>>>    io_uring: add limited number of TWs to priority task list
>>>
>>>   fs/io-wq.h    |  21 +++++++
>>>   fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
>>>   2 files changed, 138 insertions(+), 51 deletions(-)
>>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-5.16 v3 0/8] task work optimization
  2021-10-28 11:46     ` Hao Xu
@ 2021-10-28 19:08       ` Pavel Begunkov
  2021-10-29  6:18         ` Hao Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2021-10-28 19:08 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 10/28/21 12:46, Hao Xu wrote:
> 在 2021/10/28 下午2:07, Hao Xu 写道:
>> 在 2021/10/28 上午2:15, Pavel Begunkov 写道:
>>> On 10/27/21 15:02, Hao Xu wrote:
>>>> Tested this patchset by manually replace __io_queue_sqe() in
>>>> io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
>>>> Then test with fio:
>>>
>>> If submissions and completions are done by the same task it doesn't
>>> really matter in which order they're executed because the task won't
>>> get back to userspace execution to see CQEs until tw returns.
>> It may matter, it depends on the time cost of submittion
>> and the DMA IO time. Pick up sqpoll mode as example,
>> we submit 10 reqs:
>> t1          io_submit_sqes
>>              -->io_req_task_queue
>> t2          io_task_work_run
>> we actually do the submittion in t2,  but if the workload
>> is big engough, the 'irq completion TW' will be inserted
>> to the TW list after t2 is fully done, then those
>> 'irq completion TW' will be delayed to the next round.
>> With this patchset, we can handle them first.
>>> Furthermore, it even might be worse because the earlier you submit
>>> the better with everything else equal.
>>>
>>> IIRC, that's how it's with fio, right? If so, you may get better
>>> numbers with a test that does submissions and completions in
>>> different threads.
>> Because of the completion cache, I doubt if it works.
>> For single ctx, it seems we always update the cqring
>> pointer after all the TWs in the list are done.
> I suddenly realized sqpoll mode does submissions and completions
> in different threads, and in this situation this patchset always
> first commit_cqring() after handling TWs in priority list.
> So this is the right test, do I miss something?

Yep, should be it. So the scope of the feature is SQPOLL or
completion/submission with different tasks.

>>>
>>> Also interesting to find an explanation for you numbers assuming
>> The reason may be what I said above, but I don't have a
>> strict proof now.
>>> they're stable. 7/8 batching? How often it does it go this path?
>>> If only one task submits requests it should already be covered
>>> with existing batching.
>> the problem of the existing batch is(given there is only
>> one ctx):
>> 1. we flush it after all the TWs done
>> 2. we batch them if we have uring lock.
>> the new batch is:
>> 1. don't care about uring lock
>> 2. we can flush the completions in the priority list
>>     in advance.(which means userland can see it earlier.)
>>>
>>>
>>>> ioengine=io_uring
>>>> sqpoll=1
>>>> thread=1
>>>> bs=4k
>>>> direct=1
>>>> rw=randread
>>>> time_based=1
>>>> runtime=600
>>>> randrepeat=0
>>>> group_reporting=1
>>>> filename=/dev/nvme0n1
>>>>
>>>> 2/8 set unlimited priority_task_list, 8/8 set a limitation of
>>>> 1/3 * (len_prority_list + len_normal_list), data below:
>>>>     depth     no 8/8   include 8/8      before this patchset
>>>>      1        7.05         7.82              7.10
>>>>      2        8.47         8.48              8.60
>>>>      4        10.42        9.99              10.42
>>>>      8        13.78        13.13             13.22
>>>>      16       27.41        27.92             24.33
>>>>      32       49.40        46.16             53.08
>>>>      64       102.53       105.68            103.36
>>>>      128      196.98       202.76            205.61
>>>>      256      372.99       375.61            414.88
>>>>      512      747.23       763.95            791.30
>>>>      1024     1472.59      1527.46           1538.72
>>>>      2048     3153.49      3129.22           3329.01
>>>>      4096     6387.86      5899.74           6682.54
>>>>      8192     12150.25     12433.59          12774.14
>>>>      16384    23085.58     24342.84          26044.71
>>>>
>>>> It seems 2/8 is better, haven't tried other choices other than 1/3,
>>>> still put 8/8 here for people's further thoughts.
>>>>
>>>> Hao Xu (8):
>>>>    io-wq: add helper to merge two wq_lists
>>>>    io_uring: add a priority tw list for irq completion work
>>>>    io_uring: add helper for task work execution code
>>>>    io_uring: split io_req_complete_post() and add a helper
>>>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>>>    io_uring: add nr_ctx to record the number of ctx in a task
>>>>    io_uring: batch completion in prior_task_list
>>>>    io_uring: add limited number of TWs to priority task list
>>>>
>>>>   fs/io-wq.h    |  21 +++++++
>>>>   fs/io_uring.c | 168 +++++++++++++++++++++++++++++++++++---------------
>>>>   2 files changed, 138 insertions(+), 51 deletions(-)
>>>>
>>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-5.16 v3 0/8] task work optimization
  2021-10-28 19:08       ` Pavel Begunkov
@ 2021-10-29  6:18         ` Hao Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Hao Xu @ 2021-10-29  6:18 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

在 2021/10/29 上午3:08, Pavel Begunkov 写道:
> On 10/28/21 12:46, Hao Xu wrote:
>> 在 2021/10/28 下午2:07, Hao Xu 写道:
>>> 在 2021/10/28 上午2:15, Pavel Begunkov 写道:
>>>> On 10/27/21 15:02, Hao Xu wrote:
>>>>> Tested this patchset by manually replace __io_queue_sqe() in
>>>>> io_queue_sqe() by io_req_task_queue() to construct 'heavy' task works.
>>>>> Then test with fio:
>>>>
>>>> If submissions and completions are done by the same task it doesn't
>>>> really matter in which order they're executed because the task won't
>>>> get back to userspace execution to see CQEs until tw returns.
>>> It may matter, it depends on the time cost of submittion
>>> and the DMA IO time. Pick up sqpoll mode as example,
>>> we submit 10 reqs:
>>> t1          io_submit_sqes
>>>              -->io_req_task_queue
>>> t2          io_task_work_run
>>> we actually do the submittion in t2,  but if the workload
>>> is big engough, the 'irq completion TW' will be inserted
>>> to the TW list after t2 is fully done, then those
>>> 'irq completion TW' will be delayed to the next round.
>>> With this patchset, we can handle them first.
>>>> Furthermore, it even might be worse because the earlier you submit
>>>> the better with everything else equal.
>>>>
>>>> IIRC, that's how it's with fio, right? If so, you may get better
>>>> numbers with a test that does submissions and completions in
>>>> different threads.
>>> Because of the completion cache, I doubt if it works.
>>> For single ctx, it seems we always update the cqring
>>> pointer after all the TWs in the list are done.
>> I suddenly realized sqpoll mode does submissions and completions
>> in different threads, and in this situation this patchset always
>> first commit_cqring() after handling TWs in priority list.
>> So this is the right test, do I miss something?
> 
> Yep, should be it. So the scope of the feature is SQPOLL or
> completion/submission with different tasks.
 From the test results, it's a bit risk to apply this feature to
normal mode(no good, but have to ensure no bad), so I'd like to
apply it to sqpoll mode for now. For completion/submission
decoupled situation, maybe we can include it later.
> 
>>>>
>>>> Also interesting to find an explanation for you numbers assuming
>>> The reason may be what I said above, but I don't have a
>>> strict proof now.
>>>> they're stable. 7/8 batching? How often it does it go this path?
>>>> If only one task submits requests it should already be covered
>>>> with existing batching.
>>> the problem of the existing batch is(given there is only
>>> one ctx):
>>> 1. we flush it after all the TWs done
>>> 2. we batch them if we have uring lock.
>>> the new batch is:
>>> 1. don't care about uring lock
>>> 2. we can flush the completions in the priority list
>>>     in advance.(which means userland can see it earlier.)
>>>>
>>>>
>>>>> ioengine=io_uring
>>>>> sqpoll=1
>>>>> thread=1
>>>>> bs=4k
>>>>> direct=1
>>>>> rw=randread
>>>>> time_based=1
>>>>> runtime=600
>>>>> randrepeat=0
>>>>> group_reporting=1
>>>>> filename=/dev/nvme0n1
>>>>>
>>>>> 2/8 set unlimited priority_task_list, 8/8 set a limitation of
>>>>> 1/3 * (len_prority_list + len_normal_list), data below:
>>>>>     depth     no 8/8   include 8/8      before this patchset
>>>>>      1        7.05         7.82              7.10
>>>>>      2        8.47         8.48              8.60
>>>>>      4        10.42        9.99              10.42
>>>>>      8        13.78        13.13             13.22
>>>>>      16       27.41        27.92             24.33
>>>>>      32       49.40        46.16             53.08
>>>>>      64       102.53       105.68            103.36
>>>>>      128      196.98       202.76            205.61
>>>>>      256      372.99       375.61            414.88
>>>>>      512      747.23       763.95            791.30
>>>>>      1024     1472.59      1527.46           1538.72
>>>>>      2048     3153.49      3129.22           3329.01
>>>>>      4096     6387.86      5899.74           6682.54
>>>>>      8192     12150.25     12433.59          12774.14
>>>>>      16384    23085.58     24342.84          26044.71
>>>>>
>>>>> It seems 2/8 is better, haven't tried other choices other than 1/3,
>>>>> still put 8/8 here for people's further thoughts.
>>>>>
>>>>> Hao Xu (8):
>>>>>    io-wq: add helper to merge two wq_lists
>>>>>    io_uring: add a priority tw list for irq completion work
>>>>>    io_uring: add helper for task work execution code
>>>>>    io_uring: split io_req_complete_post() and add a helper
>>>>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>>>>    io_uring: add nr_ctx to record the number of ctx in a task
>>>>>    io_uring: batch completion in prior_task_list
>>>>>    io_uring: add limited number of TWs to priority task list
>>>>>
>>>>>   fs/io-wq.h    |  21 +++++++
>>>>>   fs/io_uring.c | 168 
>>>>> +++++++++++++++++++++++++++++++++++---------------
>>>>>   2 files changed, 138 insertions(+), 51 deletions(-)
>>>>>
>>>>
>>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-10-29  6:18 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-27 14:02 [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
2021-10-27 14:02 ` [PATCH 1/8] io-wq: add helper to merge two wq_lists Hao Xu
2021-10-27 14:02 ` [PATCH 2/8] io_uring: add a priority tw list for irq completion work Hao Xu
2021-10-27 14:02 ` [PATCH 3/8] io_uring: add helper for task work execution code Hao Xu
2021-10-27 14:02 ` [PATCH 4/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
2021-10-27 14:02 ` [PATCH 5/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
2021-10-27 14:02 ` [PATCH 6/8] io_uring: add nr_ctx to record the number of ctx in a task Hao Xu
2021-10-28  6:27   ` Hao Xu
2021-10-27 14:02 ` [PATCH 7/8] io_uring: batch completion in prior_task_list Hao Xu
2021-10-27 14:02 ` [PATCH 8/8] io_uring: add limited number of TWs to priority task list Hao Xu
2021-10-27 16:39 ` [PATCH for-5.16 v3 0/8] task work optimization Hao Xu
2021-10-27 18:15 ` Pavel Begunkov
2021-10-28  6:07   ` Hao Xu
2021-10-28 11:46     ` Hao Xu
2021-10-28 19:08       ` Pavel Begunkov
2021-10-29  6:18         ` Hao Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.