io-uring.vger.kernel.org archive mirror
* [RFC 0/2] io_task_work optimization
@ 2021-08-23 18:36 Hao Xu
  2021-08-23 18:36 ` [PATCH 1/2] io_uring: run task_work when sqthread is woken up Hao Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Hao Xu @ 2021-08-23 18:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Running task_work may not be a big bottleneck right now, but it never hurts
to move it forward a little bit.
I'm trying to construct tests that prove it is better in the cases where it
theoretically should be. So far I can only show that it is no worse (and
sometimes a little better) in fio tests, so I'm posting it here for comments
and suggestions.

Hao Xu (2):
  io_uring: run task_work when sqthread is woken up
  io_uring: add irq completion work to the head of task_list

 fs/io-wq.h    |  9 +++++++++
 fs/io_uring.c | 23 ++++++++++++++---------
 2 files changed, 23 insertions(+), 9 deletions(-)

-- 
2.24.4



* [PATCH 1/2] io_uring: run task_work when sqthread is woken up
  2021-08-23 18:36 [RFC 0/2] io_task_work optimization Hao Xu
@ 2021-08-23 18:36 ` Hao Xu
  2021-08-23 18:36 ` [PATCH 2/2] io_uring: add irq completion work to the head of task_list Hao Xu
  2021-08-25 15:58 ` [RFC 0/2] io_task_work optimization Jens Axboe
  2 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-08-23 18:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

The sqthread may be woken up because task_work was added, and since we are
now using task_work heavily, let's run it as soon as possible.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io_uring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c3bd2b3fc46b..8172f5f893ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6962,6 +6962,8 @@ static int io_sq_thread(void *data)
 		}
 
 		finish_wait(&sqd->wait, &wait);
+		if (current->task_works)
+			io_run_task_work();
 		timeout = jiffies + sqd->sq_thread_idle;
 	}
 
-- 
2.24.4



* [PATCH 2/2] io_uring: add irq completion work to the head of task_list
  2021-08-23 18:36 [RFC 0/2] io_task_work optimization Hao Xu
  2021-08-23 18:36 ` [PATCH 1/2] io_uring: run task_work when sqthread is woken up Hao Xu
@ 2021-08-23 18:36 ` Hao Xu
  2021-08-23 18:41   ` Hao Xu
  2021-08-24 12:57   ` Pavel Begunkov
  2021-08-25 15:58 ` [RFC 0/2] io_task_work optimization Jens Axboe
  2 siblings, 2 replies; 11+ messages in thread
From: Hao Xu @ 2021-08-23 18:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, and some of them just complete a req
and generate a cqe. Let's put that kind of work at the head of the
task_list so that it can be handled quickly, and thus reduce the average
req latency. An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
---
 fs/io-wq.h    |  9 +++++++++
 fs/io_uring.c | 21 ++++++++++++---------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 308af3928424..51b4408fd177 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -41,6 +41,15 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 		list->last = node;
 }
 
+static inline void wq_list_add_head(struct io_wq_work_node *node,
+				    struct io_wq_work_list *list)
+{
+	node->next = list->first;
+	list->first = node;
+	if (!node->next)
+		list->last = node;
+}
+
 static inline void wq_list_add_tail(struct io_wq_work_node *node,
 				    struct io_wq_work_list *list)
 {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8172f5f893ad..954cd8583945 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2050,7 +2050,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
@@ -2062,7 +2062,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (emergency)
+		wq_list_add_head(&req->io_task_work.node, &tctx->task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2122,19 +2125,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, true);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2249,7 +2252,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2564,7 +2567,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, true);
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -4881,7 +4884,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -6430,7 +6433,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
-- 
2.24.4



* Re: [PATCH 2/2] io_uring: add irq completion work to the head of task_list
  2021-08-23 18:36 ` [PATCH 2/2] io_uring: add irq completion work to the head of task_list Hao Xu
@ 2021-08-23 18:41   ` Hao Xu
  2021-08-24 12:57   ` Pavel Begunkov
  1 sibling, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-08-23 18:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

On 2021/8/24 2:36 AM, Hao Xu wrote:
> Now we have a lot of task_work users, some are just to complete a req
> and generate a cqe. Let's put the work at the head position of the
> task_list, so that it can be handled quickly and thus to reduce
> avg req latency. an explanatory case:
> 
> origin timeline:
>      submit_sqe-->irq-->add completion task_work
>      -->run heavy work0~n-->run completion task_work
> now timeline:
>      submit_sqe-->irq-->add completion task_work
>      -->run completion task_work-->run heavy work0~n
> 
> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Sorry, I sent this old version by mistake; I will send the latest version
later. The latest one adds the high-priority task_work to the front of the
task_list in time order, not in reverse order.
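
For reference, a minimal sketch of what the add side of that could look
like, reusing the wq_list helpers from io-wq.h; the prior_task_list field is
hypothetical and only illustrates the ordering (high-prio work is appended
to its own list, so it stays FIFO among itself while still running first):

static void io_req_task_work_add(struct io_kiocb *req, bool priority)
{
	struct io_uring_task *tctx = req->task->io_uring;
	unsigned long flags;

	spin_lock_irqsave(&tctx->task_lock, flags);
	if (priority)
		/* e.g. IRQ completions: FIFO within the priority list */
		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
	else
		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
	/* task_running bookkeeping and the task_work_add() notification
	 * stay the same as in the existing code */
	spin_unlock_irqrestore(&tctx->task_lock, flags);
}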


* Re: [PATCH 2/2] io_uring: add irq completion work to the head of task_list
  2021-08-23 18:36 ` [PATCH 2/2] io_uring: add irq completion work to the head of task_list Hao Xu
  2021-08-23 18:41   ` Hao Xu
@ 2021-08-24 12:57   ` Pavel Begunkov
  2021-08-25  3:19     ` Hao Xu
  1 sibling, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2021-08-24 12:57 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 8/23/21 7:36 PM, Hao Xu wrote:
> Now we have a lot of task_work users, some are just to complete a req
> and generate a cqe. Let's put the work at the head position of the
> task_list, so that it can be handled quickly and thus to reduce
> avg req latency. an explanatory case:
> 
> origin timeline:
>     submit_sqe-->irq-->add completion task_work
>     -->run heavy work0~n-->run completion task_work
> now timeline:
>     submit_sqe-->irq-->add completion task_work
>     -->run completion task_work-->run heavy work0~n

Might be good. There are not so many hot tw users:
poll, queuing linked requests, and the new IRQ. Could be
BPF in the future.

So, for the test case I'd think about some heavy-ish
submissions linked to your IRQ req. For instance,
keeping a large QD of 

read(IRQ-based) -> linked read_pipe(PAGE_SIZE);

and running it for a while, so they get completely
out of sync and tw works really mix up. It reads
from pipes size<=PAGE_SIZE, so it completes inline,
but the copy takes enough of time.
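
A rough liburing sketch of that kind of loop, purely as an illustration (the
device path, sizes and queue depth are placeholders, the pipe writer and all
error handling are omitted, and the buffers are deliberately shared since
only the timing matters):

#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define QD	64

static void queue_pair(struct io_uring *ring, int dfd, int pfd,
		       void *dbuf, void *pbuf)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, dfd, dbuf, 512, 0);	/* IRQ-based completion */
	sqe->flags |= IOSQE_IO_LINK;			/* chain the pipe read */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, pfd, pbuf, 4096, 0);	/* completes inline, copies */
}

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int dfd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);	/* placeholder */
	int pipefd[2];
	void *dbuf, *pbuf = malloc(4096);

	posix_memalign(&dbuf, 512, 512);
	pipe(pipefd);		/* a separate writer keeps pipefd[1] topped up */
	io_uring_queue_init(2 * QD, &ring, 0);

	for (int i = 0; i < QD; i++)
		queue_pair(&ring, dfd, pipefd[0], dbuf, pbuf);
	io_uring_submit(&ring);

	for (;;) {
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
		/* resubmit a pair every two completions to keep the QD stable */
	}
}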

One thing is that Jens specifically wanted tw's to
be in FIFO order, where IRQ based will be in LIFO.
I don't think it's a real problem though, the
completion handler should be brief enough.

> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
> ---
>  fs/io-wq.h    |  9 +++++++++
>  fs/io_uring.c | 21 ++++++++++++---------
>  2 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 308af3928424..51b4408fd177 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -41,6 +41,15 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>  		list->last = node;
>  }
>  
> +static inline void wq_list_add_head(struct io_wq_work_node *node,
> +				    struct io_wq_work_list *list)
> +{
> +	node->next = list->first;
> +	list->first = node;
> +	if (!node->next)
> +		list->last = node;
> +}
> +
>  static inline void wq_list_add_tail(struct io_wq_work_node *node,
>  				    struct io_wq_work_list *list)
>  {
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 8172f5f893ad..954cd8583945 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2050,7 +2050,7 @@ static void tctx_task_work(struct callback_head *cb)
>  	ctx_flush_and_put(ctx);
>  }
>  
> -static void io_req_task_work_add(struct io_kiocb *req)
> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>  {
>  	struct task_struct *tsk = req->task;
>  	struct io_uring_task *tctx = tsk->io_uring;
> @@ -2062,7 +2062,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
>  	WARN_ON_ONCE(!tctx);
>  
>  	spin_lock_irqsave(&tctx->task_lock, flags);
> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
> +	if (emergency)
> +		wq_list_add_head(&req->io_task_work.node, &tctx->task_list);
> +	else
> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>  	running = tctx->task_running;
>  	if (!running)
>  		tctx->task_running = true;
> @@ -2122,19 +2125,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
>  {
>  	req->result = ret;
>  	req->io_task_work.func = io_req_task_cancel;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, true);
>  }
>  
>  static void io_req_task_queue(struct io_kiocb *req)
>  {
>  	req->io_task_work.func = io_req_task_submit;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>  }
>  
>  static void io_req_task_queue_reissue(struct io_kiocb *req)
>  {
>  	req->io_task_work.func = io_queue_async_work;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>  }
>  
>  static inline void io_queue_next(struct io_kiocb *req)
> @@ -2249,7 +2252,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
>  {
>  	if (req_ref_put_and_test(req)) {
>  		req->io_task_work.func = io_free_req;
> -		io_req_task_work_add(req);
> +		io_req_task_work_add(req, false);
>  	}
>  }
>  
> @@ -2564,7 +2567,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
>  		return;
>  	req->result = res;
>  	req->io_task_work.func = io_req_task_complete;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, true);
>  }
>  
>  static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
> @@ -4881,7 +4884,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
>  	 * of executing it. We can't safely execute it anyway, as we may not
>  	 * have the needed state needed for it anyway.
>  	 */
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>  	return 1;
>  }
>  
> @@ -6430,7 +6433,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
>  	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
>  
>  	req->io_task_work.func = io_req_task_link_timeout;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>  	return HRTIMER_NORESTART;
>  }
>  
> 

-- 
Pavel Begunkov


* Re: [PATCH 2/2] io_uring: add irq completion work to the head of task_list
  2021-08-24 12:57   ` Pavel Begunkov
@ 2021-08-25  3:19     ` Hao Xu
  2021-08-25 11:18       ` Pavel Begunkov
  0 siblings, 1 reply; 11+ messages in thread
From: Hao Xu @ 2021-08-25  3:19 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/8/24 8:57 PM, Pavel Begunkov wrote:
> On 8/23/21 7:36 PM, Hao Xu wrote:
>> Now we have a lot of task_work users, some are just to complete a req
>> and generate a cqe. Let's put the work at the head position of the
>> task_list, so that it can be handled quickly and thus to reduce
>> avg req latency. an explanatory case:
>>
>> origin timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run heavy work0~n-->run completion task_work
>> now timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run completion task_work-->run heavy work0~n
> 
> Might be good. There are not so many hot tw users:
> poll, queuing linked requests, and the new IRQ. Could be
> BPF in the future.
Async buffered reads as well, given that buffered reads are a hot
operation.
> 
> So, for the test case I'd think about some heavy-ish
> submissions linked to your IRQ req. For instance,
> keeping a large QD of
> 
> read(IRQ-based) -> linked read_pipe(PAGE_SIZE);
> 
> and running it for a while, so they get completely
> out of sync and tw works really mix up. It reads
> from pipes size<=PAGE_SIZE, so it completes inline,
> but the copy takes enough of time.
Thanks Pavel. Previously I tried
direct read-->buffered read (async buffered read)
and didn't see much difference. I'll try the case you suggested above.
> 
> One thing is that Jens specifically wanted tw's to
> be in FIFO order, where IRQ based will be in LIFO.
> I don't think it's a real problem though, the
> completion handler should be brief enough.
In my latest code, the IRQ-based tw are also FIFO among themselves;
it is only LIFO between the IRQ-based tw and the other tw:
timeline:  tw1 tw2 irq1 irq2
task_list: irq1 irq2 tw1 tw2
> 
>> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
>> ---
>>   fs/io-wq.h    |  9 +++++++++
>>   fs/io_uring.c | 21 ++++++++++++---------
>>   2 files changed, 21 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/io-wq.h b/fs/io-wq.h
>> index 308af3928424..51b4408fd177 100644
>> --- a/fs/io-wq.h
>> +++ b/fs/io-wq.h
>> @@ -41,6 +41,15 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>>   		list->last = node;
>>   }
>>   
>> +static inline void wq_list_add_head(struct io_wq_work_node *node,
>> +				    struct io_wq_work_list *list)
>> +{
>> +	node->next = list->first;
>> +	list->first = node;
>> +	if (!node->next)
>> +		list->last = node;
>> +}
>> +
>>   static inline void wq_list_add_tail(struct io_wq_work_node *node,
>>   				    struct io_wq_work_list *list)
>>   {
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 8172f5f893ad..954cd8583945 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2050,7 +2050,7 @@ static void tctx_task_work(struct callback_head *cb)
>>   	ctx_flush_and_put(ctx);
>>   }
>>   
>> -static void io_req_task_work_add(struct io_kiocb *req)
>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>>   {
>>   	struct task_struct *tsk = req->task;
>>   	struct io_uring_task *tctx = tsk->io_uring;
>> @@ -2062,7 +2062,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>   	WARN_ON_ONCE(!tctx);
>>   
>>   	spin_lock_irqsave(&tctx->task_lock, flags);
>> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	if (emergency)
>> +		wq_list_add_head(&req->io_task_work.node, &tctx->task_list);
>> +	else
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>   	running = tctx->task_running;
>>   	if (!running)
>>   		tctx->task_running = true;
>> @@ -2122,19 +2125,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
>>   {
>>   	req->result = ret;
>>   	req->io_task_work.func = io_req_task_cancel;
>> -	io_req_task_work_add(req);
>> +	io_req_task_work_add(req, true);
>>   }
>>   
>>   static void io_req_task_queue(struct io_kiocb *req)
>>   {
>>   	req->io_task_work.func = io_req_task_submit;
>> -	io_req_task_work_add(req);
>> +	io_req_task_work_add(req, false);
>>   }
>>   
>>   static void io_req_task_queue_reissue(struct io_kiocb *req)
>>   {
>>   	req->io_task_work.func = io_queue_async_work;
>> -	io_req_task_work_add(req);
>> +	io_req_task_work_add(req, false);
>>   }
>>   
>>   static inline void io_queue_next(struct io_kiocb *req)
>> @@ -2249,7 +2252,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
>>   {
>>   	if (req_ref_put_and_test(req)) {
>>   		req->io_task_work.func = io_free_req;
>> -		io_req_task_work_add(req);
>> +		io_req_task_work_add(req, false);
>>   	}
>>   }
>>   
>> @@ -2564,7 +2567,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
>>   		return;
>>   	req->result = res;
>>   	req->io_task_work.func = io_req_task_complete;
>> -	io_req_task_work_add(req);
>> +	io_req_task_work_add(req, true);
>>   }
>>   
>>   static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
>> @@ -4881,7 +4884,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
>>   	 * of executing it. We can't safely execute it anyway, as we may not
>>   	 * have the needed state needed for it anyway.
>>   	 */
>> -	io_req_task_work_add(req);
>> +	io_req_task_work_add(req, false);
>>   	return 1;
>>   }
>>   
>> @@ -6430,7 +6433,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
>>   	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
>>   
>>   	req->io_task_work.func = io_req_task_link_timeout;
>> -	io_req_task_work_add(req);
>> +	io_req_task_work_add(req, false);
>>   	return HRTIMER_NORESTART;
>>   }
>>   
>>
> 



* Re: [PATCH 2/2] io_uring: add irq completion work to the head of task_list
  2021-08-25  3:19     ` Hao Xu
@ 2021-08-25 11:18       ` Pavel Begunkov
  0 siblings, 0 replies; 11+ messages in thread
From: Pavel Begunkov @ 2021-08-25 11:18 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 8/25/21 4:19 AM, Hao Xu wrote:
> On 2021/8/24 8:57 PM, Pavel Begunkov wrote:
>> On 8/23/21 7:36 PM, Hao Xu wrote:
>>> Now we have a lot of task_work users, some are just to complete a req
>>> and generate a cqe. Let's put the work at the head position of the
>>> task_list, so that it can be handled quickly and thus to reduce
>>> avg req latency. an explanatory case:
>>>
>>> origin timeline:
>>>      submit_sqe-->irq-->add completion task_work
>>>      -->run heavy work0~n-->run completion task_work
>>> now timeline:
>>>      submit_sqe-->irq-->add completion task_work
>>>      -->run completion task_work-->run heavy work0~n
>>
>> Might be good. There are not so many hot tw users:
>> poll, queuing linked requests, and the new IRQ. Could be
>> BPF in the future.
> async buffered reads as well, regarding buffered reads is
> hot operation.

Good case as well, I forgot about it. It should not be so hot, though, as
it only applies when reads are served out of the buffer cache.


>> So, for the test case I'd think about some heavy-ish
>> submissions linked to your IRQ req. For instance,
>> keeping a large QD of
>>
>> read(IRQ-based) -> linked read_pipe(PAGE_SIZE);
>>
>> and running it for a while, so they get completely
>> out of sync and tw works really mix up. It reads
>> from pipes size<=PAGE_SIZE, so it completes inline,
>> but the copy takes enough of time.
> Thanks Pavel, previously I tried
> direct read-->buffered read(async buffered read)
> didn't see much difference. I'll try the above case
> you offered.

Hmm, considering that pipes have to be refilled, buffered reads
may be a better option. I'd make them all read the same page,
+ registered buffer + reg file. And then it'd probably depend on
how fast your main SSD is.

mem = malloc_align(4096);
io_uring_register_buffer(mem, 4096);
// preferably another disk/SSD from the fast one
fd2 = open("./file");
// loop
read(fast_ssd, DIRECT, 512) -> read(fd2, fixed_buf, 4096)

It will be interesting to see what that yields. With buffered reads it could
probably be bumped to 2 * PAGE_SIZE or even slightly more, to increase the
heavy part.

Btw, I'd look at the latency distribution (90%, 99%) as well; it may take
the worst hit.
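
A sketch of how the registered-buffer / registered-file variant might be
wired up with liburing, continuing the earlier loop sketch (same includes);
the file path is a placeholder and only the setup and SQE prep are shown:

static void *fixed_buf;

static void setup_fixed(struct io_uring *ring)
{
	struct iovec iov;
	int fds[1];

	posix_memalign(&fixed_buf, 4096, 4096);
	iov.iov_base = fixed_buf;
	iov.iov_len = 4096;
	io_uring_register_buffers(ring, &iov, 1);	/* fixed buffer index 0 */

	fds[0] = open("./file", O_RDONLY);	/* preferably a different disk */
	io_uring_register_files(ring, fds, 1);	/* fixed file index 0 */
}

static void queue_pair_fixed(struct io_uring *ring, int dfd, void *dbuf)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, dfd, dbuf, 512, 0);	/* O_DIRECT, IRQ-based */
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	/* fd 0 = registered file slot 0, last arg 0 = fixed buffer index */
	io_uring_prep_read_fixed(sqe, 0, fixed_buf, 4096, 0, 0);
	sqe->flags |= IOSQE_FIXED_FILE;
}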

>>
>> One thing is that Jens specifically wanted tw's to
>> be in FIFO order, where IRQ based will be in LIFO.
>> I don't think it's a real problem though, the
>> completion handler should be brief enough.
> In my latest code, the IRQ-based tw are also FIFO among themselves;
> it is only LIFO between the IRQ-based tw and the other tw:
> timeline:  tw1 tw2 irq1 irq2
> task_list: irq1 irq2 tw1 tw2
>>
-- 
Pavel Begunkov


* Re: [RFC 0/2] io_task_work optimization
  2021-08-23 18:36 [RFC 0/2] io_task_work optimization Hao Xu
  2021-08-23 18:36 ` [PATCH 1/2] io_uring: run task_work when sqthread is woken up Hao Xu
  2021-08-23 18:36 ` [PATCH 2/2] io_uring: add irq completion work to the head of task_list Hao Xu
@ 2021-08-25 15:58 ` Jens Axboe
  2021-08-25 16:39   ` Hao Xu
  2 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2021-08-25 15:58 UTC (permalink / raw)
  To: Hao Xu; +Cc: io-uring, Pavel Begunkov, Joseph Qi

On 8/23/21 12:36 PM, Hao Xu wrote:
> running task_work may not be a big bottleneck now, but it's never worse
> to make it move forward a little bit.
> I'm trying to construct tests to prove it is better in some cases where
> it should be theoretically.
> Currently only prove it is not worse by running fio tests(sometimes a
> little bit better). So just put it here for comments and suggestion.

I think this is interesting, particularly for areas where we have a mix
of task_work uses because obviously it won't really matter if the
task_work being run is homogeneous.

That said, would be nice to have some numbers associated with it. We
have a few classes of types of task_work:

1) Work completes really fast, we want to just do those first
2) Work is pretty fast, like async buffered read copy
3) Work is more expensive, might require a full retry of the operation

Might make sense to make this classification explicit. Problem is, with
any kind of scheduling like that, you risk introducing latency bubbles
because the prio1 list grows really fast, for example.
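
One possible shape for making the classification explicit, sketched against
io_req_task_work_add() from this series; the enum is hypothetical and purely
illustrative:

enum io_tw_class {
	IO_TW_COMPLETE,	/* 1) completes really fast: just posts a cqe */
	IO_TW_FAST,	/* 2) pretty fast: e.g. async buffered read copy */
	IO_TW_HEAVY,	/* 3) expensive: may fully retry the operation */
};

/* e.g. io_complete_rw() would pass IO_TW_COMPLETE and io_req_task_queue()
 * IO_TW_HEAVY, instead of the bool in patch 2: */
static void io_req_task_work_add(struct io_kiocb *req, enum io_tw_class tw_class);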

-- 
Jens Axboe



* Re: [RFC 0/2] io_task_work optimization
  2021-08-25 15:58 ` [RFC 0/2] io_task_work optimization Jens Axboe
@ 2021-08-25 16:39   ` Hao Xu
  2021-08-25 16:46     ` Pavel Begunkov
  0 siblings, 1 reply; 11+ messages in thread
From: Hao Xu @ 2021-08-25 16:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

On 2021/8/25 11:58 PM, Jens Axboe wrote:
> On 8/23/21 12:36 PM, Hao Xu wrote:
>> running task_work may not be a big bottleneck now, but it's never worse
>> to make it move forward a little bit.
>> I'm trying to construct tests to prove it is better in some cases where
>> it should be theoretically.
>> Currently only prove it is not worse by running fio tests(sometimes a
>> little bit better). So just put it here for comments and suggestion.
> 
> I think this is interesting, particularly for areas where we have a mix
> of task_work uses because obviously it won't really matter if the
> task_work being run is homogeneous.
> 
> That said, would be nice to have some numbers associated with it. We
> have a few classes of types of task_work:
> 
> 1) Work completes really fast, we want to just do those first
> 2) Work is pretty fast, like async buffered read copy
> 3) Work is more expensive, might require a full retry of the operation
> 
> Might make sense to make this classification explicit. Problem is, with
> any kind of scheduling like that, you risk introducing latency bubbles
> because the prio1 list grows really fast, for example.
Yes, this may introduce latency if an overwhelming amount of 1) comes in a
short time. I'll try more tests to see if the problem exists and whether
there is a better way, like putting only a limited number of 1) at the
front. Anyway, I'll update this thread when I get some data.
> 



* Re: [RFC 0/2] io_task_work optimization
  2021-08-25 16:39   ` Hao Xu
@ 2021-08-25 16:46     ` Pavel Begunkov
  2021-08-25 17:26       ` Hao Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2021-08-25 16:46 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 8/25/21 5:39 PM, Hao Xu wrote:
> On 2021/8/25 11:58 PM, Jens Axboe wrote:
>> On 8/23/21 12:36 PM, Hao Xu wrote:
>>> running task_work may not be a big bottleneck now, but it's never worse
>>> to make it move forward a little bit.
>>> I'm trying to construct tests to prove it is better in some cases where
>>> it should be theoretically.
>>> Currently only prove it is not worse by running fio tests(sometimes a
>>> little bit better). So just put it here for comments and suggestion.
>>
>> I think this is interesting, particularly for areas where we have a mix
>> of task_work uses because obviously it won't really matter if the
>> task_work being run is homogeneous.
>>
>> That said, would be nice to have some numbers associated with it. We
>> have a few classes of types of task_work:
>>
>> 1) Work completes really fast, we want to just do those first
>> 2) Work is pretty fast, like async buffered read copy
>> 3) Work is more expensive, might require a full retry of the operation
>>
>> Might make sense to make this classification explicit. Problem is, with
>> any kind of scheduling like that, you risk introducing latency bubbles
>> because the prio1 list grows really fast, for example.
> Yes, this may introduce latency if overwhelming 1) comes in a short time.
> I'll try more tests to see if the problem exists and if there is a
> better way, like put limited number of 1) to the front. Anyway, I'll
> update this thread when I get some data.

Not sure, but it looks like IRQ completion batching is coming in 5.15.
With that you may also want to flush completions after the IRQ sublist is
exhausted.

It may be worth considering having 2 lists in the future.
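
A rough sketch of what the two-list drain could look like on the consumer
side, based on tctx_task_work() in this series; the prior_task_list field
and the exact flush point are hypothetical:

static void tctx_task_work(struct callback_head *cb)
{
	struct io_uring_task *tctx = container_of(cb, struct io_uring_task,
						  task_work);
	struct io_wq_work_list prior, normal;
	unsigned long flags;

	spin_lock_irqsave(&tctx->task_lock, flags);
	prior = tctx->prior_task_list;	/* hypothetical: IRQ completions, FIFO */
	normal = tctx->task_list;
	INIT_WQ_LIST(&tctx->prior_task_list);
	INIT_WQ_LIST(&tctx->task_list);
	spin_unlock_irqrestore(&tctx->task_lock, flags);

	/* walk 'prior' with the existing node loop ... */
	/* ... flush the batched cqes (ctx_flush_and_put()) here, before
	 * starting on the heavier work ... */
	/* ... then walk 'normal' the same way */
}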

-- 
Pavel Begunkov


* Re: [RFC 0/2] io_task_work optimization
  2021-08-25 16:46     ` Pavel Begunkov
@ 2021-08-25 17:26       ` Hao Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Hao Xu @ 2021-08-25 17:26 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/8/26 12:46 AM, Pavel Begunkov wrote:
> On 8/25/21 5:39 PM, Hao Xu wrote:
>> On 2021/8/25 11:58 PM, Jens Axboe wrote:
>>> On 8/23/21 12:36 PM, Hao Xu wrote:
>>>> running task_work may not be a big bottleneck now, but it's never worse
>>>> to make it move forward a little bit.
>>>> I'm trying to construct tests to prove it is better in some cases where
>>>> it should be theoretically.
>>>> Currently only prove it is not worse by running fio tests(sometimes a
>>>> little bit better). So just put it here for comments and suggestion.
>>>
>>> I think this is interesting, particularly for areas where we have a mix
>>> of task_work uses because obviously it won't really matter if the
>>> task_work being run is homogeneous.
>>>
>>> That said, would be nice to have some numbers associated with it. We
>>> have a few classes of types of task_work:
>>>
>>> 1) Work completes really fast, we want to just do those first
>>> 2) Work is pretty fast, like async buffered read copy
>>> 3) Work is more expensive, might require a full retry of the operation
>>>
>>> Might make sense to make this classification explicit. Problem is, with
>>> any kind of scheduling like that, you risk introducing latency bubbles
>>> because the prio1 list grows really fast, for example.
>> Yes, this may introduce latency if overwhelming 1) comes in a short time.
>> I'll try more tests to see if the problem exists and if there is a
>> better way, like put limited number of 1) to the front. Anyway, I'll
>> update this thread when I get some data.
> 
> Not sure, but it looks that IRQ completion batching is coming to
> 5.15. With that you may also want to flush completions after the
> IRQ sublist is exhausted.
> 
> May be worth to consider having 2 lists in the future
I'll think about that; there may also be a way to reduce the lock cost if
there are multiple lists.
> 



